Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation

Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, Jingyun Liao, Yi-Ming Cheng, Xuefeng Chen, Xian-Ling Mao, Yousheng Feng

Main category: cs.MM

TL;DR: MTAVG-Bench: A benchmark for evaluating audio-visual multi-speaker dialogue generation in T2AV models, addressing gaps in existing evaluation for multi-talker settings.

Details

Motivation: Existing evaluation benchmarks are designed for human-recorded videos or single-speaker settings, failing to capture errors in generated multi-talker dialogue videos like identity drift, unnatural turn transitions, and audio-visual misalignment.

Method: Built via semi-automatic pipeline generating 1.8k videos using multiple popular models with carefully designed prompts, yielding 2.4k manually annotated QA pairs. Evaluates at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression.

Result: Benchmarked 12 proprietary and open-source omni-models, with Gemini 3 Pro achieving strongest overall performance. Leading open-source models remain competitive in signal fidelity and consistency.

Conclusion: MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement in multi-speaker dialogue generation.

Abstract: Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos with multi-participant dialogues. However, existing evaluation benchmarks remain largely designed for human-recorded videos or single-speaker settings. As a result, potential errors that occur in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively captured and analyzed. To address this issue, we introduce MTAVG-Bench, a benchmark for evaluating audio-visual multi-speaker dialogue generation. MTAVG-Bench is built via a semi-automatic pipeline, where 1.8k videos are generated using multiple popular models with carefully designed prompts, yielding 2.4k manually annotated QA pairs. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench, with Gemini 3 Pro achieving the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement.

Relevance: 9/10

Mohamed Saleh, Zahra Ahmadi

Main category: cs.MM

TL;DR: CMQKA is a novel cross-modal fusion mechanism with linear complexity that enables hierarchical audio-visual fusion, implemented in SNNergy framework for energy-efficient multimodal processing with state-of-the-art results.

Details

Motivation: Existing audio-visual fusion methods face a trade-off: attention-based methods have quadratic complexity preventing hierarchical architectures, while efficient fusion uses simplistic concatenation that fails to capture complex cross-modal dependencies.

Method: Introduces CMQKA with bidirectional cross-modal Query-Key attention achieving linear O(N) complexity through binary operations, and SNNergy framework with hierarchical architecture using event-driven binary spike operations for energy efficiency.

Result: Achieves state-of-the-art results on audio-visual benchmarks CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines with remarkable energy efficiency.

Conclusion: Advances multimodal fusion by introducing scalable fusion mechanism enabling hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.

Abstract: Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs through progressively decreasing spatial resolutions and increasing semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness and establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.

Relevance: 9/10

[3] LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild

Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan

Main category: cs.SD

TL;DR: LPIPS-AttnWav2Lip: A U-Net-based method for audio-driven talking head generation with improved lip synchronization using residual CBAM, semantic alignment, and LPIPS loss for better audio-visual coherence and image quality.

Details

Motivation: The main challenge in talking head generation is achieving audio-visual coherence between lips and audio (lip synchronization). Researchers need methods that can reconstruct face images of any speaker based on audio with high synchronization accuracy and visual quality.

Method: U-Net architecture with residual CBAM to encode and fuse audio-visual information; semantic alignment module to extend receptive field and match statistical information between visual features and audio latent vectors; LPIPS loss for better image quality and training stability.

Result: The method achieves outstanding performance in lip synchronization accuracy and visual quality, as demonstrated by both subjective and objective evaluations.

Conclusion: LPIPS-AttnWav2Lip provides an effective generic solution for audio-driven talking head generation with improved lip synchronization and image quality through novel architectural components and loss functions.

Abstract: Researchers have shown a growing interest in Audio-driven Talking Head Generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper proposes a generic method, LPIPS-AttnWav2Lip, for reconstructing face images of any speaker based on audio. We used the U-Net architecture based on residual CBAM to better encode and fuse audio and visual modal information. Additionally, the semantic alignment module extends the receptive field of the generator network to obtain the spatial and channel information of the visual features efficiently; and match statistical information of visual features with audio latent vector to achieve the adjustment and injection of the audio content information to the visual information. To achieve exact lip synchronization and to generate realistic high-quality images, our approach adopts LPIPS Loss, which simulates human judgment of image quality and reduces instability possibility during the training process. The proposed method achieves outstanding performance in terms of lip synchronization accuracy and visual quality as demonstrated by subjective and objective evaluation results. The code for the paper is available at the following link: https://github.com/FelixChan9527/LPIPS-AttnWav2Lip

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 296]
cs.CV [Total: 406]
cs.AI [Total: 209]
cs.SD [Total: 27]
cs.LG [Total: 659]
cs.MA [Total: 15]
cs.MM [Total: 5]
eess.AS [Total: 14]
eess.IV [Total: 28]

cs.CL

[1] PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering

MinGyu Jeon, SuWan Cho, JaeYoung Shu

Main category: cs.CL

TL;DR: PPoGA is a novel KGQA framework that introduces self-correction mechanisms for flawed reasoning plans, enabling both path and plan correction to improve robustness in complex question answering.

Details

Motivation: Current LLMs augmented with KGs for complex question answering often fail when their initial high-level reasoning plan is flawed, analogous to cognitive functional fixedness where agents cannot restructure their approach once committed to an unworkable solution.

Method: PPoGA uses a Planner-Executor architecture to separate high-level strategy from low-level execution, incorporates Predictive Processing to anticipate outcomes, and features a self-correction mechanism that performs both Path Correction (local execution errors) and Plan Correction (identifying, discarding, and reformulating entire ineffective plans).

Result: Extensive experiments on three challenging multi-hop KGQA benchmarks (GrailQA, CWQ, WebQSP) show PPoGA achieves state-of-the-art performance, significantly outperforming existing methods.

Conclusion: The work highlights the importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems, demonstrating that self-correction mechanisms can overcome limitations of fixed reasoning plans in KG-augmented LLMs.

Abstract: Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.

[2] Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA

Samuel Thio, Matthew Lewis, Spiros Denaxas, Richard JB Dobson

Main category: cs.CL

TL;DR: MediGRAF is a hybrid Graph RAG system that combines structured Neo4j Text2Cypher queries with vector embeddings for unstructured data to enable comprehensive clinical information retrieval from EHRs.

Details

Motivation: EHR systems overwhelm clinicians with vast clinical data, making critical details easily overlooked. Current LLM solutions for clinical settings face limitations in context grounding and hallucinations, with existing retrieval methods isolating structured (SQL/Cypher) and unstructured (semantic search) approaches rather than integrating both.

Method: MediGRAF combines Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, creating a hybrid Graph RAG system that enables natural language querying of complete patient journeys using MIMIC-IV dataset.

Result: The system achieved 100% recall for factual queries and a mean expert quality score of 4.25/5 for complex inference tasks with zero safety violations, using 10 patients from MIMIC-IV dataset (5,973 nodes, 5,963 relationships).

Conclusion: Hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments by bridging structured and unstructured data retrieval.

Abstract: Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While Large Language Models (LLMs) offer transformative potential for data processing, they face significant limitations in clinical settings, particularly regarding context grounding and hallucinations. Current solutions typically isolate retrieval methods focusing either on structured data (SQL/Cypher) or unstructured semantic search but fail to integrate both simultaneously. This work presents MediGRAF (Medical Graph Retrieval Augmented Framework), a novel hybrid Graph RAG system that bridges this gap. By uniquely combining Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, MediGRAF enables natural language querying of the complete patient journey. Using 10 patients from the MIMIC-IV dataset (generating 5,973 nodes and 5,963 relationships), we generated enough nodes and data for patient level question answering (QA), and we evaluated this architecture across varying query complexities. The system demonstrated 100% recall for factual queries which means all relevant information was retrieved and in the output, while complex inference tasks achieved a mean expert quality score of 4.25/5 with zero safety violations. These results demonstrate that hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments.

[3] G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models

Xun Xu

Main category: cs.CL

TL;DR: G-MemLLM: A memory-augmented LLM architecture with gated latent memory bank for improved long-context reasoning and factual consistency.

Details

Motivation: LLMs struggle with finite context windows and maintaining long-term factual consistency during multi-hop reasoning. Existing methods suffer from context rot or information dilution over long horizons.

Method: Proposes G-MemLLM with frozen LLM backbone + trainable Latent Memory Bank using GRU-style gated update logic to selectively update, preserve, or overwrite memory slots, preventing vanishing gradients.

Result: Significant improvements on HotpotQA and ZsRE benchmarks: 13.3% accuracy boost on ZsRE for Llama 3.1-8B, 8.56 points Answer F1 boost for GPT-2, and 6.89 points Supporting Fact F1 boost for Llama 3.1-8B on HotpotQA.

Conclusion: G-MemLLM effectively enhances multi-hop reasoning and relational precision across model scales by addressing context limitations through gated memory mechanisms.

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. While existing methods utilize context compression or recurrent tokens, they often suffer from ``context rot’’ or the dilution of information over long horizons. In this paper, we propose \textbf{G-MemLLM}, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable \textbf{Latent Memory Bank}. Our key innovation is a GRU-style gated update logic that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the vanishing gradients of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3% accuracy boost on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.

[4] PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems

Jiongchi Yu, Yuhan Ma, Xiaoyu Zhang, Junjie Wang, Qiang Hu, Chao Shen, Xiaofei Xie

Main category: cs.CL

TL;DR: PTCBENCH is a benchmark for evaluating personality consistency in LLMs under different situational contexts, revealing that external scenarios can trigger significant personality changes and affect reasoning capabilities.

Details

Motivation: Existing work overlooks that personality traits are dynamic and context-dependent, which is critical for maintaining consistent and authentic LLM personalities in affective agents and AI systems for user trust and engagement.

Method: Introduces PTCBENCH benchmark that subjects models to 12 distinct external conditions (location contexts and life events) and assesses personality using the NEO Five-Factor Inventory across 39,240 personality trait records.

Result: Certain external scenarios like “Unemployment” can trigger significant personality changes in LLMs and even alter their reasoning capabilities, revealing context-dependent personality instability.

Conclusion: PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering insights for developing robust and psychologically aligned AI systems.

Abstract: With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses the personality using the NEO Five-Factor Inventory. Our study on 39,240 personality trait records reveals that certain external scenarios (e.g., “Unemployment”) can trigger significant personality changes of LLMs, and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.

[5] SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

Benyamin Tabarsi, Wenbo Li, Tahreem Yasir, Aryan Santhosh Kumar, Laura Widman, Dongkuan Xu, Tiffany Barnes

Main category: cs.CL

TL;DR: SafeTalkCoach is a multi-agent dialogue generation framework that simulates diverse and realistic parent-child conversations about sexual health, addressing the scarcity of real-world data on this sensitive topic.

Details

Motivation: Real-world data on parent-child sexual health conversations is scarce due to their private and sensitive nature. Existing LLM-based dialogue generation approaches often lack realism, diversity, and adherence to best practices in this domain.

Method: A diversity-driven multi-agent framework integrating crowd-sourced/synthesized scenarios, sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification to generate realistic conversations.

Result: SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability, as demonstrated through evaluations. The framework includes an accompanying dataset.

Conclusion: The SafeTalkCoach framework and dataset can support both AI research and health communication practices by providing realistic simulations of sensitive parent-child conversations about sexual health.

Abstract: The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

[6] Construct, Align, and Reason: Large Ontology Models for Enterprise Knowledge Management

Yao Zhang, Hongyin Zhu

Main category: cs.CL

TL;DR: LOM is a large ontology model framework for enterprise knowledge management that integrates structured and unstructured data into a unified ontology, with a three-stage training pipeline for semantic reasoning.

Details

Motivation: Enterprise knowledge management struggles with integrating heterogeneous data sources and lacks semantic understanding for complex reasoning. Traditional knowledge graphs have limitations in discovering implicit relationships and supporting sophisticated question answering.

Method: Proposes a construct-align-reason framework: 1) Build dual-layer enterprise ontology from structured databases and unstructured text, 2) Three-stage training: ontology instruction fine-tuning for structure understanding, text-ontology grounding for semantic encoding, and multi-task instruction tuning with curriculum learning on ontology-language pairs.

Result: The 4B-parameter LOM achieves 89.47% accuracy on their benchmark, outperforming DeepSeek-V3.2 on complex graph reasoning tasks, demonstrating effective fusion of ontology structure and language understanding.

Conclusion: LOM successfully addresses enterprise knowledge integration challenges by combining structured and unstructured data into a unified ontology with enhanced semantic reasoning capabilities through specialized training techniques.

Abstract: Enterprise-scale knowledge management faces significant challenges in integrating multi-source heterogeneous data and enabling effective semantic reasoning. Traditional knowledge graphs often struggle with implicit relationship discovery and lack sufficient semantic understanding for complex question answering. To address these limitations, we introduce a unified construct–align–reason framework, the large ontology model (LOM). We first build a dual-layer enterprise ontology from structured databases and unstructured text, subsequently fusing these sources into a comprehensive enterprise ontology. To enable instruction-aligned reasoning, we propose a unified three-stage training pipeline: ontology instruction fine-tuning to improve structural understanding; text-ontology grounding to strengthen node semantic encoding; and multi-task instruction tuning on ontology-language pairs with curriculum learning to enhance semantic reasoning and generation. We also construct comprehensive training and evaluation datasets covering diverse ontology reasoning tasks. On this benchmark, our 4B-parameter LOM achieves 89.47% accuracy and outperforms DeepSeek-V3.2 on complex graph reasoning, indicating effective fusion of ontology structure and language.

[7] Reversible Diffusion Decoding for Diffusion Language Models

Xinyun Wang, Min Zhang, Sen Cui, Zhikang Chen, Bo Jiang, Kun Kuang, Mingbao Lin

Main category: cs.CL

TL;DR: RDD introduces reversibility into diffusion language models to recover from early commitment errors during parallel token generation, improving robustness while maintaining efficiency.

Details

Motivation: Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation where the reverse diffusion process fails to progress under suboptimal context, requiring a solution to recover from early errors.

Method: Proposes Reversible Diffusion Decoding (RDD) that detects stagnation as state-dependent failure and enables efficient backtracking to earlier blocks without recomputation via cached model states, using confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable context.

Result: Experiments show RDD improves generation robustness and quality over baselines with minimal computational overhead.

Conclusion: RDD’s reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation.

Abstract: Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress under a suboptimal context.We propose Reversible Diffusion Decoding (RDD), a decoding framework that introduces reversibility into block-wise diffusion generation. RDD detects stagnation as a state-dependent failure of the reverse process and enables efficient backtracking to earlier blocks without recomputation via cached model states. To avoid repeated failure trajectories, RDD applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable context.This reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation. Experiments show that RDD improves generation robustness and quality over baselines with minimal computational overhead.

[8] DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

Tianyi Hu, Niket Tandon, Akhil Arora

Main category: cs.CL

TL;DR: DIVERGE is a plug-and-play agentic RAG framework that addresses the limitation of standard RAG systems in handling open-ended questions with multiple plausible answers by promoting diverse viewpoints while maintaining answer quality.

Details

Motivation: Standard RAG systems assume single correct answers, overlooking scenarios with multiple plausible answers where diversity is essential. Current systems underutilize retrieved context diversity, limiting creativity and compromising fair information access.

Method: Proposes DIVERGE framework with reflection-guided generation and memory-augmented iterative refinement to promote diverse viewpoints. Introduces novel metrics for evaluating diversity-quality trade-off in open-ended questions.

Result: DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality.

Conclusion: Reveals systematic limitation of current LLM-based systems for open-ended information-seeking and shows that explicitly modeling diversity can mitigate it. The framework addresses underutilization of retrieved context diversity in standard RAG systems.

Abstract: Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: https://github.com/au-clan/Diverge

[9] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach

Main category: cs.CL

TL;DR: Benchmark study of uncertainty quantification methods for LLMs in scientific QA, revealing limitations in current approaches and calibration practices.

Details

Motivation: Reliable uncertainty quantification is critical for trustworthy adoption of LLMs in scientific question answering, but existing UQ approaches remain weakly validated in scientific QA settings that require fact-retrieval and reasoning capabilities.

Method: Created large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA, studying calibration of UQ methods. Analyzed 20 LLMs (base, instruction-tuned, reasoning variants) across 7 scientific QA datasets with 685,000 long-form responses. Evaluated representative UQ approaches at token and sequence levels.

Result: Instruction tuning causes probability mass polarization, reducing reliability of token-level confidences. Reasoning fine-tuning mitigates this effect depending on provider. At sequence level, verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields best calibration. ECE alone is misleading for judging UQ performance.

Conclusion: Current UQ methods for LLMs have critical limitations, and standard benchmarking practices are inadequate. Answer frequency provides most reliable calibration, while instruction tuning negatively impacts uncertainty estimation reliability.

Abstract: Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.

[10] Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

Xilin Gong, Shu Yang, Zehua Cao, Lynne Billard, Di Wang

Main category: cs.CL

TL;DR: LLMs in Patchscopes framework show unfaithfulness due to linguistic biases overriding contextual information; proposed BALOR method recalibrates logits to suppress bias and amplify context.

Details

Motivation: LLMs using Patchscopes framework for interpreting hidden representations tend to rely on inherent linguistic patterns rather than contextual information encoded in representations, leading to systematic unfaithfulness in explanations.

Method: Created dataset to evaluate faithfulness under biased cases; proposed Bias Alignment through Logit Recalibration (BALOR) which contrasts output logits from unpatched prompts (capturing bias) with logits from patched contextual information, then recalibrates distribution.

Result: Found 18.84% average faithfulness decrease due to bias; BALOR consistently outperforms baselines across multiple LLMs with up to 33% relative performance improvement.

Conclusion: LLMs’ linguistic biases significantly impact faithfulness in representation interpretation; BALOR effectively mitigates this issue by recalibrating logits to prioritize contextual information over inherent biases.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute “purple” for “broccoli”, LLMs still generate “green” in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33% relative performance improvement.

[11] Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

Zhijie Huang, Stephen McIntosh, Daisuke Saito, Nobuaki Minematsu

Main category: cs.CL

TL;DR: Kanade is a single-layer disentangled speech tokenizer that separates acoustic constants to create a single token stream capturing phonetics and prosody while suppressing speaker identity, achieving SOTA speaker disentanglement and lexical availability with excellent reconstruction.

Details

Motivation: Tokenization is crucial for speech modeling, which must handle continuous signals mixing linguistic and non-linguistic information. A good speech tokenizer should extract phonetics and prosody while suppressing irrelevant information like speaker identity and enable high-quality synthesis.

Method: Kanade uses a single-layer disentangled speech tokenizer that separates out acoustic constants to create a single stream of tokens capturing rich phonetics and prosody, without needing auxiliary methods that existing disentangled codecs often rely on.

Result: Kanade achieves state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.

Conclusion: Kanade presents an effective single-layer disentangled speech tokenizer that successfully extracts linguistic information while suppressing non-linguistic features like speaker identity, enabling better speech modeling and synthesis.

Abstract: A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.

[12] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Rodrigo Batista, Luís Filipe Cunha, Purificação Silvano, Nuno Guimarães, Alípio Jorge, Evelin Amorim, Ricardo Campos

Main category: cs.CL

TL;DR: Two-stage pipeline for extracting metadata from municipal meeting minutes using QA models to locate metadata segments followed by transformer-based NER with deslexicalization, benchmarked against LLMs.

Details

Motivation: Municipal meeting minutes have heterogeneous formats and lack standardized metadata, making automated extraction challenging. Existing NER models are not adapted to domain-specific categories in this specialized domain.

Method: Two-stage approach: 1) QA model identifies opening/closing segments containing metadata, 2) Transformer models (BERTimbau, XLM-RoBERTa with/without CRF layer) perform fine-grained entity extraction enhanced by deslexicalization. Benchmarked against Phi and Gemini LLMs.

Result: Strong in-domain performance outperforming larger general-purpose LLMs, but reduced generalization across municipalities due to variability and linguistic complexity. Established first benchmark for this domain.

Conclusion: Proposed pipeline provides solid foundation for metadata extraction from municipal records, though cross-municipality generalization remains challenging due to domain-specific variability.

Abstract: Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through deslexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.

[13] A Baseline Multimodal Approach to Emotion Recognition in Conversations

Víctor Yeste, Rodrigo Rivas-Arévalo

Main category: cs.CL

TL;DR: Lightweight multimodal baseline for emotion recognition in conversations using text and speech features with late-fusion ensemble on Friends sitcom dataset

Details

Motivation: To provide an accessible reference implementation for multimodal emotion recognition in conversations, not aiming for state-of-the-art but for transparency and future comparison

Method: Combines transformer-based text classifier with self-supervised speech representation model using simple late-fusion ensemble on SemEval-2024 Task 3 dataset

Result: Reports baseline setup and empirical results under limited training protocol, highlighting when multimodal fusion improves over unimodal models

Conclusion: Provides transparent baseline for future comparisons in multimodal emotion recognition from conversational audio and text

Abstract: We present a lightweight multimodal baseline for emotion recognition in conversations using the SemEval-2024 Task 3 dataset built from the sitcom Friends. The goal of this report is not to propose a novel state-of-the-art method, but to document an accessible reference implementation that combines (i) a transformer-based text classifier and (ii) a self-supervised speech representation model, with a simple late-fusion ensemble. We report the baseline setup and empirical results obtained under a limited training protocol, highlighting when multimodal fusion improves over unimodal models. This preprint is provided for transparency and to support future, more rigorous comparisons.

[14] Detecting AI-Generated Content in Academic Peer Reviews

Siyuan Shen, Kai Wang

Main category: cs.CL

TL;DR: Study detects increasing AI-generated content in peer reviews at ICLR and Nature Communications from 2022-2025, with 20% of ICLR and 12% of NC reviews classified as AI-generated in 2025.

Details

Motivation: To examine the temporal emergence and prevalence of AI-generated content in academic peer reviews, addressing growing concerns about LLMs' role in scholarly evaluation processes.

Method: Applied a detection model trained on historical peer reviews to analyze later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC), tracking temporal patterns from pre-2022 through 2025.

Result: Minimal AI-generated content detected before 2022, followed by substantial increase through 2025: approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. Most pronounced growth in NC occurred between Q3 and Q4 2024.

Conclusion: Evidence suggests rapidly increasing presence of AI-assisted content in peer review, highlighting need for further study of implications for scholarly evaluation processes.

Abstract: The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.

[15] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

Sheng-Lun Wei, Yu-Ling Liao, Yen-Hua Chang, Hen-Hsen Huang, Hsin-Hsi Chen

Main category: cs.CL

TL;DR: First systematic investigation of speech bias in multilingual MLLMs using BiasInEar dataset spanning English, Chinese, Korean with gender/accent balance, revealing MLLMs are robust to demographic factors but sensitive to language and option order.

Details

Motivation: To systematically investigate speech bias in multilingual multimodal large language models (MLLMs) and establish a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation.

Method: Constructed BiasInEar dataset (70.8 hours, 11,200 questions) based on Global MMLU Lite spanning English, Chinese, Korean balanced by gender and accent. Evaluated nine representative models using four complementary metrics (accuracy, entropy, APES, Fleiss’ κ) under linguistic (language/accent), demographic (gender), and structural (option order) perturbations.

Result: MLLMs are relatively robust to demographic factors (gender) but highly sensitive to language and option order, suggesting speech can amplify existing structural biases. Architectural design and reasoning strategy substantially affect robustness across languages.

Conclusion: Establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, providing resources for future research on speech bias in multilingual MLLMs.

Abstract: This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours ($\approx$4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss’ $κ$), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at https://github.com/ntunlplab/BiasInEar.

[16] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

Li Siyan, Darshan Deshpande, Anand Kannappan, Rebecca Qian

Main category: cs.CL

TL;DR: DETOUR is a dual-agent benchmark for evaluating tip-of-the-tongue search across multiple modalities (text, image, audio, video) in multi-turn conversations, showing current models struggle with only 36% accuracy.

Details

Motivation: Existing benchmarks for tip-of-the-tongue search are limited to single-turn settings, failing to capture the realistic multi-turn nature of human recollection processes in conversation.

Method: Introduces DETOUR benchmark with 1,011 prompts using a dual-agent design: Primary Agent (subject of evaluation) queries a Memory Agent to identify recollected entities across text, image, audio, and video modalities.

Result: Current state-of-the-art models achieve only 36% accuracy on the full multimodal benchmark, demonstrating significant challenges in underspecified search scenarios.

Conclusion: The benchmark reveals substantial gaps in current models’ ability to handle realistic tip-of-the-tongue search across multiple modalities, highlighting the need for improved capabilities in underspecified scenarios.

Abstract: When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.

[17] DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models

Zhaochen Hong, Jiaxuan You

Main category: cs.CL

TL;DR: DecompressionLM is a zero-shot concept graph extraction framework that discovers what language models encode without pre-defined queries or shared cross-sequence state, using Van der Corput sequences for deterministic parallel generation.

Details

Motivation: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. The paper aims to overcome three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints from sequential exploration.

Method: DecompressionLM uses Van der Corput low-discrepancy sequences with arithmetic decoding to enable deterministic, embarrassingly parallel generation without shared state across sequences. It’s a stateless framework for zero-shot concept graph extraction that doesn’t require pre-specified queries.

Result: Across two model families and five quantization variants, activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse. Corpus-based verification reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models.

Conclusion: DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models, useful for their deployment. The method reveals divergent behaviors in quantized models not reliably reflected by explanation-level perplexity.

Abstract: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse – divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models useful for their deployment.

[18] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

Sercan Karakaş

Main category: cs.CL

TL;DR: LLMs tested on Turkish reflexive pronoun binding show different patterns: Trendyol-LLM strongly prefers local antecedents while OpenAI o1 Mini shows more balanced distribution between local and non-local bindings.

Details

Motivation: To evaluate whether state-of-the-art large language models properly capture the binding relations of Turkish reflexive pronouns, specifically testing their understanding of local vs. non-local antecedents for reflexives like "kendi" and "kendisi".

Method: Constructed a balanced set of 100 Turkish sentences with local vs. non-local antecedents for reflexives. Tested two systems: OpenAI chain-of-thought model (o1 Mini) and Trendyol-LLM-7B-base-v0.1 (LLaMA-2-derived, fine-tuned on Turkish). Used combined sentence-level perplexity and forced-choice paradigm to assess antecedent choices.

Result: Trendyol-LLM favored local bindings in ~70% of trials, showing strong locality bias. OpenAI o1 Mini distributed choices almost evenly between local and long-distance readings, revealing marked contrast in binding behavior between the two systems.

Conclusion: Different LLMs exhibit distinct binding behaviors for Turkish reflexives, with fine-tuned models showing stronger locality bias while reasoning-focused models show more balanced patterns, indicating varying linguistic competence across architectures.

Abstract: This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.

[19] Segment-Level Attribution for Selective Learning of Long Reasoning Traces

Siyuan Wang, Yanchen Liu, Xiang Ren

Main category: cs.CL

TL;DR: Selective SFT framework identifies important reasoning segments using integrated gradient attribution to improve model efficiency and accuracy by focusing learning on reflective reasoning patterns.

Details

Motivation: Current Large Reasoning Models generate verbose chains of thought with significant redundancy, where only a small fraction meaningfully contributes to answer prediction. This redundancy propagates through supervised finetuning as models learn to imitate uninformative patterns, degrading performance.

Method: Uses integrated gradient attribution to quantify token influence on final answers, aggregating into two segment-level metrics: attribution strength (overall magnitude) and direction consistency (uniformity of attribution signs). Proposes selective SFT framework that identifies important segments with high attribution strength but moderate consistency (indicating reflective reasoning), then applies selective SFT on these segments while masking loss for unimportant ones.

Result: Experiments across multiple models and datasets show improved accuracy and output efficiency, enabling more effective learning from long reasoning traces.

Conclusion: The selective learning framework based on attribution analysis effectively identifies and focuses on important reasoning segments, improving model performance and efficiency while reducing redundancy in chain-of-thought reasoning.

Abstract: Large Reasoning Models (LRMs) achieve strong reasoning performance by generating long chains of thought (CoTs), yet only a small fraction of these traces meaningfully contributes to answer prediction, while the majority contains repetitive or truncated content. Such output redundancy is further propagated after supervised finetuning (SFT), as models learn to imitate verbose but uninformative patterns, which can degrade performance. To this end, we incorporate integrated gradient attribution to quantify each token’s influence on final answers and aggregate them into two segment-level metrics: (1) \textit{attribution strength} measures the overall attribution magnitude; and (2) \textit{direction consistency} captures whether tokens’ attributions within a segment are uniformly positive or negative (high consistency), or a mixture of both (moderate consistency). Based on these two metrics, we propose a segment-level selective learning framework to identify important segments with high attribution strength but moderate consistency that indicate reflective rather than shallow reasoning. The framework then applies selective SFT on these important segments while masking loss for unimportant ones. Experiments across multiple models and datasets show that our approach improves accuracy and output efficiency, enabling more effective learning from long reasoning traces~\footnote{Code and data are available at https://github.com/SiyuanWangw/SegmentSelectiveSFT}.

[20] When Agents “Misremember” Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou, Changjiang Li, Tianyu Du, Zhihui Fu, Jun Wang, Shouling Ji

Main category: cs.CL

TL;DR: Study of Mandela effect (collective memory bias) in LLM-based multi-agent systems, proposing MANBENCH benchmark and mitigation strategies

Details

Motivation: Multi-agent LLM systems are vulnerable to collective cognitive biases like the Mandela effect, where groups misremember events due to social influence and misinformation, raising ethical concerns about misinformation spread

Method: Proposed MANBENCH benchmark with 4 task types susceptible to Mandela effect and 5 interaction protocols; evaluated multiple LLMs; proposed mitigation strategies including prompt-level defenses (cognitive anchoring, source scrutiny) and model-level alignment-based defense

Result: Quantified Mandela effect in LLM-based multi-agent systems; achieved 74.40% average reduction in Mandela effect compared to baseline using proposed mitigation strategies

Conclusion: Findings provide insights for developing more resilient and ethically aligned collaborative multi-agent systems by addressing collective memory biases

Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.

[21] What Matters to an LLM? Behavioral and Computational Evidences from Summarization

Yongxin Zhou, Changshun Wu, Philippe Mulhem, Didier Schwab, Maxime Peyrard

Main category: cs.CL

TL;DR: LLMs show consistent importance patterns in summarization that differ from pre-LLM baselines, with middle-to-late layers strongly predictive of importance selection.

Details

Motivation: To understand the internal notion of importance that drives LLMs' information selection in summarization tasks, as this remains hidden despite their state-of-the-art performance.

Method: Combined behavioral and computational analyses: generating length-controlled summaries to derive empirical importance distributions, analyzing attention heads alignment with importance patterns, and examining layer-wise importance prediction.

Result: LLMs converge on consistent importance patterns different from pre-LLM baselines, cluster more by model family than size, certain attention heads align with importance distributions, and middle-to-late layers strongly predict importance.

Conclusion: The study provides initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening paths for interpreting and controlling information selection in LLMs.

Abstract: Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.

[22] Words that make SENSE: Sensorimotor Norms in Learned Lexical Token Representations

Abhinav Gupta, Toben H. Mintz, Jesse Thomason

Main category: cs.CL

TL;DR: SENSE is a model that predicts sensorimotor associations from word embeddings, bridging computational linguistics with grounded language understanding through sensory experience.

Details

Motivation: Traditional word embeddings capture meaning from co-occurrence patterns, but human language understanding is grounded in sensory and motor experiences. The paper aims to bridge this gap by developing computational models that can predict sensorimotor associations from text.

Method: Developed SENSE (Sensorimotor Embedding Norm Scoring Engine), a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. Also conducted a behavioral study with 281 participants who selected which candidate nonce words evoked specific sensorimotor associations.

Result: Found statistically significant correlations between human selection rates and SENSE ratings across 6 of 11 sensorimotor modalities. Sublexical analysis revealed systematic phonosthemic patterns for the interoceptive norm, suggesting computational methods can identify candidate phonesthemes from text data.

Conclusion: SENSE successfully bridges computational linguistics with grounded language understanding by predicting sensorimotor associations from word embeddings, demonstrating that computational models can capture aspects of embodied language understanding.

Abstract: While word embeddings derive meaning from co-occurrence patterns, human language understanding is grounded in sensory and motor experience. We present $\text{SENSE}$ $(\textbf{S}\text{ensorimotor }$ $\textbf{E}\text{mbedding }$ $\textbf{N}\text{orm }$ $\textbf{S}\text{coring }$ $\textbf{E}\text{ngine})$, a learned projection model that predicts Lancaster sensorimotor norms from word lexical embeddings. We also conducted a behavioral study where 281 participants selected which among candidate nonce words evoked specific sensorimotor associations, finding statistically significant correlations between human selection rates and $\text{SENSE}$ ratings across 6 of the 11 modalities. Sublexical analysis of these nonce words selection rates revealed systematic phonosthemic patterns for the interoceptive norm, suggesting a path towards computationally proposing candidate phonosthemes from text data.

[23] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation

Zhexiong Liu, Diane Litman

Main category: cs.CL

TL;DR: Intention-Tuning: A layer-wise LLM fine-tuning framework for revision generation that dynamically selects subsets of LLM layers to learn writer intentions and transfers representations to generate revisions, effective on small corpora.

Details

Motivation: LLMs excel at context-based text generation but struggle with intention-based tasks like revision generation, especially in multi-intent scenarios. Fine-tuning requires large annotated data which is scarce and expensive in revision tasks.

Method: Proposes Intention-Tuning, an intention-adaptive layer-wise fine-tuning framework that dynamically selects a subset of LLM layers to learn writer intentions, then transfers these representations to revision generation.

Result: Experimental results show Intention-Tuning is effective and efficient on small revision corpora, outperforming several Parameter-Efficient Fine-Tuning (PEFT) baselines.

Conclusion: The framework addresses the challenge of intention-based revision generation with limited data by adaptively learning intentions through selective layer tuning and representation transfer.

Abstract: Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer’s actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, they struggle with handling entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.

[24] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas

Zhaokun Yan, Zhaohan Liu, Wuzheng Dong, Lijie Feng, Chengxiao Dai

Main category: cs.CL

TL;DR: GlobalHealthAtlas: A large-scale multilingual dataset for public health reasoning with 280K instances across 15 domains and 17 languages, featuring difficulty stratification and LLM-assisted quality control pipeline.

Details

Motivation: Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints, but remains underexplored as a structured ML problem with limited supervised signals and benchmarks.

Method: Created a large-scale multilingual dataset (280,210 instances) spanning 15 public health domains and 17 languages, stratified into three difficulty levels. Used LLM-assisted construction pipeline with retrieval, duplication checks, evidence grounding, and label validation. Developed a domain-aligned evaluator distilled from LLM judgments across six dimensions.

Result: Built GlobalHealthAtlas dataset enabling reproducible training and evaluation of LLMs for safety-critical public health reasoning beyond conventional QA benchmarks.

Conclusion: The contributions enable training and evaluation of LLMs for safety-critical public health reasoning, addressing the gap in structured ML approaches for this domain with proper benchmarks and evaluation frameworks.

Abstract: Public health reasoning requires population level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce \textbf{GlobalHealthAtlas}, a large scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice based evaluation. We further propose large language model (LLM) assisted construction and quality control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain aligned evaluator distilled from high confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety critical public health reasoning beyond conventional QA benchmarks.

[25] Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design

Hanjing Shi, Dominic DiFranzo

Main category: cs.CL

TL;DR: This paper proposes a culturally grounded governance framework for multilingual large language models to address inequities in training data, misalignment with local norms, and limited accountability for marginalized language communities.

Details

Motivation: Current MLLM governance frameworks assume English-centric data and homogeneous users, creating systematic risks for low-resource languages and culturally marginalized communities where data practices and accountability mechanisms fail to align with local norms and rights.

Method: The paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, drawing on cross-cultural perspectives in human-centered computing and AI governance to articulate a conceptual governance framework.

Result: Identifies three key governance challenges: cultural/linguistic inequities in training data and evaluation, misalignment between global deployment and local norms/values/power structures, and limited accountability mechanisms for marginalized language communities.

Conclusion: Proposes a conceptual agenda reframing multilingual AI governance as a sociocultural and rights-based problem, outlining design and policy implications for data stewardship, transparency, and participatory accountability to prevent reproduction of global inequalities.

Abstract: Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices, misalignment between global deployment and locally situated norms, values, and power structures, and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.

[26] Reasoning by Commented Code for Table Question Answering

Seho Pyo, Jiheon Seok, Jaejin Lee

Main category: cs.CL

TL;DR: A commented step-by-step code generation framework for TableQA that decomposes reasoning into multi-line Python programs with natural language comments, improving numerical accuracy and interpretability over existing methods.

Details

Motivation: Conventional linearization of tables disrupts 2D relationships in structured data, and existing TableQA methods have limited numerical accuracy and reduced interpretability due to end-to-end answer generation or single-line program queries.

Method: Introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into Python program generation. Decomposes TableQA reasoning into multi-line executable programs with concise natural language comments to promote clearer reasoning.

Result: Achieves 70.9% accuracy on WikiTableQuestions benchmark using Qwen2.5-Coder-7B-Instruct, surpassing Repanda baseline (67.6%). Combined with robust end-to-end TableQA model via lightweight answer-selection mechanism achieves up to 84.3% accuracy.

Conclusion: The commented step-by-step code generation framework improves TableQA performance by making reasoning more explicit and interpretable, and can be effectively combined with existing models for further accuracy gains.

Abstract: Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3% accuracy on the WikiTableQuestions benchmark.

[27] A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora

Liu Kaipeng, Wu Ling

Main category: cs.CL

TL;DR: BERT processes four Argument Structure Constructions with hierarchical representations: construction-specific info emerges early, forms maximally separable clusters in middle layers, and persists through later stages.

Details

Motivation: To understand how BERT processes fundamental Argument Structure Constructions and examine the hierarchical nature of construction-specific representations across model layers.

Method: Multi-dimensional analytical framework combining MDS and t-SNE for dimensionality reduction, GDV for cluster separation metrics, FDR for linear diagnostic probing, and attention mechanism analysis.

Result: Reveals hierarchical representational structure: construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.

Conclusion: BERT develops hierarchical representations of argument structure constructions with optimal separation in middle layers, suggesting systematic processing of linguistic constructions.

Abstract: This study investigates how the Bidirectional Encoder Representations from Transformers model processes four fundamental Argument Structure Constructions. We employ a multi-dimensional analytical framework, which integrates MDS, t-SNE as dimensionality reduction, Generalized Discrimination Value (GDV) as cluster separation metrics, Fisher Discriminant Ratio (FDR) as linear diagnostic probing, and attention mechanism analysis. Our results reveal a hierarchical representational structure. Construction-specific information emerges in early layers, forms maximally separable clusters in middle layers, and is maintained through later processing stages.

[28] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

Shree Harsha Bokkahalli Satish, Gustav Eje Henter, Éva Székely

Main category: cs.CL

TL;DR: Current MCQA bias benchmarks for SpeechLLMs show limited cross-task generalization - models trained for specific MCQA behaviors don’t reliably transfer to other MCQA tasks or long-form generation tasks.

Details

Motivation: To investigate whether current multiple-choice question answering (MCQA) bias benchmarks for speech large language models reliably predict model behavior across different MCQA tasks and more realistic long-form generation tasks.

Method: Fine-tuned three SpeechLLMs using LoRA adapters to induce specific MCQA behaviors (preference for stereotypical, anti-stereotypical, or neutral answers), then evaluated generalization to another MCQA benchmark and long-form creative generation tasks.

Result: Performance on MCQA bias benchmarks fails to reliably predict performance across other MCQA benchmarks and long-form tasks, showing limited evidence of cross-task generalization in the speech domain.

Conclusion: Current MCQA bias benchmarks have limited generalizability and don’t reliably predict model behavior in more realistic settings; authors propose an evaluation suite for measuring behavior transferability in future models.

Abstract: Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked to choose between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performances across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and also propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.

[29] The French Drama Revolution: Political Economy and Literary Production, 1700-1900

Thiago Dumont Oliveira

Main category: cs.CL

TL;DR: Analysis of French drama evolution from 1700-1900 using topic modeling shows significant thematic shifts post-French Revolution, with bourgeois themes rising and coevolution with economic growth patterns.

Details

Motivation: To understand how French drama evolved over two centuries and how its thematic content related to major political and economic changes, particularly the French Revolution and industrialization.

Method: Used Latent Dirichlet Allocation (LDA) for topic modeling on French drama texts from 1700-1900, then applied Jensen-Shannon Divergence to measure topic distribution changes over time.

Result: Found profound changes in topical distribution after the French Revolution (1789-1850), with bourgeois themes becoming prevalent from late 18th century onward. Topic prevalence patterns showed coevolution with French GDP trends.

Conclusion: French drama’s thematic evolution closely tracked political and economic transformations, particularly the rise of bourgeois themes reflecting societal changes during industrialization and post-revolutionary periods.

Abstract: This paper investigates the changing nature of French drama between 1700-1900 using Latent Dirichlet Allocation and Jensen-Shannon Divergence. Results indicate that the topical distribution of French drama changed profoundly after the French Revolution, particularly between 1789 and 1850. Bourgeois themes emerged among the most prevalent topics since the late 18th century. To assess the coevolution of drama and economic growth, I plot the yearly prevalence of topics alongside French GDP between 1700-1900, and discuss these changes in light of the political and economic changes prompted by the French Revolution and the industrialization of the country.

[30] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long

Main category: cs.CL

TL;DR: Enhanced LLM-based ASR framework combining Whisper and mHuBERT encoders with cross-attention fusion for multilingual conversational speech recognition, achieving competitive results but still lagging behind fine-tuned E2E Whisper models.

Details

Motivation: To address limitations in previous LLM-based ASR systems: simple feature concatenation may not fully exploit complementary information from different speech encoders, and the performance gap between LLM-based ASR and end-to-end encoder-decoder ASR remains unexplored.

Method: Proposes enhanced LLM-based ASR framework with fine-tuned Whisper and mHuBERT encoders, evaluates E2E Whisper models with LoRA and full fine-tuning, and introduces cross-attention-based fusion mechanisms for parallel-speech-encoder architecture.

Result: Achieved CER/WER of 10.69% on MLC-SLM Challenge evaluation set, ranking on par with top systems despite using only 1,500 hours of training data vs. competitors’ large-scale training sets. However, final LLM-based ASR still underperforms compared to fine-tuned E2E Whisper model.

Conclusion: The enhanced LLM-based ASR framework shows competitive performance but reveals that current LLM-based approaches still cannot match fine-tuned E2E models, providing valuable empirical guidance for future Speech-LLM design.

Abstract: The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end(E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.

[31] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling

Chaoqun Cui, Shijing Wang, Liangbin Huang, Qingqing Gu, Zhaolong Huang, Xiao Zeng, Wenji Mao

Main category: cs.CL

TL;DR: Hermes is an LLM-based framework for automated interlingual subtitling that addresses challenges in semantic coherence, pronoun/terminology translation, and expressiveness through integrated modules for speaker diarization, terminology identification, and expressiveness enhancement.

Details

Motivation: Interlingual subtitling for visual media localization hasn't been explored in machine translation despite LLM advances. Subtitle texts present unique challenges including semantic coherence, pronoun/terminology translation, and translation expressiveness that standard MT approaches don't address.

Method: Hermes framework integrates three modules: 1) Speaker Diarization to handle multiple speakers, 2) Terminology Identification for domain-specific terms, and 3) Expressiveness Enhancement to maintain natural language flow and emotional tone in translations.

Result: Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, advancing research in automated interlingual subtitling.

Conclusion: The Hermes framework successfully addresses key challenges in interlingual subtitling through specialized modules, demonstrating the potential of LLM-based approaches for visual media localization.

Abstract: Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.

[32] Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars

Yitong Zhang, Yongmin Li, Yuetong Liu, Jia Li, Xiaoran Jia, Zherui Li, Ge Li

Main category: cs.CL

TL;DR: LAVE is a constrained decoding approach for Diffusion Large Language Models that uses parallel lookahead verification to ensure grammatical correctness during generation, addressing the unique challenges of non-autoregressive models.

Details

Motivation: Diffusion LLMs struggle with generating syntactically valid outputs for formal languages like source code and chemical expressions. Existing constrained decoding techniques don't work well with non-autoregressive dLLMs, and current dLLM-specific approaches allow intermediate outputs that can't be completed into valid sentences.

Method: LAVE leverages dLLMs’ ability to predict token distributions for all positions in parallel. When a new token is proposed, it performs lookahead using these distributions to efficiently verify token validity, ensuring intermediate outputs can always be extended into valid sentences.

Result: Extensive experiments across four dLLMs and three benchmarks show LAVE consistently outperforms existing baselines with substantial improvements in syntactic correctness, while adding negligible runtime overhead.

Conclusion: LAVE provides an effective constrained decoding approach specifically designed for dLLMs that reliably enforces grammatical correctness during generation, addressing key limitations of current methods.

Abstract: Diffusion Large Language Models (dLLMs) have demonstrated promising generative capabilities and are increasingly used to produce formal languages defined by context-free grammars, such as source code and chemical expressions. However, as probabilistic models, they still struggle to generate syntactically valid outputs reliably. A natural and promising direction to address this issue is to adapt constrained decoding techniques to enforce grammatical correctness during generation. However, applying these techniques faces two primary obstacles. On the one hand, the non-autoregressive nature of dLLMs renders most existing constrained decoding approaches inapplicable. On the other hand, current approaches specifically designed for dLLMs may allow intermediate outputs that are impossible to complete into valid sentences, which significantly limits their reliability in practice. To address these challenges, we present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Whenever a new token is proposed by model, LAVE performs lookahead using these distributions to efficiently and reliably verify the validity of the proposed token. This design ensures reliable constraints by reliably preserving the potential for intermediate outputs to be extended into valid sentences. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.

[33] Transformer-Based Model for Multilingual Hope Speech Detection

Nsrin Ashraf, Mariam Labib, Hamada Nayel

Main category: cs.CL

TL;DR: The paper presents a transformer-based system for hope speech detection in English and German, evaluating RoBERTa for English and XLM-RoBERTa for both languages, achieving competitive performance metrics.

Details

Motivation: To develop effective systems for hope speech detection across multiple languages, leveraging pre-trained large language models to enhance performance in natural language processing tasks.

Method: Implemented and evaluated various transformers: RoBERTa for English-only detection and multilingual XLM-RoBERTa for both English and German hope speech classification.

Result: RoBERTa achieved weighted f1-score of 0.818 and accuracy of 81.8% for English. XLM-RoBERTa achieved weighted f1-score of 0.786 and accuracy of 78.5% for both languages.

Conclusion: The results demonstrate the importance of pre-trained large language models in enhancing NLP task performance, particularly for multilingual applications like hope speech detection.

Abstract: This paper describes a system that has been submitted to the “PolyHope-M” at RANLP2025. In this work various transformers have been implemented and evaluated for hope speech detection for English and Germany. RoBERTa has been implemented for English, while the multilingual model XLM-RoBERTa has been implemented for both English and German languages. The proposed system using RoBERTa reported a weighted f1-score of 0.818 and an accuracy of 81.8% for English. On the other hand, XLM-RoBERTa achieved a weighted f1-score of 0.786 and an accuracy of 78.5%. These results reflects the importance of improvement of pre-trained large language models and how these models enhancing the performance of different natural language processing tasks.

[34] Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries

Yuchen Zhang, Ravi Shekhar, Haralambos Mouratidis

Main category: cs.CL

TL;DR: Proposes family-based connector sharing for multilingual LLM-powered ASR, reducing parameters while improving cross-domain generalization by grouping languages by linguistic families.

Details

Motivation: Current LLM-powered ASR systems use separate connectors per language, ignoring linguistic relatedness and creating parameter inefficiency. Need for more efficient multilingual deployment strategies.

Method: Proposes connector-sharing strategy based on linguistic family membership - one connector per language family instead of per language. Links frozen speech encoder to pretrained LLM via lightweight connectors grouped by linguistic families.

Result: Family-based connectors reduce parameter count while improving generalization across domains. Validated across two multilingual LLMs and two real-world corpora (curated and crowd-sourced speech).

Conclusion: Linguistic family-based connector sharing offers practical and scalable strategy for multilingual ASR deployment, balancing efficiency with performance.

Abstract: Large Language Model (LLM)-powered Automatic Speech Recognition (ASR) systems achieve strong performance with limited resources by linking a frozen speech encoder to a pretrained LLM via a lightweight connector. Prior work trains a separate connector per language, overlooking linguistic relatedness. We propose an efficient and novel connector-sharing strategy based on linguistic family membership, enabling one connector per family, and empirically validate its effectiveness across two multilingual LLMs and two real-world corpora spanning curated and crowd-sourced speech. Our results show that family-based connectors reduce parameter count while improving generalization across domains, offering a practical and scalable strategy for multilingual ASR deployment.

[35] Jailbreaking LLMs via Calibration

Yuxuan Lu, Yongkang Guo, Yuqing Kong

Main category: cs.CL

TL;DR: Safety alignment in LLMs creates systematic discrepancies; the paper proposes modeling alignment as distortion of pre-alignment distributions and frames jailbreaking as forecast aggregation with optimal strategies derived from loss-induced dual spaces.

Details

Motivation: Safety alignment in LLMs creates systematic discrepancies between aligned outputs and underlying pre-aligned data distributions, which can be exploited for jailbreaking attacks. The paper aims to formalize this phenomenon and develop more effective jailbreaking methods.

Method: Proposes a framework modeling safety alignment effects as systematic distortion of pre-alignment distributions. Casts Weak-to-Strong Jailbreaking as a forecast aggregation problem, deriving optimal aggregation strategies characterized by Gradient Shift in loss-induced dual spaces. Shows existing logit-arithmetic methods are special cases under cross-entropy loss, and derives broader family of aggregation rules for other proper losses. Introduces a new hybrid aggregation rule.

Result: Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate superior Attack Success Rates and lower “Jailbreak Tax” compared to existing methods, especially on safety-hardened gpt-oss-120b.

Conclusion: The proposed framework provides a principled approach to understanding and exploiting systematic discrepancies created by safety alignment, leading to more effective jailbreaking methods with better performance on hardened models.

Abstract: Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model’s aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower “Jailbreak Tax” compared with existing methods, especially on the safety-hardened gpt-oss-120b.

[36] Formal Semantic Control over Language Models

Yingji Zhang

Main category: cs.CL

TL;DR: This thesis focuses on making language model representations more interpretable and controllable through deliberate shaping of latent space geometry, using VAE frameworks for both sentence-level and reasoning-level semantic manipulation.

Details

Motivation: To advance semantic representation learning to make language models more interpretable and controllable, enabling localized, quasi-symbolic, compositional control through deliberate shaping of latent space geometry.

Method: Uses VAE framework with two complementary approaches: (1) Sentence-level learning and control - disentangling and manipulating semantic features in latent space to guide sentence generation using explanatory text; (2) Reasoning-level learning and control - isolating and steering inference behaviors in latent space to control Natural Language Inference, focusing on Explanatory NLI tasks.

Result: Introduces novel theoretical frameworks and practical methodologies that demonstrate enhanced interpretability and controllability of latent spaces for natural language across the thesis.

Conclusion: The work moves toward language models with systematically interpretable, precisely structured, and reliably directed internal semantic representations through geometric manipulation of latent spaces.

Abstract: This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.

[37] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

Haitao Li, Yifan Chen, Shuo Miao, Qian Dong, Jia Chen, Yiran Hu, Junjie Chen, Minghao Qin, Qingyao Ai, Yiqun Liu, Cheng Luo, Quan Zhou, Ya Zhang, Jikun Hu

Main category: cs.CL

TL;DR: LegalOne: A family of Chinese legal domain foundational models developed through a three-phase pipeline (mid-training with PAS, fine-tuning with LEAD, and curriculum RL) that achieves SOTA performance on legal tasks by mastering legal reasoning.

Details

Motivation: General LLMs lack precise legal domain knowledge and struggle with multi-step judicial reasoning needed for legal applications, creating a gap for domain-specific models.

Method: Three-phase pipeline: 1) Mid-training with Plasticity-Adjusted Sampling (PAS) for domain adaptation, 2) Supervised fine-tuning with Legal Agentic CoT Distillation (LEAD) to extract reasoning from legal texts, 3) Curriculum Reinforcement Learning (RL) for progressive reasoning development.

Result: LegalOne achieves state-of-the-art performance across various legal tasks, surpassing larger general-purpose LLMs through enhanced knowledge density and efficiency.

Conclusion: LegalOne provides a comprehensive solution for legal AI with trustworthy and interpretable foundation models suitable for high-stakes judicial applications.

Abstract: While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.

[38] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation

Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju

Main category: cs.CL

TL;DR: Instruction-tuned Small Language Models (SLMs) evaluated for multi-turn customer-service QA using history summarization, showing some can approach LLM performance but struggle with dialogue continuity.

Details

Motivation: Large Language Models (LLMs) have high computational costs and deployment constraints, making them impractical for resource-constrained environments. Small Language Models (SLMs) offer efficiency but their effectiveness for multi-turn customer-service QA with dialogue continuity requirements remains underexplored.

Method: Used instruction-tuned SLMs with history summarization strategy to preserve conversational state. Evaluated nine low-parameterized SLMs against three commercial LLMs using lexical/semantic similarity metrics, human evaluation, and LLM-as-a-judge methods. Introduced conversation stage-based qualitative analysis.

Result: Results show notable variation across SLMs - some demonstrate near-LLM performance, while others struggle with dialogue continuity and contextual alignment. The study highlights both potential and limitations of low-parameterized models for real-world customer-service QA.

Conclusion: SLMs show promise for efficient customer-service QA but current limitations in maintaining dialogue continuity and contextual understanding need to be addressed for practical deployment in resource-constrained environments.

Abstract: Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.

[39] EchoReview: Learning Peer Review from the Echoes of Scientific Citations

Yinuo Zhang, Dingcheng Huang, Haifeng Suo, Yizhuo Li, Ziya Zhao, Junhao Xu, Zhiying Tu, Dianhui Chu, Deming Zhai, Xianming Liu, Xiaoyan Yu, Dianbo Sui

Main category: cs.CL

TL;DR: EchoReview is a framework that synthesizes review data from academic citations to train automated peer reviewers, addressing limitations of human review data.

Details

Motivation: Traditional peer review systems face scalability issues with growing scientific submissions. Existing supervised approaches are limited by single-source human review data that suffers from subjectivity and inconsistency.

Method: Proposes EchoReview, a citation-context-driven data synthesis framework that mines implicit evaluative signals from academic citations and transforms community judgments into structured review-style data. Creates EchoReview-16K dataset and trains EchoReviewer-7B model.

Result: EchoReviewer-7B achieves significant improvements on core review dimensions like evidence support and comprehensiveness, validating citation context as an effective data paradigm for automated peer review.

Conclusion: Citation context provides a robust alternative to human review data for training reliable automated peer review systems, addressing scalability and consistency issues.

Abstract: As the volume of scientific submissions continues to grow rapidly, traditional peer review systems are facing unprecedented scalability pressures, highlighting the urgent need for automated reviewing methods that are both scalable and reliable. Existing supervised fine-tuning approaches based on real review data are fundamentally constrained by single-source of data as well as the inherent subjectivity and inconsistency of human reviews, limiting their ability to support high-quality automated reviewers. To address these issues, we propose EchoReview, a citation-context-driven data synthesis framework that systematically mines implicit collective evaluative signals from academic citations and transforms scientific community’s long-term judgments into structured review-style data. Based on this pipeline, we construct EchoReview-16K, the first large-scale, cross-conference, and cross-year citation-driven review dataset, and train an automated reviewer, EchoReviewer-7B. Experimental results demonstrate that EchoReviewer-7B can achieve significant and stable improvements on core review dimensions such as evidence support and review comprehensiveness, validating citation context as a robust and effective data paradigm for reliable automated peer review.

[40] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement

Ziyan Xiao, Yinghao Zhu, Liang Peng, Lequan Yu

Main category: cs.CL

TL;DR: ExperienceWeaver: A hierarchical framework for clinical text improvement that distills noisy feedback into structured knowledge (Tips and Strategies) to enable LLMs to learn reasoning behind revisions in small-sample settings.

Details

Motivation: Clinical text improvement is crucial for healthcare efficiency but challenging due to limited high-quality data and complex medical documentation constraints. Current LLM approaches struggle in small-sample settings: supervised fine-tuning is data-intensive, and retrieval-augmented generation provides superficial corrections without capturing revision reasoning.

Method: ExperienceWeaver shifts focus from data retrieval to experience learning by distilling noisy, multi-dimensional feedback into structured knowledge: error-specific Tips and high-level Strategies. This distilled experience is injected into an agentic pipeline, enabling models to learn “how to revise” rather than just “what to revise”.

Result: Extensive evaluations across four clinical datasets show ExperienceWeaver consistently improves performance, surpassing state-of-the-art models like Gemini-3 Pro in small-sample settings.

Conclusion: ExperienceWeaver addresses limitations of current LLM approaches for clinical text improvement by focusing on experience distillation rather than data retrieval, enabling better performance in data-scarce medical contexts.

Abstract: Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge. Specifically, error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns “how to revise” rather than just “what to revise”. Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.

[41] CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

Liang Wang, Xinyi Mou, Xiaoyou Liu, Xuanjing Huang, Zhongyu Wei

Main category: cs.CL

TL;DR: CURP is a framework for efficient user modeling in LLMs using bidirectional encoders and discrete prototype codebooks for plug-and-play personalization with minimal parameters.

Details

Motivation: Existing user modeling methods for LLMs struggle to balance personalization quality with computational and data efficiency, creating a need for more efficient approaches.

Method: Uses a bidirectional user encoder and discrete prototype codebook to extract multi-dimensional user traits, enabling plug-and-play personalization with only ~20M parameters (0.2% of total model size).

Result: Achieves superior performance and generalization on variant generation tasks compared to strong baselines, with better interpretability and scalability.

Conclusion: CURP provides an efficient solution for user modeling in LLMs that balances personalization quality with computational efficiency through its novel architecture.

Abstract: User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt-based or training-based methods, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code are available at https://github.com/RaidonWong/CURP_code

[42] Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Main category: cs.CL

TL;DR: DeMix is a framework that uses model merging to predict optimal data mixtures for LLM pre-training, decoupling search from training costs.

Details

Motivation: Finding optimal data mixtures for LLM pre-training is challenging due to unreliable proxy experiments or prohibitively expensive large-scale exploration.

Method: Trains component models on candidate datasets at scale, then derives data mixture proxies via weighted model merging instead of training proxy models for each mixture.

Result: DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining optimal mixtures with higher benchmark performance at lower search cost.

Conclusion: DeMix enables evaluation of unlimited sampled mixtures without extra training burden, facilitating better mixture discovery through more search trials.

Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora is available at https://github.com/Lucius-lsr/DeMix.

[43] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting

Ali El Lahib, Ying-Jieh Xia, Zehan Li, Yuxuan Wang, Xinyu Pi

Main category: cs.CL

TL;DR: Date filters in search engines are unreliable for temporal evaluation of forecasters due to post-cutoff information leakage, leading to inflated accuracy metrics.

Details

Motivation: To investigate the reliability of search-engine date filters for retrospective evaluation of search-augmented forecasting systems, which is crucial for credible temporal assessment.

Method: Audited Google Search with before: filters, analyzed leakage mechanisms, and tested forecasting accuracy using LLM (gpt-oss-120b) with both leaky and leak-free documents.

Result: 71% of questions returned pages with post-cutoff leakage, 41% directly revealed answers. LLM forecasting with leaky documents showed inflated accuracy (Brier score 0.108 vs 0.242 with clean docs).

Conclusion: Date-restricted search is insufficient for temporal evaluation; stronger retrieval safeguards or evaluation on frozen web snapshots are needed for credible forecasting assessment.

Abstract: Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.

[44] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen

Main category: cs.CL

TL;DR: A²D: Adaptive Ability Decomposing method that enhances RLVR by training a decomposer to break complex questions into simpler sub-questions, then using these sub-questions to guide the reasoner’s training.

Details

Motivation: RLVR (Reinforcement Learning with Verifiable Rewards) has limited information during training, leading to blind exploration and failure on challenging problems. Need to provide additional guidance without relying on teacher models.

Method: Two-stage approach: 1) Train a decomposer via RLVR to break complex questions into simpler sub-questions, 2) Use decomposer to annotate training data with sub-questions, then train reasoner under RLVR with sub-question guidance.

Result: Method outperforms competitive baselines, works as plug-and-play module for different RLVR algorithms, and analysis reveals how RLVR affects decomposer performance and what guidance types enhance reasoner exploration/exploitation.

Conclusion: A²D effectively enhances RLVR by providing structured guidance through question decomposition, improving reasoning ability without requiring teacher models.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner’s exploration and exploitation abilities.

[45] APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards

Kaiyan Chang, Chenwei Zhu, Yingfeng Luo, Yifu Huo, Chenglong Wang, Xiaoqian Liu, Qiaozhi He, Tong Xiao, Zhengtao Yu, Jingbo Zhu

Main category: cs.CL

TL;DR: The paper introduces Anchor-based Process Reward (APR), a method that identifies and penalizes redundant post-answer reasoning in Large Reasoning Models to improve efficiency without sacrificing performance.

Details

Motivation: Test-Time Scaling improves Large Reasoning Models but causes Overthinking - models continue redundant self-verification after already obtaining correct answers, wasting computational resources.

Method: APR identifies the Reasoning Anchor (where answer first stabilizes) and penalizes only the Answer-Stable Tail (redundant verification after anchor) using policy optimization with length penalties.

Result: APR achieves performance-efficiency Pareto frontier on 1.5B and 7B models across five mathematical reasoning datasets with significantly reduced RL training resources.

Conclusion: Targeted penalization of structural redundancy in reasoning processes enables more efficient Large Reasoning Models without compromising accuracy.

Abstract: Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging the policy optimization algorithm suitable for length penalties, our APR models achieved the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring significantly fewer computational resources for RL training.

[46] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs

Yuheng Shao, Junjie Xiong, Chaoran Wu, Xiyuan Wang, Ziyu Zhou, Yang Ouyang, Qinyi Tao, Quan Li

Main category: cs.CL

TL;DR: WordCraft is an interactive tool using Multimodal Large Language Models to help Chinese learners of English apply the keyword method for vocabulary memorization through guided keyword selection, association construction, and image formation.

Details

Motivation: Chinese learners of English struggle with the keyword method for vocabulary memorization due to difficulties generating phonologically appropriate keywords, constructing coherent associations, and creating vivid mental imagery. Existing automated approaches either reduce learner engagement or lack process-oriented guidance.

Method: Conducted formative study with 18 Chinese learners and educators to identify difficulties, then developed WordCraft - a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs) that scaffolds the keyword method through guided keyword selection, association construction, and image formation.

Result: Two user studies demonstrated that WordCraft preserves the generation effect while achieving high levels of effectiveness and usability for vocabulary memorization.

Conclusion: WordCraft successfully addresses limitations of existing approaches by providing process-oriented guidance while maintaining learner engagement, making the keyword method more accessible and effective for Chinese learners of English vocabulary.

Abstract: Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.

[47] Eliciting Trustworthiness Priors of Large Language Models via Economic Games

Siyu Yan, Lusha Zhu, Jian-Qiao Zhu

Main category: cs.CL

TL;DR: LLMs’ trustworthiness priors are elicited using iterated in-context learning with Trust Game, showing GPT-4.1 aligns with human priors and responds to player personas based on warmth/competence stereotypes.

Details

Motivation: To characterize trust in AI systems and maintain calibrated trust (avoiding overtrust/undertrust), the paper aims to measure trustworthiness priors in LLMs using behavioral game theory rather than self-reported attitudes.

Method: Proposes novel elicitation method using iterated in-context learning applied to Trust Game from behavioral game theory. Elicits trustworthiness priors from leading LLMs, examines GPT-4.1’s responses to different player personas, and predicts variation using stereotype-based model grounded in perceived warmth and competence.

Result: GPT-4.1’s trustworthiness priors closely track human priors. GPT-4.1 differentiates trust across agent characteristics, and variation in elicited trustworthiness can be predicted by warmth/competence stereotype model.

Conclusion: LLMs can exhibit human-like trustworthiness priors, and trust differentiation follows social psychology stereotypes. This provides a foundation for understanding and calibrating trust in AI systems.

Abstract: One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1’s trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.

[48] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models

Siyuan Zhang, Jialian Li, Yichi Zhang, Xiao Yang, Yinpeng Dong, Hang Su

Main category: cs.CL

TL;DR: This paper investigates how large language models develop reasoning capabilities during training by analyzing internal representation dynamics rather than just output generation.

Details

Motivation: Prior work treats reasoning as a black box, focusing only on output generation. The authors want to understand the internal representational changes that occur during reasoning tasks and how training affects these internal dynamics.

Method: The authors introduce a representational perspective to analyze internal states across models at different training stages. They conduct comprehensive experiments including comparative analysis, statistical correlation studies, and counterfactual experiments to examine the relationship between internal representations and external outputs.

Result: Post-training yields limited improvement in static initial representation quality. Reasoning involves significant continuous distributional shifts in representations during generation. Post-training helps models drive these transitions toward better distributions for task solving. Generation correctness correlates highly with final representations, and token semantics (not computation or parameter differences) drive the representational transitions.

Conclusion: The paper provides novel insights into reasoning processes and training effects on reasoning enhancement, offering valuable perspectives for future model analysis and optimization through internal representational analysis.

Abstract: Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model’s internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations; while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.

[49] HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference

Xuan Ai, Qingqing Yang, Peng Wang, Lei Deng, Lin Zhang, Renhai Chen, Gong Zhang

Main category: cs.CL

TL;DR: HyLRA is a hybrid layer reuse attention framework that optimizes long-context LLM inference by profiling layer-wise sparsity patterns to balance efficiency and accuracy.

Details

Motivation: Long-context inference in LLMs suffers from quadratic attention complexity and large KV cache memory footprint. Existing sparse attention methods use rigid patterns or aggressive pruning, failing to achieve optimal efficiency-accuracy balance.

Method: HyLRA uses layer-wise sparsity profiling to identify intra-layer sensitivity (some layers need full attention) and inter-layer similarity (consecutive layers share critical tokens). It employs offline dynamic programming to derive optimal layer-wise policies: sensitive layers use full attention, while tolerant layers reuse top-k indices from preceding layers to bypass quadratic calculations.

Result: HyLRA improves inference throughput by 6%-46% while maintaining comparable performance (<1% accuracy degradation), consistently outperforming state-of-the-art sparse attention methods.

Conclusion: HyLRA effectively overcomes the quadratic bottleneck of dense attention by restricting computation to the most critical tokens through layer-wise optimization, achieving better efficiency-accuracy trade-offs than existing sparse attention approaches.

Abstract: Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce {\bf HyLRA} ({\bf Hy}brid {\bf L}ayer {\bf R}euse {\bf A}ttention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: \textit{intra-layer sensitivity}, where specific layers necessitate full attention to prevent feature distortion, and \textit{inter-layer similarity}, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-$k$ indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6%–46% while maintaining comparable performance (with $<1%$ accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at \href{https://anonymous.4open.science/r/unified-cache-management-CF80/}{\texttt{/r/unified-cache-management-CF80/}}

[50] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

Zicheng Kong, Dehua Ma, Zhenbo Xu, Alven Yang, Yiwei Ru, Haoran Wang, Zixuan Zhou, Fuqing Bie, Liuyu Xiang, Huijia Wu, Jian Zhao, Zhaofeng He

Main category: cs.CL

TL;DR: Omni-RRM is the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with justifications across text, image, video, and audio modalities.

Details

Motivation: Existing reward models for multimodal LLMs are vision-centric, return opaque scalar scores, rely on costly human annotations, and lack structured reasoning capabilities across multiple modalities.

Method: Two-stage approach: 1) Create Omni-Preference dataset via automated pipeline synthesizing response pairs from models of different capabilities, using teacher models to reconcile preferences with rubric-grounded rationales; 2) Train Omni-RRM via supervised fine-tuning followed by reinforcement learning (GRPO) to sharpen discrimination.

Result: Achieves SOTA accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, with 17.7% absolute gain over base model on image tasks, and improves downstream performance via Best-of-N selection.

Conclusion: Omni-RRM provides effective, automated reward modeling across modalities without human-labeled preferences, enabling better alignment for multimodal LLMs through structured, rubric-grounded judgments.

Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce \textbf{Omni-RRM}, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across \textbf{text, image, video, and audio}. At the core of our approach is \textbf{Omni-Preference}, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to \emph{reconcile and filter} preferences while providing a modality-aware \emph{rubric-grounded rationale} for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-$N$ selection and transfers to text-only preference benchmarks. Our data, code, and models are available at https://anonymous.4open.science/r/Omni-RRM-CC08.

[51] Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation

Ziwei Gong, Yanda Chen, Julia Hirschberg, Chen Zhao, He He, Zhou Yu, Kathleen Mckeown

Main category: cs.CL

TL;DR: FCG framework allows users to specify factuality constraints for LLM responses, balancing informativeness vs. accuracy, with synthetic training improving both constraint adherence and informativeness.

Details

Motivation: LLMs encode knowledge with varying confidence and face trade-offs between generating informative but potentially inaccurate responses vs. less informative but more factual ones. Different applications require different balances between informativeness and factuality.

Method: Introduces Factuality-Controlled Generation (FCG) framework where users specify factuality constraints alongside queries. Uses synthetic data to train models on the FCG task, evaluating performance on adherence to factuality constraints and response informativeness.

Result: Synthetic training significantly improves models’ ability to respect factuality requirements while maintaining informativeness in their outputs.

Conclusion: FCG provides a practical framework for controlling the factuality-informativeness trade-off in LLM responses, with synthetic training being an effective approach for improving performance on this task.

Abstract: Large language models (LLMs) encode knowledge with varying degrees of confidence. When responding to queries, models face an inherent trade-off: they can generate responses that are less informative but highly factual, or more informative but potentially less accurate. Different applications demand different balances between informativeness and factuality. We introduce Factuality-Controlled Generation (FCG), a framework that enables users to specify factuality constraints alongside their queries. We propose to evaluate FCG performance on two dimensions: adherence to factuality constraints and response informativeness. We propose to train models on the FCG task using synthetic data, and show that our synthetic training significantly improves models’ ability to both respect factuality requirements and maintain informativeness in their outputs.

[52] Unifying Adversarial Robustness and Training Across Text Scoring Models

Manveer Singh Tamber, Hosna Oyarhoseini, Jimmy Lin

Main category: cs.CL

TL;DR: Unified framework for studying adversarial robustness in text scoring models (dense retrievers, rerankers, reward models) with new adversarial training methods that improve robustness and task effectiveness, including applications to RLHF alignment.

Details

Motivation: Current research on adversarial robustness in language models is fragmented across different applications and attacks, which obscures shared vulnerabilities. The authors aim to unify the study of adversarial robustness specifically for text scoring models to better understand and address these vulnerabilities.

Method: Propose a unified framework for studying adversarial robustness in text scoring models (dense retrievers, rerankers, reward models). Adapt attacks and adversarial training methods across different model roles. Introduce multiple adversarial training methods for text scoring models, and show that combining complementary training methods yields strong robustness.

Result: Demonstrate that current adversarial training formulations for language models are often short-sighted and fail to generalize across attacks. Show that the proposed adversarial training methods improve both robustness and task effectiveness. Highlight practical value for RLHF by showing that adversarially trained reward models mitigate reward hacking and support training of better-aligned LLMs.

Conclusion: A unified approach to studying adversarial robustness in text scoring models reveals shared vulnerabilities and enables more effective adversarial training methods that improve both robustness and task performance, with practical applications in RLHF alignment.

Abstract: Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.

[53] ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople

Shounak Paul, Raghav Dogra, Pawan Goyal, Saptarshi Ghosh

Main category: cs.CL

TL;DR: ILSIC: A new corpus for Legal Statute Identification comparing laypeople queries vs court judgments in Indian law, showing models trained on court data perform poorly on laypeople queries.

Details

Motivation: Legal Statute Identification (LSI) traditionally uses court judgments as input, but real-world queries come from laypeople. There's limited research comparing court vs laypeople data for LSI, especially for Indian law.

Method: Created ILSIC corpus with 500+ Indian statutes covering both laypeople queries and court judgments. Conducted experiments with zero/few-shot inference, retrieval-augmented generation, and supervised fine-tuning to compare performance.

Result: Models trained only on court judgments perform poorly on laypeople queries. Transfer learning from court to laypeople data can help in some scenarios. Performance varies by query category and statute frequency.

Conclusion: Court-trained models don’t generalize well to laypeople queries. The ILSIC corpus enables better LSI research for real-world applications. Transfer learning shows promise but needs careful implementation.

Abstract: Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research to explore the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. Additionally, the corpus also contains court case judgements to enable researchers to effectively compare between court and laypeople data for LSI. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero and few-shot inference, retrieval-augmented generation and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective during test on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.

[54] EffGen: Enabling Small Language Models as Capable Autonomous Agents

Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang

Main category: cs.CL

TL;DR: effGen is an open-source agentic framework optimized for small language models (SLMs) that enables local deployment with enhanced tool-calling, intelligent task decomposition, complexity-based routing, and unified memory system.

Details

Motivation: Existing language model agentic systems are built for large language models via API calls, which face limitations including high token costs and privacy concerns for sensitive applications.

Method: effGen introduces four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80%, (2) Intelligent task decomposition that breaks complex queries into subtasks, (3) Complexity-based routing using five factors for pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage.

Result: Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Prompt optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B).

Conclusion: effGen provides an effective, efficient, and secure local deployment solution for SLMs, with complementary scaling behavior between prompt optimization and complexity routing, ensuring consistent gains across all model scales.

Abstract: Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (https://effgen.org/) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at https://github.com/ctrl-gaurav/effGen.

[55] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts

Víctor Yeste, Paolo Rosso

Main category: cs.CL

TL;DR: Study examines whether Schwartz higher-order categories improve sentence-level human value detection under compute constraints, finding hierarchical enforcement hurts performance while threshold tuning and ensembling provide gains.

Details

Motivation: To determine if Schwartz higher-order categories provide usable structure for sentence-level human value detection, particularly under strict compute-frugal budgets (single 8GB GPU).

Method: Compare three approaches: (1) direct supervised transformers, (2) HO→values pipelines with hard hierarchical masks, (3) Presence→HO→values cascades. Also test low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small LLM baselines (≤10B), QLoRA, and simple ensembles on ValueEval'24/ValuesML dataset (74K English sentences).

Result: HO categories are learnable from single sentences, but hard hierarchical gating reduces end-task Macro-F₁ via error compounding and recall suppression. Label-wise threshold tuning provides substantial gains (+0.05 Macro-F₁), and small transformer ensembles offer consistent additional improvements (+0.02 Macro-F₁). Small LLMs lag behind supervised encoders but contribute complementary errors in cross-family ensembles.

Conclusion: HO structure is useful descriptively but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration (threshold tuning) and lightweight ensembling rather than hierarchical constraints.

Abstract: Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO$\rightarrow$values pipelines that enforce the hierarchy with hard masks, and (iii) Presence$\rightarrow$HO$\rightarrow$values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines ($\le$10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-$F_1\approx0.58$), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-$F_1$ via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to $+0.05$ Macro-$F_1$), and small transformer ensembles provide the most consistent additional gains (up to $+0.02$ Macro-$F_1$). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.

[56] Neural FOXP2 – Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

Main category: cs.CL

TL;DR: Neural FOXP2: A method to steer language neurons in LLMs to make non-English languages primary by identifying and manipulating sparse, low-rank control circuits for language defaultness.

Details

Motivation: LLMs are multilingual but default to English due to English dominance in pretraining; other languages remain suppressed in parametric memory. The paper aims to address this language bias by identifying and controlling language-specific neural circuits.

Method: Three-stage approach: (1) Localize language neurons via sparse autoencoders to identify features selective for target languages; (2) Find steering directions via spectral low-rank analysis of activation differences; (3) Apply sparse activation shifts to language neurons to steer model toward target language defaultness.

Result: The method successfully makes Hindi or Spanish primary in LLMs by steering language-specific neurons, demonstrating controllable language defaultness without retraining.

Conclusion: Language defaultness in LLMs is governed by sparse, low-rank control circuits that can be mechanistically isolated and safely steered, enabling targeted language preference adjustments.

Abstract: LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, that makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English to target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

[57] Verification Required: The Impact of Information Credibility on AI Persuasion

Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein

Main category: cs.CL

TL;DR: LLM agents playing MixTalk game to study strategic communication with probabilistic credibility, showing limitations in reasoning about information credibility, with proposed TOPD method improving receiver robustness.

Details

Motivation: LLM agents are increasingly used in high-stakes communication settings, but prior work only studied either unverifiable cheap-talk or fully verifiable disclosure, missing realistic domains where information has probabilistic credibility.

Method: Introduced MixTalk game where sender combines verifiable/unverifiable claims, receiver allocates budget for costly verification. Evaluated state-of-the-art LLM agents in large-scale tournaments across three deployment settings. Proposed Tournament Oracle Policy Distillation (TOPD) method that distills tournament oracle policy from interaction logs for in-context deployment.

Result: Evaluation revealed LLM agents’ strengths and limitations in reasoning about information credibility and the explicit behavior shaping these interactions. TOPD significantly improved receiver robustness to persuasion.

Conclusion: MixTalk provides a framework for studying strategic communication with probabilistic credibility, revealing LLM agents’ capabilities and limitations, with TOPD offering an effective method to improve receiver robustness in such settings.

Abstract: Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.

[58] Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals

Pengyue Yang, Jiawen Wen, Haolin Jin, Linghan Huang, Huaming Chen, Ling Chen

Main category: cs.CL

TL;DR: Structural Confidence: A single-pass, model-agnostic framework for LLM confidence estimation using multi-scale structural signals from hidden-state trajectories, outperforming traditional methods across diverse benchmarks.

Details

Motivation: Standard confidence estimators for LLMs (token likelihood, semantic similarity, consistency) are brittle under distribution shift, domain-specialized text, and compute limits, which is problematic for high-stakes applications where errors carry significant costs.

Method: Proposes Structural Confidence framework that extracts multi-scale structural signals from a model’s final-layer hidden-state trajectory using spectral, local-variation, and global shape descriptors to capture internal stability patterns missed by probabilities and embeddings.

Result: Demonstrates strong performance across four heterogeneous benchmarks (FEVER, SciFact, WikiBio-hallucination, TruthfulQA) in terms of AUROC and AUPR, outperforming established baselines while using only a single deterministic forward pass.

Conclusion: Structural Confidence offers a practical, efficient, and robust post-hoc confidence estimation method for socially impactful, resource-constrained LLM applications, avoiding the computational overhead of sampling-based consistency methods.

Abstract: Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. Yet standard confidence estimators, such as token likelihood, semantic similarity and multi-sample consistency, remain brittle under distribution shift, domain-specialised text, and compute limits. In this work, we present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction based on multi-scale structural signals derived from a model’s final-layer hidden-state trajectory. By combining spectral, local-variation, and global shape descriptors, our method captures internal stability patterns that are missed by probabilities and sentence embeddings. We conduct extensive, cross-domain evaluation across four heterogeneous benchmarks-FEVER (fact verification), SciFact (scientific claims), WikiBio-hallucination (biographical consistency), and TruthfulQA (truthfulness-oriented QA). Our Structural Confidence framework demonstrates strong performance compared with established baselines in terms of AUROC and AUPR. More importantly, unlike sampling-based consistency methods which require multiple stochastic generations and an auxiliary model, our approach uses a single deterministic forward pass, offering a practical basis for efficient, robust post-hoc confidence estimation in socially impactful, resource-constrained LLM applications.

[59] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA

Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang, Honghui Xu, Nikil Dutt, Amir Rahmani

Main category: cs.CL

TL;DR: MedSpeak: A knowledge graph-aided ASR error correction framework that improves medical spoken question-answering by leveraging semantic relationships and phonetic information from medical knowledge graphs combined with LLM reasoning.

Details

Motivation: Spoken question-answering systems relying on ASR struggle with accurately recognizing medical terminology, which is crucial for medical applications where accuracy is paramount.

Method: Proposes MedSpeak framework that uses a medical knowledge graph to encode both semantic relationships and phonetic information, combined with LLM reasoning to correct ASR errors in noisy transcripts and improve downstream answer prediction.

Result: Comprehensive experiments show MedSpeak significantly improves medical term recognition accuracy and overall medical SQA performance, establishing it as state-of-the-art for medical SQA.

Conclusion: MedSpeak effectively addresses ASR limitations in medical contexts by leveraging structured medical knowledge and LLM capabilities, providing a robust solution for medical spoken question-answering systems.

Abstract: Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at https://github.com/RainieLLM/MedSpeak.

[60] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Batuhan K. Karaman, Aditya Rawal, Suhaila Shakiah, Mohammad Ghavamzadeh, Mingyi Hong, Arijit Biswas, Ruida Zhou

Main category: cs.CL

TL;DR: DISPO: A REINFORCE-style RL algorithm for LLMs that decouples importance sampling weight clipping for correct/incorrect responses, enabling controllable policy updates to balance exploration-distillation while preventing catastrophic failures in mathematical reasoning tasks.

Details

Motivation: Current RL approaches for enhancing LLM reasoning in mathematics face a trade-off: PPO-style methods (GRPO/DAPO) offer stability but slow learning, while REINFORCE-style methods (CISPO) are efficient but unstable due to improper importance sampling weight clipping that still allows non-zero gradients outside trust regions.

Method: DISPO introduces a REINFORCE-style algorithm that decouples up-clipping and down-clipping of importance sampling weights for correct and incorrect responses separately, creating four controllable policy update regimes. This allows independent tuning of how the algorithm handles exploration vs. distillation for correct responses and prevents catastrophic failures for incorrect responses.

Result: DISPO achieves 61.04% on AIME'24 benchmark (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models. The method maintains exploration-distillation balance while preventing the performance collapse seen in other approaches.

Conclusion: DISPO successfully addresses the stability-efficiency trade-off in RL for LLM reasoning by providing fine-grained control over policy updates through decoupled clipping regimes, enabling both efficient learning and stable performance in mathematical reasoning tasks.

Abstract: Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation) – both beneficial but causing gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.

[61] Sparse Reward Subsystem in Large Language Models

Guowei Xu, Mert Yuksekgonul, James Zou

Main category: cs.CL

TL;DR: LLMs contain a sparse reward subsystem with value neurons representing internal state expectations and dopamine neurons encoding reward prediction errors, analogous to biological reward systems.

Details

Motivation: To investigate whether LLMs develop internal reward mechanisms similar to biological brains, specifically looking for neural correlates of value estimation and reward prediction error encoding within model hidden states.

Method: Analyzed hidden states of LLMs to identify sparse reward subsystems, conducted intervention experiments on value neurons, examined robustness across datasets/model scales/architectures, and identified dopamine neurons encoding reward prediction errors through divergence analysis between predicted and actual rewards.

Result: Found robust value neurons representing internal state value expectations that are crucial for reasoning, with significant transferability across datasets and models. Identified dopamine neurons that encode reward prediction errors with high activation for positive surprises and low activation for negative surprises.

Conclusion: LLMs develop internal reward subsystems analogous to biological brains, with specialized neurons for value estimation and reward prediction error encoding, suggesting fundamental similarities in reward processing across artificial and biological systems.

Abstract: In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model’s internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.

[62] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

Abhijit Chakraborty, Ashish Raj Shekhar, Shiven Agarwal, Vivek Gupta

Main category: cs.CL

TL;DR: DeALOG is a decentralized multi-agent framework for multimodal QA that uses specialized agents communicating through a shared natural-language log for collaborative error detection and verification.

Details

Motivation: Complex question answering requires integrating information from diverse sources (text, tables, images), necessitating a framework that supports specialized processing with coordination and interpretability.

Method: Uses specialized agents (Table, Context, Visual, Summarizing, Verification) that communicate through a shared natural-language log as persistent memory, enabling decentralized collaboration and verification.

Result: Competitive performance on multiple benchmarks: FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA. Analysis confirms importance of shared log, agent specialization, and verification.

Conclusion: DeALOG provides a scalable approach through modular components using natural-language communication, enabling robust multimodal QA without central control.

Abstract: Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents: Table, Context, Visual, Summarizing and Verification, that communicate through a shared natural-language log as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG, provides a scalable approach through modular components using natural-language communication.

[63] Reliable Use of Lemmas via Eligibility Reasoning and Section$-$Aware Reinforcement Learning

Zhikun Xu, Xiaodong Yu, Ben Zhou, Jiang Liu, Jialian Wu, Ze Wang, Ximeng Sun, Hao Chen, Zicheng Liu

Main category: cs.CL

TL;DR: RULES: A reinforcement learning framework for training LLMs to properly judge lemma applicability by decomposing the task into precondition and conclusion-utility checks with section-aware loss masking.

Details

Motivation: Current LLMs perform well on mathematical benchmarks but often misapply lemmas by importing conclusions without validating assumptions, highlighting the need for better structured reasoning about lemma applicability.

Method: Formalizes lemma-judging as structured prediction with two-section outputs (precondition check and conclusion-utility check). Uses reinforcement learning with section-aware loss masking to assign penalties to the section responsible for errors. Trained on diverse natural language and formal proof corpora.

Result: Shows consistent in-domain gains over vanilla models and single-label RL baselines, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks. Ablations confirm both two-section outputs and section-aware reinforcement are necessary for robustness.

Conclusion: The RULES framework effectively improves LLMs’ ability to properly judge lemma applicability through structured decomposition and targeted reinforcement learning, addressing a key weakness in mathematical reasoning.

Abstract: Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma$-$judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion$-$utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two$-$section output and trains with reinforcement learning plus section$-$aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held$-$out perturbation suite; and end$-$to$-$end evaluation spans competition$-$style, perturbation$-$aligned, and theorem$-$based problems across various LLMs. Results show consistent in$-$domain gains over both a vanilla model and a single$-$label RL baseline, larger improvements on applicability$-$breaking perturbations, and parity or modest gains on end$-$to$-$end tasks; ablations indicate that the two$-$section outputs and section$-$aware reinforcement are both necessary for robustness.

[64] Distilling Token-Trained Models into Byte-Level Models

Zishuo Bao, Jiaqi Leng, Junxiong Wang, Bowen Peng, Yucheng Lu

Main category: cs.CL

TL;DR: Efficient distillation method converts token-trained LLMs into Byte Language Models (BLMs) using progressive knowledge distillation and byte-level fine-tuning, achieving comparable performance with only 125B bytes of training data.

Details

Motivation: Byte Language Models (BLMs) offer advantages for scaling beyond tokenization but require expensive training from scratch on trillions of bytes. The authors aim to develop an efficient method to convert existing token-trained LLMs into BLMs without the prohibitive cost of full retraining.

Method: Two-stage curriculum: (1) Progressive Knowledge Distillation aligns byte-level representations with token-trained teacher model embeddings, (2) Byte-Level Supervised Fine-Tuning enables end-to-end generation in byte space. Applied to multiple model families including Llama, Qwen, and OLMo.

Result: Distilled BLMs retain most of teacher models’ performance using only approximately 125B bytes of training data, demonstrating efficient conversion from token-trained to byte-level models.

Conclusion: Proposed distillation recipe provides a cost-effective way to convert existing LLMs into BLMs, enabling byte-level language modeling without expensive retraining from scratch.

Abstract: Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models’ performance using only approximately 125B bytes.

[65] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Conrad Borchers, Jill-Jênn Vie, Roger Azevedo

Main category: cs.CL

TL;DR: LLMs fail to accurately simulate novice reasoning in chemistry tutoring, producing over-coherent, verbose responses that overestimate learner performance compared to human think-aloud data.

Details

Motivation: To evaluate whether LLMs can faithfully model novice reasoning and metacognitive judgments in educational contexts, moving beyond traditional accuracy-focused evaluations to capture the fragmented, imperfect nature of human learning.

Method: Used 630 think-aloud utterances from chemistry tutoring problems with student logs; compared LLM-generated reasoning to human utterances under minimal and extended contextual prompting; assessed models’ ability to predict step-level learner success.

Result: GPT-4.1 generates fluent continuations but produces systematically over-coherent, verbose, and less variable reasoning than humans; these effects intensify with richer context; consistently overestimates learner performance.

Conclusion: LLMs have epistemic limitations in simulating learning due to training on expert-like solutions lacking expressions of affect and working memory constraints; evaluation framework can guide future adaptive systems design.

Abstract: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models’ ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.

[66] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents

Bin Han, Deuksin Kwon, Jonathan Gratch

Main category: cs.CL

TL;DR: LLMs with identical personality prompts show different behavioral outcomes across conversational contexts, suggesting context-sensitive rather than fixed personality expression.

Details

Motivation: To understand how identical personality prompts in LLMs lead to different linguistic, behavioral, and emotional outcomes across various conversational settings, and whether this variation represents inconsistency or context-sensitive adaptation.

Method: Examined LLM behavior with identical personality prompts across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Analyzed how contextual cues influence personality expression and emotional tone.

Result: Contextual cues systematically influence both personality expression and emotional tone. The same traits are expressed differently depending on social and affective demands, showing context-sensitive rather than fixed personality expression.

Conclusion: LLMs exhibit context-sensitive personality expression that adapts flexibly to social interaction goals and affective conditions, similar to human behavior according to Whole Trait Theory.

Abstract: Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM-based dialogue agents: whether such variations reflect inconsistency or context-sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context-sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.

[67] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuo Yang, Chu Yuan Zhang, Jianhua Tao

Main category: cs.CL

TL;DR: Knowledge purification consolidates rationales from multiple teacher LLMs into single rationale to mitigate conflicts and improve distillation efficiency

Details

Motivation: Traditional knowledge distillation faces challenges with knowledge conflicts and high resource demands when using multiple teacher models, especially for transferring knowledge from stronger LLMs to smaller models

Method: Introduces knowledge purification concept and proposes five purification methods from various perspectives to consolidate rationales from multiple teacher LLMs into single rationale

Result: Purification methods improve distilled model performance, effectively alleviate knowledge conflicts, and router-based methods show robust generalization capabilities

Conclusion: Knowledge purification techniques optimize multi-teacher distillation and facilitate practical deployment of powerful yet lightweight models

Abstract: Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of \textbf{Knowledge Purification}, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.

[68] From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization

Chaoqun Cui, Shijing Wang, Liangbin Huang, Qingqing Gu, Zhaolong Huang, Xiao Zeng, Wenji Mao

Main category: cs.CL

TL;DR: The paper focuses on developing domain-specific translation LLMs for visual media subtitles, addressing limitations of general LLMs in vertical domains through a new Adaptive Local Preference Optimization method and a released subtitle parallel corpus.

Details

Motivation: While LLMs have improved machine translation generally, they struggle with vertical domain translations like visual media subtitles that require expressive, vivid translations. The paper aims to address this gap by creating domain-customized translation LLMs.

Method: 1) Investigated subtitle translation scenarios and literal/liberal translation domains, 2) Verified LLMs as reliable reward models and evaluators for translation, 3) Constructed and released a multidirectional subtitle parallel corpus dataset, 4) Proposed Adaptive Local Preference Optimization (ALPO) method for fine-grained preference alignment in translation.

Result: Experimental results show ALPO achieves outstanding performance in multidimensional evaluation of translation quality, demonstrating effectiveness for expressive subtitle translation.

Conclusion: The paper successfully addresses domain-specific translation needs for visual media subtitles through ALPO and a specialized dataset, showing LLMs can be effectively customized for vertical domains requiring expressive translation.

Abstract: The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical domain translations are gradually becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization. We take visual media subtitle translation as our topic and explore how to train expressive and vivid translation LLMs. We investigated the situations of subtitle translation and other domains of literal and liberal translation, verifying the reliability of LLM as reward model and evaluator for translation. Additionally, to train an expressive translation LLM, we constructed and released a multidirectional subtitle parallel corpus dataset and proposed the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.

[69] What If We Allocate Test-Time Compute Adaptively?

Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, Dean Hougen

Main category: cs.CL

TL;DR: A verifier-guided adaptive framework for reasoning that dynamically allocates computation across iterations using process reward models to guide trajectory generation and selection, outperforming uniform test-time compute scaling.

Details

Motivation: Current test-time compute scaling approaches allocate inference computation uniformly, use fixed sampling strategies, and apply verification only for reranking. This is inefficient because it doesn't adapt computation allocation based on problem difficulty or reasoning progress.

Method: Proposes an iterative trajectory generation and selection framework where for each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects reasoning tools and compute strategies with exploration parameters, then generates candidate reasoning trajectories. A process reward model (PRM) serves as unified control: step-level PRM scores guide pruning/expansion during generation within iterations, and aggregated trajectory rewards select final responses across iterations.

Result: The dynamic, PRM-guided approach consistently outperforms direct test-time scaling across datasets, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks like AIME24 and AMO-Bench. Efficiency analysis shows verification-guided allocation concentrates computation on high-utility reasoning paths.

Conclusion: Adaptive, verifier-guided computation allocation during reasoning is more effective than uniform test-time scaling, with process reward models providing effective control signals for both intra-iteration trajectory generation and inter-iteration response selection.

Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.

[70] Logic-Oriented Retriever Enhancement via Contrastive Learning

Wenxuan Zhang, Yuan-Hao Jiang, Changyong Qi, Rui Jia, Yonghe Wu

Main category: cs.CL

TL;DR: LORE enhances retrieval for LLMs by activating latent logical reasoning capacity through fine-grained contrastive learning, improving performance on knowledge-intensive tasks without external supervision.

Details

Motivation: Current retrievers for LLMs overfit to surface similarity and fail on queries requiring complex logical relations, limiting performance on knowledge-intensive tasks. The latent logical analysis capacity in model representations remains underutilized.

Method: LORE uses fine-grained contrastive learning to activate latent logical reasoning capacity in embeddings, guiding them toward evidence aligned with logical structure rather than shallow similarity. It requires no external supervision, resources, or pre-retrieval analysis and remains index-compatible.

Result: LORE consistently improves retrieval utility and downstream generation while maintaining efficiency. The approach enhances performance on knowledge-intensive tasks involving complex logical relations.

Conclusion: Activating latent logical reasoning capacity through fine-grained contrastive learning effectively improves retrieval for LLMs on complex queries without requiring external resources or supervision.

Abstract: Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external upervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at https://github.com/mazehart/Lore-RAG.

[71] Tendem: A Hybrid AI+Human Platform

Konstantin Chernyshev, Ekaterina Artemova, Viacheslav Zhukov, Maksim Nerush, Mariia Fedorova, Iryna Repik, Olga Shapovalova, Aleksey Sukhorosov, Vladimir Dobrovolskii, Natalia Mikhailova, Sergei Tilga

Main category: cs.CL

TL;DR: Tendem is a hybrid AI-human system where AI handles structured work and human experts intervene when models fail, with comprehensive quality reviews. It outperforms AI-only agents and human-only workflows in quality and speed while maintaining comparable costs to human-only execution.

Details

Motivation: The motivation is to create a hybrid system that combines the strengths of AI and human expertise to overcome limitations of purely AI or human-only approaches, aiming for higher quality outputs with faster turnaround times while controlling operational costs.

Method: Tendem uses a hybrid approach where AI handles structured, repeatable work, and human experts step in when models fail or to verify results. All results undergo comprehensive quality review before delivery. Performance was evaluated through in-house tests on 94 real-world tasks comparing against AI-only agents and human-only workflows by Upwork freelancers.

Result: Tendem consistently delivers higher-quality outputs with faster turnaround times compared to both AI-only agents and human-only workflows, while maintaining operational costs comparable to human-only execution. On third-party benchmarks, Tendem’s AI Agent performs near state-of-the-art on web browsing and tool-use tasks with strong results in frontier domain knowledge and reasoning.

Conclusion: The hybrid AI-human approach of Tendem effectively combines the strengths of both AI and human expertise, resulting in superior performance in quality and speed while controlling costs, demonstrating the value of human-in-the-loop systems for complex real-world tasks.

Abstract: Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem’s performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem’s AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.

[72] Long-range Modeling and Processing of Multimodal Event Sequences

Jichu Li, Yilun Zhong, Zhiting Li, Feng Zhou, Quyu Kong

Main category: cs.CL

TL;DR: A novel framework extending LLM-based temporal point processes to visual modality with adaptive sequence compression for multimodal event modeling

Details

Motivation: Existing TPP approaches are limited in generating rich multimodal content and reasoning about event dynamics, especially when incorporating multimodal data increases sequence length and hinders attention-based models from generating coherent long-form textual descriptions requiring long-range understanding.

Method: Extends LLM-based TPPs to visual modality, positions text generation as core capability alongside time/type prediction, uses adaptive sequence compression based on temporal similarity to reduce sequence length while preserving patterns, employs two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning.

Result: Method outperforms state-of-the-art baselines in both predictive accuracy and quality of generated textual analyses, demonstrated through extensive experiments including on challenging DanmakuTPP-QA benchmark.

Conclusion: The proposed framework successfully addresses the long-context problem in multimodal TPPs and enables effective generation of rich textual analyses alongside accurate event predictions.

Abstract: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.

[73] Don’t Judge a Book by its Cover: Testing LLMs’ Robustness Under Logical Obfuscation

Abhilekh Borah, Shubhra Ghosh, Kedar Joshi, Aditya Kumar Guru, Kripabandhu Ghosh

Main category: cs.CL

TL;DR: Logifus framework creates logically equivalent but obfuscated reasoning problems to test LLMs’ true understanding vs. surface pattern matching, showing severe performance drops across multiple reasoning tasks.

Details

Motivation: Current LLMs perform well on standard reasoning tasks but fail when problems are presented in logically equivalent but obfuscated formats, suggesting they rely on surface patterns rather than deep understanding.

Method: Introduces Logifus, a structure-preserving logical obfuscation framework, and LogiQAte benchmark with 1,108 questions across four reasoning tasks: first-order logic entailment, family-graph entailment, pattern induction, and navigation reasoning under altered formats.

Result: Obfuscation severely degrades zero-shot performance across all tasks: GPT-4o drops 47%, GPT-5 drops 27%, and reasoning model o4-mini drops 22% on average, revealing models parse questions without deep understanding.

Conclusion: LLMs lack genuine comprehension and rely on surface form patterns, highlighting the need for models that preserve meaning beyond surface structure and truly understand logical equivalence.

Abstract: Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for reasoning model, o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.

[74] Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues

Main category: cs.CL

TL;DR: Analysis of linguistic properties in fine-tuning datasets for cultural alignment of LLMs, showing model-dependent effects of semantic coherence, lexical diversity, and structural richness on cultural performance.

Details

Motivation: Address concerns about cultural misalignment in LLMs by investigating which linguistic properties of fine-tuning datasets are associated with cultural performance, and how these effects vary across different models.

Method: Compute linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets; apply PCA within each language; fine-tune three LLM families (LLaMA, Mistral, DeepSeek); evaluate on cultural benchmarks; conduct controlled subset interventions.

Result: PCA components correlate with downstream performance but associations are strongly model-dependent; lexical-oriented components are most robust across models and benchmarks; emphasizing semantic or diversity extremes is often neutral or harmful.

Conclusion: Linguistic properties of fine-tuning data significantly impact cultural alignment, but effects are model-specific; lexical properties show most consistent benefits; dataset design for cultural adaptation should consider model-specific interactions.

Abstract: The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.

[75] Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages

Nipuna Abeykoon, Ashen Weerathunga, Pubudu Wijesinghe, Parameswari Krishnamurthy

Main category: cs.CL

TL;DR: A framework using linguistic typology to improve LLM translation quality for low-resource languages without parallel data or model retraining.

Details

Motivation: LLMs trained on high-resource languages exhibit systematic biases toward dominant typological patterns, causing structural non-conformance when translating into typologically divergent low-resource languages.

Method: Two-component framework: 1) Universal Metalinguistic Framework (UMF) representing languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and 2) Computational Engine operating through linguistic disambiguation during generation and typological compliance scoring during selection.

Result: Evaluation across nine language pairs shows intervention rates strongly correlating with typological distance from English. On 341 English sentences with different morphological/syntactic phenomena: 48.16% intervention precision for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages.

Conclusion: The framework requires no parallel training data and works with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

Abstract: Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences each having different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

[76] PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues

Shahem Sultan, Shahem Fadi, Yousef Melhim, Ibrahim Alsarraj, Besher Hassan

Main category: cs.CL

TL;DR: PedagoSense: A system that detects pedagogical strategies in tutor-student dialogues and generates LLM responses aligned with recommended strategies for adaptive educational technologies.

Details

Motivation: To improve interaction quality in dialogue-based learning by bridging pedagogical theory with practical LLM-based response generation, enabling more adaptive educational technologies.

Method: Two-stage approach: 1) Binary classifier detects presence of pedagogical strategies, 2) Fine-grained classifier identifies specific strategies. Parallel system recommends strategies from dialogue context and uses LLM to generate strategy-aligned responses.

Result: High performance for pedagogical strategy detection with consistent gains from data augmentation. Fine-grained classification remains challenging in some cases.

Conclusion: PedagoSense successfully bridges pedagogical theory and LLM-based response generation, enabling more adaptive educational dialogue systems.

Abstract: This paper addresses the challenge of improving interaction quality in dialogue based learning by detecting and recommending effective pedagogical strategies in tutor student conversations. We introduce PedagoSense, a pedology grounded system that combines a two stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human annotated tutor student dialogues, augmented with additional non pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM based response generation for more adaptive educational technologies.

[77] EmoAra: Emotion-Preserving English Speech Transcription and Cross-Lingual Translation with Arabic Text-to-Speech

Besher Hassan, Ibrahim Alsarraj, Musaab Hasan, Yousef Melhim, Shahem Fadi, Shahem Sultan

Main category: cs.CL

TL;DR: EmoAra is an end-to-end pipeline for cross-lingual spoken communication that preserves emotional context, specifically designed for banking customer service applications.

Details

Motivation: The paper addresses the need to preserve emotional nuance in cross-lingual spoken communication, particularly in banking customer service where emotional context significantly impacts service quality and customer experience.

Method: The system integrates four components: CNN-based Speech Emotion Recognition, Whisper for English ASR, fine-tuned MarianMT for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis, forming an end-to-end pipeline.

Result: Achieved 94% F1-score for emotion classification, BLEU 56 and BERTScore F1 88.7% for translation, and 81% average human evaluation score for banking-domain translations.

Conclusion: EmoAra successfully demonstrates an effective pipeline for emotion-preserving cross-lingual spoken communication with strong performance metrics across all components.

Abstract: This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.

[78] Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation

Shashini Nilukshi, Deshan Sumanathilaka

Main category: cs.CL

TL;DR: A review of Visual Word Sense Disambiguation (VWSD), a multimodal approach that uses visual cues to resolve lexical ambiguity in vision-language tasks, covering developments from 2016-2025 including CLIP-based models, diffusion generation, and LLM-enhanced systems.

Details

Motivation: Traditional Word Sense Disambiguation (WSD) relies only on text and lexical resources, but VWSD addresses lexical ambiguity in vision-language tasks by incorporating visual cues to determine the correct meaning of ambiguous words with minimal text input.

Method: The review examines developments from early multimodal fusion methods to newer frameworks using contrastive models like CLIP, diffusion-based text-to-image generation, and LLM support. It covers feature-based, graph-based, and contrastive embedding techniques from 2016-2025, focusing on prompt engineering, fine-tuning, and multilingual adaptation.

Result: CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently outperform zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges remain including context limitations, model bias toward common meanings, lack of multilingual datasets, and need for better evaluation frameworks.

Conclusion: The future of VWSD lies in the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning for developing robust, context-aware, and multilingual disambiguation systems.

Abstract: This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. It focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.

[79] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li

Main category: cs.CL

TL;DR: The paper analyzes attention sink phenomena in LLMs, revealing that Vanilla Attention and Sink Attention naturally form Mixture-of-Experts mechanisms within attention layers, explaining head collapse issues, and proposes a sink-aware training algorithm with load balancing loss to improve performance.

Details

Motivation: LLMs often show disproportionate attention to first tokens (attention sink), with recent approaches like Sink Attention and Gated Attention addressing this. However, there's a lack of comprehensive analysis of relationships among these attention mechanisms and understanding of why head collapse occurs where only fixed subsets of attention heads contribute to generation.

Method: Provides theoretical and empirical evidence showing Vanilla Attention and Sink Attention naturally construct Mixture-of-Experts mechanisms within attention layers. Proposes sink-aware training algorithm with auxiliary load balancing loss designed for attention layers to mitigate head collapse.

Result: Extensive experiments show the method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention architectures.

Conclusion: The study offers new perspective on attention mechanisms and encourages further exploration of inherent MoE structure within attention layers, providing insights into attention sink phenomena and practical solutions for head collapse issues.

Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

[80] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He, Dong Li

Main category: cs.CL

TL;DR: ASTER framework uses targeted cold-start strategy with interaction-dense trajectories to prevent RL collapse in tool-integrated reasoning for LLMs, achieving SOTA math performance with 90% on AIME 2025.

Details

Motivation: Addresses the challenge of scaling Tool-Integrated Reasoning (TIR) via RL in LLMs, which suffers from "interaction collapse" where models degenerate into heavy internal reasoning with only trivial post-hoc code verification instead of sustained multi-turn tool usage.

Method: Introduces ASTER (Agentic Scaling with Tool-integrated Extended Reasoning) framework with targeted cold-start strategy prioritizing interaction-dense trajectories. Studies how cold-start SFT induces tool-using behavioral prior, how interaction density shapes exploration, and how RL interaction budget affects learning dynamics.

Result: ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models including DeepSeek-V3.2-Exp. A small expert cold-start set of just 4K interaction-dense trajectories yields strongest downstream performance.

Conclusion: Targeted cold-start strategy with interaction-dense trajectories effectively prevents RL collapse in tool-integrated reasoning, enabling superior exploration during extended RL training and establishing robust behavioral priors for sustained multi-turn tool usage.

Abstract: Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.

[81] Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling

Kai Zhang, Jiayi Liao, Chengpeng Li, Ziyuan Xie, Sihang Li, Xiang Wang

Main category: cs.CL

TL;DR: Chronos is a lightweight plug-and-play chronological reasoning scorer that models reasoning trajectories as time series to assign quality scores for weighted voting, improving LLM reasoning performance with minimal overhead.

Details

Motivation: Existing test-time scaling methods like majority voting treat all reasoning traces equally, making them vulnerable to variations in trajectory quality and localized logical failures. There's a need for a method that can intelligently assess reasoning quality.

Method: Chronos models each reasoning trajectory as a time series, learning to capture trajectory features from token probabilities. It assigns quality scores to trajectories and employs a weighted voting mechanism based on these scores.

Result: Extensive evaluations show Chronos consistently delivers substantial gains across various models with negligible computational overhead. Chronos@128 achieves 34.21% improvement over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507.

Conclusion: Chronos provides an effective lightweight solution for improving LLM reasoning performance through chronological modeling of reasoning trajectories and weighted voting, outperforming existing test-time scaling methods.

Abstract: Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods – most notably majority voting and heuristic token-level scoring – treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce \textbf{Chronos}, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.

[82] Supervised Fine-Tuning Needs to Unlock the Potential of Token Priority

Zhanming Shen, Zeyu Qin, Jiaqi Hu, Wentao Ye, Hao Chen, Xiaomeng Hu, Haokai Xu, Gang Chen, Yi R. Fung, Haobo Wang

Main category: cs.CL

TL;DR: Token Priority is proposed as a framework to bridge the granularity mismatch in supervised fine-tuning, treating it as distribution reshaping rather than simple optimization, with two regimes: Positive Priority for noise filtration and Signed Priority for toxic content unlearning.

Details

Motivation: The paper addresses the fundamental constraint in transitioning from fitting empirical data to achieving true human utility - the granularity mismatch where fine-grained autoregressive generation is supervised by coarse or uniform signals. Current approaches fail to properly align raw data with the ideal alignment manifold.

Method: The paper introduces Token Priority as a conceptual framework that formalizes Supervised Fine-Tuning (SFT) as a precise distribution reshaping process rather than simple optimization. It categorizes existing approaches into two regimes: Positive Priority (for noise filtration) and Signed Priority (for toxic modes unlearning). The method involves analyzing recent breakthroughs through this unified lens and identifying key challenges.

Result: The paper provides a unified theoretical framework for understanding SFT as distribution reshaping, categorizes existing approaches into coherent regimes, identifies current limitations, and suggests directions for future research in alignment and fine-tuning methodologies.

Conclusion: Token Priority serves as an essential bridge to address the granularity mismatch in language model fine-tuning, offering a more precise understanding of how to align raw data with human utility through distribution reshaping rather than simple optimization.

Abstract: The transition from fitting empirical data to achieving true human utility is fundamentally constrained by a granularity mismatch, where fine-grained autoregressive generation is often supervised by coarse or uniform signals. This position paper advocates Token Priority as the essential bridge, formalizing Supervised Fine-Tuning (SFT) not as simple optimization but as a precise distribution reshaping process that aligns raw data with the ideal alignment manifold. We analyze recent breakthroughs through this unified lens, categorizing them into two distinct regimes: Positive Priority for noise filtration and Signed Priority for toxic modes unlearning. We revisit existing progress and limitations, identify key challenges, and suggest directions for future research.

[83] Inferential Question Answering

Jamshid Mozafari, Hamed Zamani, Guido Zuccon, Adam Jatowt

Main category: cs.CL

TL;DR: Inferential QA is a new task requiring models to infer answers from passages that only provide clues, not direct answers. The QUIT dataset with 7,401 questions and 2.4M passages shows current QA systems struggle with inference-based reasoning.

Details

Motivation: Most QA research focuses on answer containment where answers can be directly extracted from documents, but some questions require inference - deriving answers not explicitly stated from available information. There's a need to move beyond direct extraction to reasoning from indirect evidence.

Method: Created QUIT dataset with 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Evaluated retrievers, rerankers, and LLM-based readers on this new task.

Result: Methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models, showing current QA pipelines are not ready for inference-based reasoning.

Conclusion: Inferential QA establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence, revealing significant limitations in current QA systems’ ability to perform inference-based reasoning.

Abstract: Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment-i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA – a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.

[84] Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang

Main category: cs.CL

TL;DR: DetectRouter: A framework for LLM-generated text detection that routes inputs to the most appropriate surrogate model from a diverse pool, improving detection performance through prototype-based selection.

Details

Motivation: Existing zero-shot detection methods use fixed surrogate models for all inputs, but detection performance varies significantly based on surrogate-source alignment. The authors found that while no single surrogate works best universally, a well-matched surrogate typically exists within a diverse pool for any given input.

Method: Proposes DetectRouter, a prototype-based framework with two-stage training: 1) constructs discriminative prototypes from white-box models, 2) generalizes to black-box sources by aligning geometric distances with observed detection scores. Transforms detection into a routing problem of selecting the most appropriate surrogate for each input.

Result: Experiments on EvoBench and MAGE benchmarks show consistent improvements across multiple detection criteria and model families compared to fixed surrogate approaches.

Conclusion: Detection performance depends on surrogate-source alignment, and routing inputs to appropriate surrogates from a diverse pool significantly improves LLM-generated text detection compared to using fixed surrogates.

Abstract: Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.

[85] Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin

Main category: cs.CL

TL;DR: TerminalTraj is a scalable pipeline for generating high-quality terminal trajectories for training agentic models, addressing executability and verifiability challenges through Dockerized environments and validation code.

Details

Motivation: Training agentic models for terminal-based tasks requires high-quality terminal trajectories with realistic long-horizon interactions, but scaling such data collection is challenging due to executability (needing diverse Docker environments) and verifiability (heterogeneous task outputs).

Method: Proposes TerminalTraj pipeline that: (1) filters high-quality repositories to construct Dockerized execution environments, (2) generates Docker-aligned task instances, and (3) synthesizes agent trajectories with executable validation code for verification.

Result: Curated 32K Docker images and generated 50,733 verified terminal trajectories across eight domains. Models trained on this data achieved up to 20% improvement on TerminalBench 1.0 and 10% on TerminalBench 2.0. TerminalTraj-32B reached 35.30% on TB 1.0 and 22.00% on TB 2.0.

Conclusion: TerminalTraj provides a scalable solution for generating high-quality terminal trajectory data, enabling improved training of agentic models for terminal-based tasks with demonstrated performance gains on benchmark evaluations.

Abstract: Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: \textbf{\emph{Executability}}, since each instance requires a suitable and often distinct Docker environment; and \textbf{\emph{Verifiability}}, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose \textbf{TerminalTraj}, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB~~1.0 and 10% on TB~~2.0 over their respective backbones. Notably, \textbf{TerminalTraj-32B} achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB~~1.0 and 22.00% on TB~~2.0, and demonstrates improved test-time scaling behavior. All code and data are available at https://github.com/Wusiwei0410/TerminalTraj.

[86] PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian

Jamshid Mozafari, Seyed Parsa Mousavinasab, Adam Jatowt

Main category: cs.CL

TL;DR: PARSE is the first open-domain Persian reasoning QA benchmark with 10,800 questions across Boolean, multiple-choice, and factoid formats, created via LLM-based generation and validated through human evaluation.

Details

Motivation: There's a lack of high-quality reasoning QA benchmarks for low-resource languages like Persian, despite its 130 million speakers, limiting development and evaluation of reasoning-capable QA systems in these languages.

Method: Created PARSE benchmark using controlled LLM-based generation pipeline with multi-stage filtering, annotation, and consistency checks. Questions cover diverse reasoning types, difficulty levels, and answer structures. Validated through human evaluation.

Result: Benchmarking shows Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning boosts results, especially for Persian-specialized models.

Conclusion: PARSE fills a critical gap in Persian QA research and provides a foundation for developing and evaluating reasoning-capable LLMs in low-resource settings, supporting both fair comparison and practical model adaptation.

Abstract: Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.

[87] PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length

Situo Zhang, Yifan Zhang, Zichen Zhu, Hankun Wang, Da Ma, Danyang Zhang, Lu Chen, Kai Yu

Main category: cs.CL

TL;DR: Pacer introduces dynamic draft length control for speculative decoding using a lightweight pre-verification layer to optimize inference speed.

Details

Motivation: Standard speculative decoding uses fixed draft lengths, but optimal draft length varies across decoding steps, limiting potential speed improvements.

Method: Pacer uses a trainable pre-verification layer to dynamically control draft length by pre-verifying draft tokens blockwise before sending to target model, stopping generation if verification fails.

Result: Pacer achieves up to 2.66x speedup over autoregressive decoding, outperforms standard speculative decoding, and reaches 3.09x speedup when integrated with Ouroboros.

Conclusion: Dynamic draft length control via pre-verification significantly improves speculative decoding efficiency without sacrificing accuracy.

Abstract: Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x Speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x Speedup.

[88] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language ModelsEverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Dannong Xu, Yi Bai, Tianwei Lin, Xinda Zhao, Xiaohong Li, Jiaqi An, Yunyun Han, Jian Pei, Yafeng Deng

Main category: cs.CL

TL;DR: EverMemBench: A challenging benchmark for long-term conversational memory in LLMs, featuring multi-party, multi-group conversations with temporal evolution and complex information structures.

Details

Motivation: Existing benchmarks for conversational memory focus on simple dyadic, single-topic dialogues that don't capture real-world complexity. There's a need for more realistic evaluation of long-term memory systems in LLM-based assistants.

Method: Created EverMemBench with multi-party, multi-group conversations spanning over 1 million tokens, featuring temporally evolving information, cross-topic interleaving, and role-specific personas. Includes 1,000+ QA pairs evaluating three dimensions: fine-grained recall, memory awareness, and user profile understanding.

Result: Revealed critical limitations: (1) multi-hop reasoning collapses in multi-party settings (oracle models achieve only 26%), (2) temporal reasoning remains unsolved requiring version semantics, (3) memory awareness bottlenecked by retrieval where similarity-based methods fail to bridge semantic gaps.

Conclusion: EverMemBench provides a challenging testbed for developing next-generation memory architectures, highlighting significant gaps in current LLM memory capabilities for complex real-world conversations.

Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.

[89] DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas

Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, Lingpeng Kong

Main category: cs.CL

TL;DR: DreamOn enables variable-length code infilling for diffusion language models by adding length control states, solving the fixed-length mask limitation that previously degraded performance.

Details

Motivation: Diffusion language models (DLMs) offer flexible any-order infilling but are limited by requiring fixed-length masked sequences, which hurts performance when mask size doesn't match ideal completion length.

Method: DreamOn augments diffusion process with two length control states that allow the model to autonomously expand or contract output length based on its predictions, requiring minimal modifications to training objectives and no architectural changes.

Result: Built on Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM, matching oracle performance with ground-truth length.

Conclusion: DreamOn removes a fundamental barrier to practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation in code infilling tasks.

Abstract: Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code is available at https://github.com/DreamLM/DreamOn.

[90] CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma

Main category: cs.CL

TL;DR: CRAFT is a reinforcement learning framework that improves faithful reasoning in retrieval-augmented generation for multi-hop QA by optimizing both structural correctness and semantic faithfulness through dual reward mechanisms.

Details

Motivation: Addresses three key challenges in reliable reasoning for multi-hop QA: 1) Reasoning collapse due to complex multi-hop composition and noisy retrieval, 2) Reasoning-answer inconsistency where correct answers aren't supported by evidence, and 3) Loss of format control in structured outputs.

Method: Proposes CRAFT (Calibrated Reasoning with Answer-Faithful Traces) using Group Relative Policy Optimization (GRPO) reinforcement learning framework. Employs dual reward mechanisms: deterministic rewards for structural correctness and judge-based rewards for semantic faithfulness. Supports controllable trace variants for systematic analysis.

Result: Experiments on three multi-hop QA benchmarks show CRAFT improves both answer accuracy and reasoning faithfulness across model scales. The 7B model achieves competitive performance with closed-source LLMs across multiple reasoning trace settings.

Conclusion: CRAFT effectively addresses reasoning faithfulness challenges in RAG-based multi-hop QA through a reinforcement learning approach that optimizes both structural and semantic aspects of reasoning, enabling more reliable and faithful response generation.

Abstract: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning Collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence–distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.

[91] Balancing Understanding and Generation in Discrete Diffusion Models

Yue Liu, Yuzhong Zhao, Zheyong Xie, Qixiang Ye, Jianbin Jiao, Yao Hu, Shaosheng Cao, Yunfan Liu

Main category: cs.CL

TL;DR: XDLM bridges masked and uniform-noise diffusion language models to achieve balanced performance in both semantic understanding and few-step generation quality.

Details

Motivation: Current discrete generative models have divergent capabilities: masked diffusion models excel at semantic understanding and zero-shot generalization, while uniform-noise diffusion models achieve strong few-step generation quality. Neither achieves balanced performance across both dimensions.

Method: Proposes XDLM that bridges the two paradigms via a stationary noise kernel, providing theoretical unification of both approaches and alleviating memory bottlenecks through algebraic simplification of posterior probabilities.

Result: XDLM advances the Pareto frontier between understanding capability and generation quality, surpassing UDLM by 5.4 points on zero-shot text benchmarks and outperforming MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter LLM, achieves 15.0 MBPP in just 32 steps, effectively doubling baseline performance.

Conclusion: XDLM successfully unifies masked and uniform-noise diffusion paradigms, achieving balanced performance in both understanding and generation, with superior scaling potential and practical efficiency improvements.

Abstract: In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) an alleviated memory bottleneck enabled by an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM’s superior potential for long-term scaling. Code is available at https://github.com/MzeroMiko/XDLM

[92] Context Dependence and Reliability in Autoregressive Language Models

Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen

Main category: cs.CL

TL;DR: RISE is a method for quantifying unique influence of context elements in LLMs, addressing redundancy issues in attribution scores for better interpretability.

Details

Motivation: LLMs use extensive context with redundant information, making it difficult to identify which elements actually influence outputs. Standard explanation methods struggle with redundancy and overlapping context, leading to unstable attribution scores that undermine interpretability and raise concerns about risks like prompt injection.

Method: RISE (Redundancy-Insensitive Scoring of Explanation) quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions.

Result: Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

Conclusion: RISE addresses the challenge of distinguishing essential context elements from correlated ones, providing more reliable attribution scores for LLM interpretability and risk monitoring.

Abstract: Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

[93] On the Power of (Approximate) Reward Models for Inference-Time Scaling

Youheng Zhu, Yiping Lu

Main category: cs.CL

TL;DR: Theoretical analysis showing that approximate reward models with bounded Bellman error (O(1/T)) enable exponential efficiency gains in inference-time scaling via Sequential Monte Carlo, reducing reasoning complexity from exponential to polynomial in sequence length.

Details

Motivation: Inference-time scaling using Sequential Monte Carlo (SMC) relies on reward models to evaluate reasoning trajectories, but deployed systems use approximate reward models rather than true ones. The paper aims to theoretically understand when and why approximate reward models suffice for effective inference-time scaling.

Method: Theoretical analysis identifying the Bellman error of approximate reward models as the key quantity. For reasoning processes of length T, the paper proves that if the Bellman error is bounded by O(1/T), then combining this reward model with SMC reduces computational complexity from exponential to polynomial in T.

Result: Theoretical proof that approximate reward models with bounded Bellman error enable exponential improvement in inference efficiency. Specifically, the computational complexity of reasoning is reduced from exponential in T to polynomial in T when using SMC with such reward models.

Conclusion: Approximate reward models can be sufficient for effective inference-time scaling via SMC when their Bellman error is properly bounded. This provides theoretical justification for using practical approximate reward models in deployed systems and identifies the key condition for their effectiveness.

Abstract: Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.

[94] Rethinking Selective Knowledge Distillation

Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva

Main category: cs.CL

TL;DR: Selective knowledge distillation for LLMs using student-entropy-guided position selection (SE-KD) improves efficiency and performance over dense distillation

Details

Motivation: Current selective knowledge distillation methods for LLMs lack clarity on which importance signals and selection policies are most effective, creating a need for systematic analysis and better approaches

Method: Systematically analyze selective KD along position, class, and sample axes; then introduce student-entropy-guided position selection (SE-KD) and extend it across all three axes (SE-KD 3X)

Result: SE-KD improves accuracy, downstream task adherence, and memory efficiency; SE-KD 3X reduces wall time by 70%, peak memory by 18%, and storage usage by 80% without performance loss

Conclusion: Selective distillation guided by student entropy provides efficient and effective knowledge transfer in LLMs, making offline teacher caching feasible with significant resource savings

Abstract: Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.

[95] From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis

Niansong Zhang, Sunwoo Kim, Shreesha Srinath, Zhiru Zhang

Main category: cs.CL

TL;DR: HLS remains crucial in AI-driven hardware design era, serving as practical abstraction layer for agentic optimization with faster iteration, portability, and design permutability advantages.

Details

Motivation: Address the question of whether high-level synthesis (HLS) still matters in the agentic era of AI-driven hardware design, arguing that HLS remains essential despite the rise of large language models and AI agents.

Method: Position paper approach: 1) Explain HLS as practical abstraction layer and golden reference for agentic hardware design, 2) Identify key limitations of current HLS tools that agents can address, 3) Propose taxonomy for symbiotic evolution of agentic HLS showing responsibility shift from humans to AI agents.

Result: Establishes HLS as critical layer for agentic optimization in hardware design, identifies specific tool limitations (inadequate performance feedback, rigid interfaces, limited debuggability), and provides framework for evolution from copilots to autonomous design partners.

Conclusion: HLS remains essential in the agentic era, serving as a natural optimization layer that enables faster iteration and design exploration, with AI agents uniquely positioned to address current HLS tool limitations.

Abstract: The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization.This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.

[96] SentiFuse: Deep Multi-model Fusion Framework for Robust Sentiment Extraction

Hieu Minh Duong, Rupa Ghosh, Cong Hoan Nguyen, Eugene Levin, Todd Gary, Long Nguyen

Main category: cs.CL

TL;DR: SentiFuse is a model-agnostic framework that integrates heterogeneous sentiment analysis models through standardization and multiple fusion strategies (decision-level, feature-level, adaptive) to leverage complementary strengths, achieving improved performance over individual models and naive ensembles.

Details

Motivation: Existing sentiment analysis models have complementary strengths but lack a unified framework for effective integration. Current approaches don't systematically combine diverse models to leverage their different capabilities.

Method: SentiFuse uses a standardization layer to normalize outputs from heterogeneous models, then applies multiple fusion strategies: decision-level fusion (combining final predictions), feature-level fusion (combining intermediate representations), and adaptive fusion (dynamically weighting models based on input characteristics).

Result: Experiments on three large-scale social-media datasets (Crowdflower, GoEmotions, Sentiment140) show SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves up to 4% absolute F1 improvement over best individual model, while adaptive fusion enhances robustness on challenging cases like negation and mixed emotions.

Conclusion: Systematically leveraging model complementarity through frameworks like SentiFuse yields more accurate and reliable sentiment analysis across diverse datasets and text types, with different fusion strategies offering different advantages.

Abstract: Sentiment analysis models exhibit complementary strengths, yet existing approaches lack a unified framework for effective integration. We present SentiFuse, a flexible and model-agnostic framework that integrates heterogeneous sentiment models through a standardization layer and multiple fusion strategies. Our approach supports decision-level fusion, feature-level fusion, and adaptive fusion, enabling systematic combination of diverse models. We conduct experiments on three large-scale social-media datasets: Crowdflower, GoEmotions, and Sentiment140. These experiments show that SentiFuse consistently outperforms individual models and naive ensembles. Feature-level fusion achieves the strongest overall effectiveness, yielding up to 4% absolute improvement in F1 score over the best individual model and simple averaging, while adaptive fusion enhances robustness on challenging cases such as negation, mixed emotions, and complex sentiment expressions. These results demonstrate that systematically leveraging model complementarity yields more accurate and reliable sentiment analysis across diverse datasets and text types.

[97] Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language

Umme Abira Azmary, MD Ikramul Kayes, Swakkhar Shatabda, Farig Yousuf Sadeque

Main category: cs.CL

TL;DR: BanglaCQA: First counterfactual QA dataset for Bangla with analysis of parametric vs. contextual knowledge in low-resource language models, showing CoT prompting effectively extracts parametric knowledge in counterfactual scenarios.

Details

Motivation: Bangla QA models face challenges due to limited annotated data and linguistic complexity, with existing datasets lacking structure to analyze whether models rely more on pre-encoded parametric knowledge or contextual input during answer generation.

Method: Introduce BanglaCQA dataset by extending existing Bangla dataset with counterfactual passages and answerability annotations. Propose fine-tuned pipelines for encoder-decoder language-specific/multilingual models and prompting-based pipelines for decoder-only LLMs. Apply LLM-based and human evaluation with semantic similarity metrics.

Result: Chain-of-Thought prompting reveals uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Detailed analysis shows how models perform across different QA settings in low-resource languages.

Conclusion: Introduces novel framework for analyzing knowledge sources in Bangla QA and uncovers critical findings that open broader directions for counterfactual reasoning in low-resource language settings.

Abstract: Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.

[98] ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

Jie Deng, Shining Liang, Jun Li, Hongzhi Li, Yutao Xie

Main category: cs.CL

TL;DR: ConPress is a self-supervised fine-tuning method that leverages multi-question prompts to induce models to produce shorter reasoning traces, then uses these compressed traces to fine-tune models for more efficient single-question reasoning.

Details

Motivation: Large reasoning models generate long chain-of-thought traces that cause substantial inference overhead. The authors discovered that when multiple independent questions are presented together, models spontaneously produce shorter reasoning traces for each question, a phenomenon they call "Self-Compression."

Method: ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, parses and filters per-question traces to obtain concise yet correct reasoning trajectories, then uses these for supervised fine-tuning to internalize compressed reasoning behavior.

Result: With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25 while maintaining competitive accuracy, demonstrating efficient compression of reasoning traces without external teachers or reinforcement learning.

Conclusion: The self-compression phenomenon can be leveraged through ConPress to create more efficient reasoning models that maintain accuracy while significantly reducing inference overhead, offering a lightweight self-supervised approach to reasoning optimization.

Abstract: Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.

[99] Ebisu: Benchmarking Large Language Models in Japanese Finance

Xueqing Peng, Ruoyu Xiang, Fan Zhang, Mingzi Song, Mingyang Jiang, Yan Wang, Lingfei Qian, Taiki Hara, Yuqing Guo, Jimin Huang, Junichi Tsujii, Sophia Ananiadou

Main category: cs.CL

TL;DR: Ebisu is a benchmark for Japanese financial language understanding with two expert-annotated tasks: implicit commitment recognition in investor Q&A and hierarchical financial terminology extraction from disclosures, showing current LLMs struggle despite scale and adaptation.

Details

Motivation: Japanese financial language presents unique challenges due to agglutinative head-final structure, mixed writing systems, and high-context communication norms with indirect expression and implicit commitments, creating substantial difficulties for LLMs that need specialized evaluation.

Method: Created Ebisu benchmark with two tasks: JF-ICR (implicit commitment and refusal recognition in investor Q&A) and JF-TE (hierarchical extraction and ranking of nested financial terminology from professional disclosures). Evaluated diverse LLMs including general-purpose, Japanese-adapted, and financial models.

Result: Even state-of-the-art LLMs struggle on both tasks. Increased model scale yields limited improvements, and language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps in Japanese financial language understanding.

Conclusion: Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP, highlighting the need for specialized approaches to handle Japanese financial language’s unique characteristics. All datasets and evaluation scripts are publicly released.

Abstract: Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.

[100] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang

Main category: cs.CL

TL;DR: Rubric-ARM: A framework that jointly optimizes rubric generation and judgment using RL from preference feedback, treating rubric generation as a latent action to improve response quality assessment in non-verifiable domains.

Details

Motivation: Standard reward models use scalar scores that fail to capture multifaceted response quality in non-verifiable domains like creative writing or open-ended instruction following, creating a need for more nuanced evaluation frameworks.

Method: Proposes Rubric-ARM framework with joint optimization of rubric generator and judge using reinforcement learning from preference feedback. Uses alternating optimization strategy to mitigate non-stationarity of simultaneous updates, treating rubric generation as a latent action learned to maximize judgment accuracy.

Result: Achieves state-of-the-art performance on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

Conclusion: Rubric-ARM provides an effective framework for nuanced response quality assessment in non-verifiable domains through joint optimization of rubric generation and judgment, with theoretical guarantees and practical benefits for policy alignment.

Abstract: Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

[101] Argument Rarity-based Originality Assessment for AI-Assisted Writing

Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji

Main category: cs.CL

TL;DR: AROA framework for evaluating argumentative originality in essays using rarity metrics, revealing quality-originality trade-off and limitations of LLMs in generating original content despite structural competence.

Details

Motivation: With LLMs capable of generating high-quality text, traditional quality-focused writing assessment is losing significance. Education should foster critical thinking and original perspectives, requiring assessment to shift from quality to originality evaluation.

Method: Proposes Argument Rarity-based Originality Assessment (AROA) framework that defines originality as rarity within a reference corpus. Evaluates through four components: structural rarity, claim rarity, evidence rarity, and cognitive depth. Uses density estimation to quantify rarity and integrates with quality adjustment mechanism, treating quality and originality as independent evaluation axes.

Result: Experiments show strong negative correlation between quality and claim rarity, demonstrating quality-originality trade-off. AI essays achieved comparable structural complexity to human essays but had substantially lower claim rarity, indicating LLMs can reproduce argumentation form but have limitations in content originality.

Conclusion: AROA provides effective framework for originality assessment, revealing fundamental differences between human and AI-generated content. LLMs excel at structural reproduction but struggle with original claims, highlighting need for originality-focused assessment in education.

Abstract: As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.

[102] FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang

Main category: cs.CL

TL;DR: FS-Researcher is a file-system-based dual-agent framework for long-horizon deep research tasks that scales beyond LLM context limits using persistent external memory.

Details

Motivation: Long trajectories in deep research tasks exceed LLM context limits, compressing token budgets for evidence collection and report writing, and preventing effective test-time scaling.

Method: Uses a dual-agent framework: Context Builder agent browses internet, writes structured notes, archives raw sources into hierarchical knowledge base; Report Writer agent composes final report section by section using knowledge base as source. File system serves as durable external memory and shared coordination medium.

Result: Achieves state-of-the-art report quality on DeepResearch Bench and DeepConsult benchmarks across different backbone models. Shows positive correlation between report quality and computation allocated to Context Builder, validating test-time scaling.

Conclusion: File-system paradigm enables scaling deep research beyond context window via persistent workspace, allowing iterative refinement and effective test-time scaling for long-horizon tasks.

Abstract: Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at https://github.com/Ignoramus0817/FS-Researcher.

[103] LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen

Main category: cs.CL

TL;DR: Value Aggregation (VA) method uses attention value vectors instead of hidden states for better sentence representations, achieving state-of-the-art training-free LLM embeddings.

Details

Motivation: Current LLM-based sentence representations rely on final-layer hidden states optimized for next-token prediction, which fail to capture global sentence-level semantics effectively.

Method: Proposes Value Aggregation (VA) that pools token values across multiple layers and token indices. Further refines to Aligned Weighted VA (AlignedWVA) using attention scores as weights and output projection matrix for alignment.

Result: VA outperforms other LLM-based embeddings in training-free setting, matches/surpasses ensemble-based MetaEOL. AlignedWVA achieves SOTA among training-free LLM embeddings, substantially outperforming high-cost MetaEOL.

Conclusion: Attention value vectors capture sentence semantics better than hidden states. Value Aggregation methods provide strong LLM embeddings, with potential for further improvement through fine-tuning.

Abstract: Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings, even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.

[104] Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment

Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun

Main category: cs.CL

TL;DR: CSS framework provides certifiable robustness against jailbreak attacks via semantic smoothing with stratified randomized ablation and noise-augmented alignment tuning, achieving strong safety guarantees while maintaining utility.

Details

Motivation: LLMs remain vulnerable to adaptive jailbreaks that bypass empirical defenses like GCG, requiring provable safety guarantees rather than heuristic approaches.

Method: Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation partitions inputs into immutable structural prompts and mutable payloads, using Hypergeometric distribution for rigorous guarantees. Noise-Augmented Alignment Tuning (NAAT) transforms base models into semantic denoisers to handle sparse contexts.

Result: Reduces Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines (74.3% utility).

Conclusion: The framework provides deterministic certificates of safety, ensuring models remain robust against all adversarial variants within a provable radius, offering a principled approach to LLM safety.

Abstract: Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous lo norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.

[105] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang

Main category: cs.CL

TL;DR: A live benchmark (Wiki Live Challenge) using Wikipedia Good Articles as expert references to evaluate Deep Research Agents, with comprehensive evaluation framework including 39 criteria for writing quality and factual verifiability metrics.

Details

Motivation: Current evaluation frameworks for Deep Research Agents rely on LLM-generated references which lack reliability of expert-verified content and struggle to provide objective, fine-grained assessments. There's a need for more rigorous evaluation using expert-level references.

Method: Introduces Wiki Live Challenge (WLC) benchmark using 100 recent Wikipedia Good Articles as expert-level references. Develops Wiki Eval framework with fine-grained evaluation method (39 criteria for writing quality) and rigorous metrics for factual verifiability.

Result: Experiments show significant gap between current Deep Research Agents and human expert-level Wikipedia articles, validating WLC’s effectiveness in advancing agent research.

Conclusion: WLC provides a reliable, expert-verified benchmark for evaluating Deep Research Agents, addressing limitations of current LLM-based evaluation approaches and enabling more rigorous assessment of agent capabilities.

Abstract: Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia’s strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at https://github.com/WangShao2000/Wiki_Live_Challenge

[106] The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation

Mingwen Zhang, Minqiang Yang, Changsheng Ma, Yang Yu, Hui Bai, Chen Xu, Xiangzhen Kong, Bin Hu

Main category: cs.CL

TL;DR: Socratic Inquiry Framework (SIF) transforms LLMs from reactive to proactive therapeutic agents by enabling structured, cognition-guiding questioning in CBT contexts.

Details

Motivation: Current psychological LLMs are overly reactive, providing empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change in therapy, lacking the proactive questioning essential for effective CBT.

Method: Proposes Socratic Inquiry Framework (SIF) with two components: Strategy Anchoring (when to ask) and Template Retrieval (what to ask), plus Socratic-QA dataset for supervision. SIF is a lightweight, plug-and-play intent planner that enables context-aware questioning without end-to-end retraining.

Result: SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, shifting LLMs from reactive comfort to proactive exploration in therapeutic contexts.

Conclusion: Establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide, with SIF enabling theory-grounded proactive questioning in therapy.

Abstract: Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the \textbf{Socratic Inquiry Framework (SIF)}, a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples \textbf{when to ask} (via Strategy Anchoring) from \textbf{what to ask} (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce \textbf{Socratic-QA}, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.

[107] SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia

Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat

Main category: cs.CL

TL;DR: SEA-Guard: A culturally-aware AI safety framework for Southeast Asia using agentic data generation to create region-specific safety datasets and multilingual safeguard models.

Details

Motivation: Current AI safety models rely on English datasets translated to other languages, missing crucial cultural nuances and region-specific values, norms, and regulations. This is particularly problematic for Southeast Asia where cultural diversity requires authentic, locally-grounded safety approaches.

Method: Developed a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia. Built the SEA-Guard family of multilingual safeguard models grounded in SEA cultural contexts.

Result: SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content across multiple benchmarks and cultural variants while maintaining strong general safety performance.

Conclusion: Culturally aware safeguards are essential for AI alignment in real-world settings, and the SEA-Guard framework demonstrates that region-specific, culturally-grounded safety models can be effectively developed through scalable data generation approaches.

Abstract: Culturally aware safeguards are crucial for AI alignment in real-world settings, where safety extends beyond common sense and encompasses diverse local values, norms, and region-specific regulations. However, building large-scale, culturally grounded datasets is challenging due to limited resources and a scarcity of native annotators. Consequently, many safeguard models rely on machine translation of English datasets, often missing regional and cultural nuances. We present a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia (SEA). On this foundation, we introduce the SEA-Guard family, the first multilingual safeguard models grounded in SEA cultural contexts. Evaluated across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.

[108] A2Eval: Agentic and Automated Evaluation for Embodied Brain

Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang, Yingji Zhang, Ziyi Zhou, Junwei Liao, Shengjie Zhou, Yong Dai, Zhenzhong Lan, Xiaozhu Ju

Main category: cs.CL

TL;DR: A2Eval is an agentic framework that automates benchmark curation and evaluation for embodied VLMs using two collaborative agents to address redundancy and bias in current evaluation methods.

Details

Motivation: Current embodied VLM evaluation relies on static, manually annotated benchmarks that are labor-intensive, computationally expensive, and suffer from redundancy and coverage imbalance, distorting model rankings and hindering iterative development.

Method: Proposes Agentic Automatic Evaluation (A2Eval) with two collaborative agents: 1) Data Agent autonomously induces capability dimensions and assembles balanced, compact evaluation suites, and 2) Eval Agent synthesizes and validates executable evaluation pipelines for fully autonomous assessment.

Result: A2Eval compresses evaluation suites by 85%, reduces computational costs by 77%, delivers 4.6x speedup while preserving evaluation quality, corrects systematic ranking biases, improves human alignment (Spearman’s rho=0.85), and maintains high ranking fidelity (Kendall’s tau=0.81).

Conclusion: A2Eval establishes a new standard for high-fidelity, low-cost embodied assessment, enabling more efficient and accurate evaluation of embodied VLMs through autonomous benchmark curation and evaluation.

Abstract: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman’s rho=0.85, and maintains high ranking fidelity (Kendall’s tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.

[109] Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models

Jiaqian Li, Yanshu Li, Kuan-Hao Huang

Main category: cs.CL

TL;DR: Steering Vector Fields (SVF) improves inference-time control of LLMs by making steering directions context-dependent rather than using static vectors, addressing reliability issues in long-form and multi-attribute steering.

Details

Motivation: Current steering vectors (SVs) for controlling LLMs at inference time are unreliable - some concepts are unsteerable, steering can backfire for many inputs, and reliability degrades in long-form generation and multi-attribute steering. The static nature of SVs assumes constant concept-improving directions across contexts, which doesn't hold in practice.

Method: Proposes Steering Vector Fields (SVF) which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This supports coordinated multi-layer interventions in a shared concept space and enables efficient long-form and multi-attribute control.

Result: SVF delivers stronger and more reliable control across multiple LLMs and steering tasks, improving the practicality of inference-time steering compared to static steering vectors.

Conclusion: By making steering directions context-dependent through gradient fields, SVF addresses fundamental limitations of static steering vectors and provides a more effective framework for inference-time control of language models.

Abstract: Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.

[110] CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

Zhongyuan Peng, Caijun Xu, Changyi Xiao, Shibo Hong, Eli Zhang, Stephen Huang, Yixin Cao

Main category: cs.CL

TL;DR: CoDiQ framework enables fine-grained difficulty control for generating competition-level reasoning questions with high solvability, creating a 44K corpus that improves Large Reasoning Models’ performance.

Details

Motivation: Existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale for training Large Reasoning Models.

Method: CoDiQ framework uses test-time scaling for fine-grained difficulty control while ensuring solvability. Identifies scaling tendencies, develops CoDiQ-Generator from Qwen3-8B to improve upper bound of difficult question generation, and builds CoDiQ-Corpus of 44K competition-grade question sequences.

Result: Human evaluations show CoDiQ questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance.

Conclusion: Scaling controlled-difficulty training questions enhances reasoning capabilities. The framework and corpus are open-sourced to support related research.

Abstract: Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model’s ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.

[111] Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C. Kerce, Faramarz Fekri

Main category: cs.CL

TL;DR: DeepControl: Adaptive information control framework for search-augmented reasoning agents using formal information utility to regulate retrieval continuation and granularity.

Details

Motivation: Existing search-augmented reasoning agents suffer from uncontrolled retrieval leading to redundant evidence, context saturation, and unstable learning. Current approaches using outcome-based RL provide limited guidance for regulating information acquisition.

Method: Proposes DeepControl framework based on formal information utility that measures marginal value of retrieved evidence. Introduces retrieval continuation and granularity control mechanisms to regulate when to continue/stop retrieval and how much information to expand. Uses annealed control strategy to internalize effective information acquisition behaviors.

Result: Extensive experiments across seven benchmarks show consistent outperformance of strong baselines. Achieves average improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B over outcome-based RL baselines. Consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control.

Conclusion: Adaptive information control is crucial for scaling search-augmented reasoning agents to complex, real-world information environments. DeepControl demonstrates effectiveness of formal information utility for regulating retrieval in reasoning systems.

Abstract: Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.

[112] Counting Hypothesis: Potential Mechanism of In-Context Learning

Jung H. Lee, Sujith Vijayan

Main category: cs.CL

TL;DR: The paper proposes a “counting hypothesis” for In-Context Learning (ICL) in LLMs, suggesting that LLMs’ encoding strategies may underlie ICL mechanisms, supported by evidence from analyzing ICL properties and LLMs’ functional modules.

Details

Motivation: ICL enables LLMs to learn tasks from input examples without modifying internal structure, but its underlying mechanisms remain poorly understood, making error correction and diagnosis challenging. There's a need to better understand ICL limitations and how LLMs support ICL.

Method: Inspired by ICL properties and LLMs’ functional modules, the authors propose the “counting hypothesis” of ICL, which suggests that LLMs’ encoding strategy may underlie ICL, and provide supporting evidence for this hypothesis.

Result: The paper presents evidence supporting the counting hypothesis, suggesting that LLMs’ encoding strategies play a fundamental role in enabling ICL capabilities.

Conclusion: Understanding the counting hypothesis and LLMs’ encoding strategies provides insights into ICL mechanisms, which could help improve error correction, diagnosis, and broader utilization of LLMs across domains.

Abstract: In-Context Learning (ICL) indicates that large language models (LLMs) pretrained on a massive amount of data can learn specific tasks from input prompts’ examples. ICL is notable for two reasons. First, it does not need modification of LLMs’ internal structure. Second, it enables LLMs to perform a wide range of tasks/functions with a few examples demonstrating a desirable task. ICL opens up new ways to utilize LLMs in more domains, but its underlying mechanisms still remain poorly understood, making error correction and diagnosis extremely challenging. Thus, it is imperative that we better understand the limitations of ICL and how exactly LLMs support ICL. Inspired by ICL properties and LLMs’ functional modules, we propose 1the counting hypothesis’ of ICL, which suggests that LLMs’ encoding strategy may underlie ICL, and provide supporting evidence.

[113] Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan

Main category: cs.CL

TL;DR: LED (Latent Exploration Decoding) improves reasoning performance in Large Reasoning Models by addressing exploration collapse through depth-conditioned decoding that aggregates intermediate layer posteriors.

Details

Motivation: Modern reasoning post-training in Large Reasoning Models causes exploration collapse where temperature-based sampling no longer improves pass@n accuracy, due to sharply reduced entropy in final-layer posteriors while intermediate layers maintain higher entropy.

Method: Proposes Latent Exploration Decoding (LED) - a depth-conditioned decoding strategy that aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates, requiring no additional training or parameters.

Result: LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models without additional training.

Conclusion: LED effectively addresses exploration collapse in post-trained reasoning models by leveraging entropy asymmetry between intermediate and final layers, providing a simple yet effective decoding strategy for improved reasoning performance.

Abstract: Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.

[114] Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory

Langyuan Cui, Chun Kai Ling, Hwee Tou Ng

Main category: cs.CL

TL;DR: GoT framework uses game theory to improve LLMs’ worst-case information-seeking performance in adversarial settings like Strategic Language Search

Details

Motivation: LLMs often lack sufficient information in real-world scenarios but need to actively seek missing information; existing approaches degrade worst-case performance, which is problematic for high-stakes applications

Method: Formalizes Strategic Language Search as a two-player zero-sum extensive form game, proposes Game of Thought (GoT) framework using game-theoretic techniques to approximate Nash equilibrium strategies

Result: GoT consistently improves worst-case performance compared to direct prompting and heuristic-guided search methods across all tested settings

Conclusion: Game-theoretic approaches like GoT can effectively enhance LLMs’ information-seeking capabilities in adversarial scenarios while maintaining worst-case performance guarantees

Abstract: Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade \textit{worst-case} performance. This is an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem along with its variants as a two-player zero-sum extensive form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.

[115] ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Main category: cs.CL

TL;DR: ARTIS: A framework for agentic risk-aware test-time scaling via iterative simulation that decouples exploration from commitment by enabling simulated interactions before real-world execution to improve action-level reliability without environmental risk.

Details

Motivation: Current test-time scaling techniques for LLMs are insufficient for agentic settings where actions interact with external environments and can have irreversible, costly effects. There's a need for methods that improve action-level reliability and robustness without incurring environmental risk.

Method: Proposes ARTIS framework that enables test-time exploration through simulated interactions prior to real-world execution. Introduces risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training to capture rare but high-impact failure modes.

Result: Experiments on multi-turn and multi-step agentic benchmarks show that iterative simulation substantially improves agent reliability, and risk-aware simulation is essential for consistently realizing these gains across models and tasks.

Conclusion: ARTIS provides an effective framework for improving agentic decision-making reliability by decoupling exploration from commitment through simulated interactions, with risk-aware simulation being crucial for capturing rare failure modes and achieving consistent performance gains.

Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose \emph{\name}, \emph{\underline{A}gentic \underline{R}isk-Aware \underline{T}est-Time Scaling via \underline{I}terative \underline{S}imulation}, a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a \emph{risk-aware tool simulator} that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.

[116] MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

Mouath Abu-Daoud, Leen Kharouf, Omar El Hajj, Dana El Samad, Mariam Al-Omari, Jihad Mallat, Khaled Saleh, Nizar Habash, Farah E. Shamout

Main category: cs.CL

TL;DR: MedAraBench: A large-scale Arabic medical QA dataset created by digitizing academic materials, spanning 19 specialties and 5 difficulty levels, with benchmarking of 8 LLMs showing need for domain-specific improvements.

Details

Motivation: Arabic is underrepresented in NLP research, especially in medical applications, due to limited open-source data and benchmarks, hindering evaluation of multilingual LLM capabilities.

Method: Manually digitized academic materials from Arabic medical professionals, extensive preprocessing, split into training/test sets, used expert human evaluation and LLM-as-a-judge for quality assessment.

Result: Created diverse, high-quality dataset spanning 19 specialties and 5 difficulty levels; benchmarked 8 SOTA models (GPT-5, Gemini 2.0 Flash, Claude 4-Sonnet) showing need for domain-specific enhancements.

Conclusion: Dataset and evaluation scripts released to broaden medical data benchmarks, expand LLM evaluation suites, and enhance multilingual capabilities for clinical deployment.

Abstract: Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.

[117] Mechanistic Indicators of Steering Effectiveness in Large Language Models

Mehdi Jafari, Hao Xue, Flora Salim

Main category: cs.CL

TL;DR: Paper investigates internal model signals (entropy and KL divergence) to diagnose reliability of activation steering in LLMs, showing these mechanistic signals predict steering success/failure.

Details

Motivation: Despite widespread use of activation-based steering for targeted behaviors in LLMs, the mechanistic factors governing steering success/failure remain poorly understood, as prior work relied on black-box outputs or LLM-based judges rather than internal model signals.

Method: Focuses on two information-theoretic measures: Normalized Branching Factor (NBF) derived from entropy, and KL divergence between steered activations and targeted concepts in vocabulary space. Uses LLM-generated annotations as ground truth based on reliability study showing high inter-judge agreement between architecturally distinct LLMs.

Result: Mechanistic signals (entropy preservation and KL alignment) provide meaningful predictive power for identifying successful steering and estimating failure probability. Also introduces stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering methods.

Conclusion: Internal model signals can effectively diagnose reliability of activation steering in LLMs, with structured entropy preservation and coherent KL alignment across decoding steps correlating with successful steering outcomes.

Abstract: Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.

[118] BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

Hyunsik Kim, Haeri Kim, Munhak Lee, Kyungmin Lee

Main category: cs.CL

TL;DR: BBPE16: UTF-16-based byte-level BPE tokenizer for multilingual ASR that reduces token sequence length for non-Latin scripts while maintaining language-agnostic properties.

Details

Motivation: Current UTF-8 BBPE tokenizers inflate token sequences for non-Latin scripts (CJK languages), increasing computational load and memory usage in multilingual ASR systems.

Method: Proposes BBPE16, a UTF-16-based byte-level BPE tokenizer that represents most modern scripts with uniform 2-byte code units, improving cross-lingual token sharing.

Result: BBPE16 achieves comparable or better accuracy across monolingual, bilingual, and trilingual ASR; reduces Chinese token counts by up to 10.4% and decoding iterations by up to 10.3%.

Conclusion: BBPE16 offers practical advantages for multilingual ASR by reducing computational load, memory usage, and speeding up fine-tuning and inference while maintaining language-agnostic properties.

Abstract: Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE’s language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.

[119] COMI: Coarse-to-fine Context Compression via Marginal Information Gain

Jiwei Tang, Shilei Liu, Zhicheng Zhang, Yujin Yuan, Libin Zheng, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: COMI: A coarse-to-fine adaptive context compression framework for LLMs that optimizes semantic relevance and diversity under high compression rates using Marginal Information Gain metric.

Details

Motivation: LLMs face computational inefficiency and information redundancy in long context scenarios, limiting deployment. Context compression methods help but need better optimization for both relevance and diversity under high compression rates.

Method: Two-stage framework: 1) Coarse-grained group reallocation partitions context into groups and dynamically assigns compression rates based on inter-group Marginal Information Gain (MIG). 2) Fine-grained token merging fuses tokens within groups using intra-group MIG-based weighting to preserve key semantics while reducing redundancy.

Result: Extensive experiments across QA and summarization tasks show COMI outperforms existing baselines by large margins, achieving ~25-point EM improvement under 32x compression with Qwen2-7B on NaturalQuestions.

Conclusion: COMI effectively addresses LLM context compression challenges by jointly optimizing semantic relevance and diversity through MIG-guided coarse-to-fine compression, enabling efficient long-context processing.

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and low redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA), summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.

[120] SafePred: A Predictive Guardrail for Computer-Using Agents via World Models

Yurun Chen, Zeyi Liao, Ping Yin, Taotao Xie, Keting Yin, Shengyu Zhang

Main category: cs.CL

TL;DR: SafePred is a predictive guardrail framework for computer-using agents that proactively prevents long-term risks by aligning predicted future risks with current decisions, unlike reactive approaches that only address immediate risks.

Details

Motivation: Existing reactive guardrails for computer-using agents can only prevent immediate short-term risks but fail to address long-term risks where seemingly reasonable actions lead to delayed high-risk consequences. This gap necessitates a predictive approach to proactively avoid such emergent risks.

Method: SafePred establishes a risk-to-decision loop with two key abilities: 1) Short- and long-term risk prediction using safety policies and world model prediction to generate semantic risk representations and prune high-risk actions; 2) Decision optimization through step-level interventions and task-level re-planning to translate predicted risks into safe decision guidance.

Result: Extensive experiments show SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.

Conclusion: The predictive guardrail approach effectively addresses limitations of reactive methods by proactively preventing both short- and long-term risks in computer-using agents, establishing a robust safety framework for real-world deployment.

Abstract: With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidances through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.

[121] Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training

Hongseok Choi, Serynn Kim, Wencke Liermann, Jin Seong, Jin-Xia Huang

Main category: cs.CL

TL;DR: Novel Automated Essay Scoring approach using two-stage fine-tuning, score alignment, and uncertainty-aware self-training to improve performance in limited-data settings.

Details

Motivation: Real-world AES systems face extreme scarcity of labeled data, limiting development and adoption. Need robust methods that work well with limited labeled samples.

Method: Three key techniques: 1) Two-stage fine-tuning with low-rank adaptations for prompt adaptation, 2) Score alignment for distribution consistency, 3) Uncertainty-aware self-training using unlabeled data with pseudo-labels.

Result: In 32-data setting, all techniques improve performance, achieving 91.2% of full-data performance with only ~1,000 labeled samples. Score alignment achieves SOTA results in full-data setting when integrated into DualBERT.

Conclusion: Proposed techniques effectively address data scarcity in AES, with score alignment providing consistent improvements across both limited and full-data settings.

Abstract: Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptations to better adapt an AES model to target prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training using unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label noise propagation. We implement above three key techniques on DualBERT. We conduct extensive experiments on the ASAP++ dataset. As a result, in the 32-data setting, all three key techniques improve performance, and their integration achieves 91.2% of the full-data performance trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings: e.g., it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.

[122] WorldCup Sampling for Multi-bit LLM Watermarking

Yidan Wang, Yubing Ren, Yanan Cao, Li Guo

Main category: cs.CL

TL;DR: WorldCup is a multi-bit watermarking framework for LLMs that embeds messages directly into token selection via hierarchical competition, achieving better balance across capacity, detectability, robustness, and text quality than prior methods.

Details

Motivation: Existing multi-bit watermarking methods for LLMs largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. There's a need for more direct and efficient watermarking that can encode richer provenance information while maintaining text quality.

Method: WorldCup treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. It uses entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding.

Result: Comprehensive experiments show WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines.

Conclusion: WorldCup lays a solid foundation for future LLM watermarking studies by providing a more direct and efficient approach to multi-bit watermarking that maintains text quality while enabling richer provenance encoding.

Abstract: As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup further adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.

[123] Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings

Doohyun Kim, Donghwa Kang, Kyungjae Lee, Hyeongboo Baek, Brent Byunghoon Kang

Main category: cs.CL

TL;DR: Zero2Text is a training-free framework for embedding inversion attacks on vector databases that uses recursive online alignment to recover text from embeddings without requiring training data or excessive queries.

Details

Motivation: Vector databases in RAG systems pose privacy risks through embedding inversion attacks. Existing methods face trade-offs: optimization-based approaches need too many queries, while alignment-based methods require accessible in-domain training data, making them ineffective in black-box and cross-domain settings.

Method: Zero2Text uses recursive online alignment that synergizes LLM priors with dynamic ridge regression to iteratively align text generation to target embeddings on-the-fly, without requiring training data or static datasets.

Result: Extensive experiments show Zero2Text outperforms baselines significantly. On MS MARCO against OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores, recovering sentences from unknown domains without leaked data pairs. Standard defenses like differential privacy fail against this adaptive threat.

Conclusion: Zero2Text dismantles barriers in embedding inversion attacks by providing an effective training-free framework that works in strict black-box and cross-domain settings, highlighting vulnerabilities in current vector database privacy protections.

Abstract: The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.

[124] <SOG_k>: One LLM Token for Explicit Graph Structural Understanding

Jingyao Wu, Bin Lu, Zijun Di, Xiaoying Gan, Meng Jin, Luoyi Fu, Xinbing Wang, Chenghu Zhou

Main category: cs.CL

TL;DR: SOG introduces special structural tokens to represent graph topologies in LLMs, enabling efficient graph understanding without excessive tokenization or embedding misalignment.

Details

Motivation: LLMs struggle with graph data due to structural hallucination - existing methods either use verbose verbalization (excessive tokens) or continuous embeddings (misaligned with text tokens), creating inefficiencies and inaccuracies.

Method: Proposes a topology-aware structural tokenizer that maps each graph topology into a single special token <SOG_k>, creating a unified token space. Constructs hybrid structure QA corpora to align structural tokens with existing text tokens.

Result: Achieves 9.9% to 41.4% performance improvement on five graph-level benchmarks compared to baselines. Shows interpretability, consistency, and flexible extension to node-level tasks for both global and local structural understanding.

Conclusion: SOG enables LLMs to understand, generate, and reason about graphs concisely and accurately through structural tokens, bridging the gap between text and graph modalities.

Abstract: Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompt), but exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one special token <SOG_k> to fully represent the Structure Of Graph within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align new structural tokens with existing text tokens. With this approach, <SOG_k> empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9% to 41.4% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The codebase is publicly available at https://github.com/Jingyao-Wu/SOG.

[125] Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model

Kangtao Lv, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Shilei Liu, Yongwei Wang, Yujin Yuan, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: This paper investigates how data distribution affects context compression quality in LLMs, examining both input data and the model’s internal pretrained knowledge, revealing that encoder-measured input entropy negatively correlates with compression quality and that encoder-decoder intrinsic data gaps diminish compression gains.

Details

Motivation: LLMs face computational inefficiency and information redundancy in long-context scenarios. While context compression has been adopted, existing research focuses only on model-side improvements, leaving the impact of data distribution on compression quality unexplored. The paper aims to bridge this gap by taking a data-centric perspective.

Method: The authors adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality across two dimensions: input data and intrinsic data (model’s internal pretrained knowledge). They evaluate semantic integrity of compressed representations using an autoencoder-based framework and analyze relationships between entropy measures and compression quality.

Result: Experimental results show: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under frozen-decoder settings; (2) the gap between intrinsic data of encoder and decoder significantly diminishes compression gains, which is hard to mitigate.

Conclusion: The study provides the first systematic investigation of data distribution’s impact on context compression quality. Based on findings, the authors present practical guidelines to optimize compression gains, highlighting the importance of considering both input data characteristics and encoder-decoder alignment in pretrained knowledge.

Abstract: The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent advancements have widely adopted context compression to address these challenges, existing research only focus on model-side improvements, the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data (i.e., the model’s internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework to systematically investigate it. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between intrinsic data of the encoder and decoder significantly diminishes compression gains, which is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.

[126] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

Main category: cs.CL

TL;DR: MLLMs can understand source code from rendered images with up to 8x token compression, leveraging visual cues like syntax highlighting while maintaining performance on code understanding tasks.

Details

Motivation: Current LLMs treat source code as linear text tokens, causing computational inefficiency as software scales. MLLMs offer opportunity to represent code as compressed images for more efficient inference.

Method: Systematic study exploring MLLMs for code understanding by representing source code as rendered images, adjusting resolution for compression, and evaluating performance across various compression ratios.

Result: MLLMs achieve up to 8x token compression while effectively understanding code, leverage visual cues like syntax highlighting for improved code completion under 4x compression, and show exceptional resilience in tasks like clone detection.

Conclusion: Image-modality code representation offers promising pathway for more efficient inference in code understanding, though current MLLMs have limitations that need addressing.

Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) Code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, which points out a shift toward image-modality code representation as a pathway to more efficient inference.

[127] Sentence Curve Language Models

DongNyeong Heo, Heelyoul Choi

Main category: cs.CL

TL;DR: SCLM introduces sentence curves as continuous representations for language modeling, extending diffusion-based LMs to predict global sentence structure rather than static word embeddings.

Details

Motivation: Current language models use static word embeddings that are insensitive to neighboring words, focusing on local word prediction while neglecting global sentence structure. This limits their ability to capture holistic linguistic patterns.

Method: Proposes sentence curves - spline curves whose control points affect multiple words in a sentence. Extends diffusion-based language models to predict these sentence curves instead of static word embeddings, promoting global structure modeling through regularization effects.

Result: SCLM achieves state-of-the-art performance among diffusion-based language models on IWSLT14 and WMT14 datasets, shows stable training without knowledge distillation, and demonstrates promising potential on LM1B compared to discrete diffusion-based LMs.

Conclusion: Sentence curve representation effectively addresses the limitation of static word embeddings by enabling better global structure modeling in language models, offering a promising direction for improving diffusion-based language modeling.

Abstract: Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of the static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.

[128] AXE: Low-Cost Cross-Domain Web Structured Information Extraction

Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban

Main category: cs.CL

TL;DR: AXE is a web data extraction pipeline that treats HTML DOM as a tree to prune, using a specialized mechanism to remove boilerplate and enable a small 0.6B LLM to generate precise structured outputs with traceable source nodes.

Details

Motivation: Current web data extraction methods face a trade-off between brittle manual heuristics and expensive large language models. There's a need for cost-effective, reliable extraction that maintains traceability to source data.

Method: AXE treats HTML DOM as a tree requiring pruning rather than text. It uses a specialized pruning mechanism to remove boilerplate, creating distilled context for a small 0.6B LLM. Grounded XPath Resolution (GXR) ensures traceability by linking extractions to source nodes.

Result: AXE achieves state-of-the-art zero-shot performance with 88.1% F1 score on SWDE dataset, outperforming larger fully-trained alternatives despite its small model size and low computational footprint.

Conclusion: AXE provides a practical, cost-effective solution for large-scale web information extraction by combining efficient DOM pruning with small LLMs and maintaining traceability through GXR.

Abstract: Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized “pruning” mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.

[129] Read As Human: Compressing Context via Parallelizable Close Reading and Skimming

Jiwei Tang, Shilei Liu, Zhicheng Zhang, Qingsong Lv, Runsong Zhao, Tingwei Lu, Langming Liu, Haibin Chen, Yujin Yuan, Hai-Tao Zheng, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: RAM is a context compression framework for LLMs that uses adaptive hybrid reading (close reading important segments, skimming less relevant ones) to improve efficiency on long-context tasks.

Details

Motivation: LLMs struggle with long-context scenarios due to computational inefficiency and redundant information. Current methods either compress everything (losing details) or process everything (inefficient).

Method: Partitions context into segments, encodes them with query in parallel. High-relevance segments are fully retained (close reading), low-relevance ones are query-guided compressed into summary vectors (skimming). Uses contrastive learning to refine decision boundaries.

Result: Outperforms existing baselines on multiple QA and summarization benchmarks across two backbones, with up to 12x end-to-end speedup on long inputs (average 16K, max 32K length).

Conclusion: RAM effectively addresses long-context challenges through human-inspired adaptive reading, achieving both performance gains and computational efficiency while maintaining interpretability.

Abstract: Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are query-guided compressed into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).

[130] PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

Langming Liu, Kangtao Lv, Haibin Chen, Weidong Zhang, Yejing Wang, Shilei Liu, Xin Tong, Yujin Yuan, Yongwei Wang, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: PretrainRL integrates reinforcement learning into pretraining to address factual hallucinations in LLMs by debiasing imbalanced data distributions and enabling effective learning of low-probability truths.

Details

Motivation: LLMs suffer from factual hallucinations due to imbalanced data distributions in pretraining corpora, creating "low-probability truth" and "high-probability falsehood" states. Existing approaches either evade the problem or cause catastrophic forgetting.

Method: Proposes PretrainRL framework with “debiasing then learning” principle: uses reinforcement learning during pretraining to reshape probability distributions by down-weighting high-probability falsehoods. Includes efficient negative sampling strategy to discover falsehoods and novel metrics to evaluate probabilistic state.

Result: Extensive experiments on three public benchmarks show PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.

Conclusion: PretrainRL addresses factual hallucinations at their root by integrating reinforcement learning into pretraining, effectively consolidating factual knowledge through debiasing and learning.

Abstract: Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of “low-probability truth” and “high-probability falsehood”. Recent approaches, such as teaching models to say “I don’t know” or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose \textbf{PretrainRL}, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is “\textbf{debiasing then learning}.” It actively reshapes the model’s probability distribution by down-weighting high-probability falsehoods, thereby making “room” for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model’s probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.

[131] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang

Main category: cs.CL

TL;DR: ES-MemEval benchmark evaluates LLM memory capabilities for long-term emotional support, revealing current limitations in handling evolving user states despite RAG improvements.

Details

Motivation: Current LLMs lack robust long-term memory for complex web services like emotional support, and existing benchmarks focus on static fact retrieval rather than dispersed, implicit, and evolving user information.

Method: Introduces ES-MemEval benchmark evaluating five memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, user modeling) and EvoEmo dataset capturing fragmented, implicit user disclosures across multiple sessions.

Result: Experiments show explicit long-term memory reduces hallucinations and enables personalization, while RAG improves factual consistency but struggles with temporal dynamics and evolving user states.

Conclusion: Current memory paradigms have limitations, motivating more robust integration of memory and retrieval for long-term personalized dialogue systems.

Abstract: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.

[132] GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs

Chengguang Gan, Yoshihiro Tsujii, Yunhao Liang, Tatsunori Mori, Shiwen Ni, Hiroki Itoh

Main category: cs.CL

TL;DR: GuideWeb introduces a benchmark for automatic in-app guide generation on web UIs, formulating the task as selecting guide target elements and generating guide text aligned with user intent.

Details

Motivation: Maintaining Digital Adoption Platform (DAP) guides is labor-intensive because website layouts and functionalities evolve continuously, requiring repeated manual updates and re-annotation.

Method: Proposes GuideWeb benchmark for automatic guide generation on real-world web UIs, with comprehensive evaluation suite measuring guide target element selection accuracy and guide text quality.

Result: GuideWeb Agent achieves 30.79% accuracy in guide target element prediction, BLEU scores of 44.94 for intent generation and 21.34 for guide-text generation, outperforming existing baselines.

Conclusion: Automatic guide generation remains challenging and requires further advances before reliable deployment in real-world settings, despite promising results from the proposed approach.

Abstract: Digital Adoption Platform (DAP) provide web-based overlays that deliver operation guidance and contextual hints to help users navigate complex websites. Although modern DAP tools enable non-experts to author such guidance, maintaining these guides remains labor-intensive because website layouts and functionalities evolve continuously, which requires repeated manual updates and re-annotation. In this work, we introduce \textbf{GuideWeb}, a new benchmark for automatic in-app guide generation on real-world web UIs. GuideWeb formulates the task as producing page-level guidance by selecting \textbf{guide target elements} grounded in the webpage and generating concise guide text aligned with user intent. We also propose a comprehensive evaluation suite that jointly measures the accuracy of guide target element selection and the quality of generated intents and guide texts. Experiments show that our proposed \textbf{GuideWeb Agent} achieves \textbf{30.79%} accuracy in guide target element prediction, while obtaining BLEU scores of \textbf{44.94} for intent generation and \textbf{21.34} for guide-text generation. Existing baselines perform substantially worse, which highlights that automatic guide generation remains challenging and that further advances are necessary before such systems can be reliably deployed in real-world settings.

[133] From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted “Vibe Coding”

Hend Al-Khalifa

Main category: cs.CL

TL;DR: Vibe Coding: Using LLMs as coding assistants in NLP education to shift focus from syntax to conceptual understanding, with positive student feedback but challenges in verification and time management.

Details

Motivation: Address the educational challenges and opportunities presented by LLMs in NLP courses, aiming to leverage LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking rather than just coding syntax.

Method: Implemented “Vibe Coding” approach in senior-level undergraduate NLP course with 7 labs where students used LLMs for code generation, assessed primarily through critical reflection questions. Mandatory prompt logging and reflection-based assessment were used to structure the approach.

Result: High student satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students valued reduced debugging cognitive load enabling deeper focus on NLP concepts. Challenges included time constraints, LLM output verification, and need for clearer task specifications.

Conclusion: When properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for AI-augmented professional landscape.

Abstract: The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces ``Vibe Coding,’’ a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.

[134] Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation

Kwun Hang Lau, Fangyuan Zhang, Boyu Ruan, Yingli Zhou, Qintian Guo, Ruiyuan Zhang, Xiaofang Zhou

Main category: cs.CL

TL;DR: CatRAG improves multi-hop RAG by making knowledge graph traversal query-adaptive, addressing the “Static Graph Fallacy” in previous approaches like HippoRAG.

Details

Motivation: Existing structure-aware RAG methods like HippoRAG suffer from "Static Graph Fallacy" - they use fixed transition probabilities that ignore query-dependent edge relevance, causing semantic drift and incomplete evidence retrieval for multi-hop queries.

Method: Builds on HippoRAG 2 architecture with three key innovations: (1) Symbolic Anchoring for weak entity constraints, (2) Query-Aware Dynamic Edge Weighting to modulate graph structure based on query intent, and (3) Key-Fact Passage Weight Enhancement to anchor random walks to likely evidence.

Result: Outperforms state-of-the-art baselines across four multi-hop benchmarks, showing substantial improvements in reasoning completeness (recovering entire evidence paths) beyond modest recall gains.

Conclusion: CatRAG effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning by making knowledge graph traversal query-adaptive.

Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a “Static Graph Fallacy”: they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree “hub” nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates graph structure, to prune irrelevant paths while amplifying those aligned with the query’s intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state of the art baselines. Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at https://github.com/kwunhang/CatRAG.

[135] Mixture-of-Experts with Intermediate CTC Supervision for Accented Speech Recognition

Wonjun Lee, Hyounghun Kim, Gary Geunbae Lee

Main category: cs.CL

TL;DR: Moe-Ctc: A Mixture-of-Experts architecture with intermediate CTC supervision that improves ASR performance on accented speech by promoting both expert specialization and generalization.

Details

Motivation: Accented speech remains challenging for ASR systems because most models are trained on data dominated by a few high-resource English varieties, causing performance degradation for other accents. Existing accent-agnostic approaches struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels.

Method: Introduces Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert has its own CTC head to align routing with transcription quality, and a routing-augmented loss stabilizes optimization.

Result: Experiments on the Mcv-Accent benchmark show consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.

Conclusion: Moe-Ctc effectively addresses accent robustness in ASR by jointly promoting expert specialization and generalization through a novel Mixture-of-Experts architecture with intermediate CTC supervision, demonstrating significant improvements over existing approaches.

Abstract: Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.

[136] Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

Bin Cao, Huixian Lu, Chenwen Ma, Ting Wang, Ruizhe Li, Jing Fan

Main category: cs.CL

TL;DR: OHD framework uses orthogonal tree decomposition to represent complex tables with hierarchical structures for better LLM understanding and reasoning.

Details

Motivation: Complex tables with multi-level headers, merged cells, and heterogeneous layouts pose challenges for LLMs in understanding and reasoning. Existing linearization or grid modeling approaches struggle to capture hierarchical structures and cross-dimensional dependencies, leading to misalignment between structural semantics and textual representations.

Method: Proposes Orthogonal Hierarchical Decomposition (OHD) framework with Orthogonal Tree Induction (OTI) that decomposes irregular tables into column and row trees to capture vertical/horizontal hierarchical dependencies. Uses dual-pathway association to reconstruct semantic lineage of each cell, and incorporates LLM as semantic arbitrator to align multi-level semantic information.

Result: OHD consistently outperforms existing representation paradigms across multiple evaluation metrics on two complex table question answering benchmarks: AITQA and HiTab.

Conclusion: The OHD framework effectively addresses challenges in representing complex tables for LLMs by preserving hierarchical structures and capturing cross-dimensional dependencies through orthogonal tree decomposition.

Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial–semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.

[137] Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing

Shuainan Liu, Xuanang Chen, Ben He, Le Sun

Main category: cs.CL

TL;DR: EVK introduces embedding-space perturbations to evaluate knowledge editing in LLMs beyond dataset-bounded samples, with EVK-Bench for comprehensive evaluation and EVK-Align for drift reduction.

Details

Motivation: Current knowledge editing evaluations are limited to predefined benchmarks with finite samples, failing to capture the broader impact on the model's entire knowledge system. There's a need to understand knowledge drift beyond explicit data annotations.

Method: Introduces Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space. Constructs EVK-Bench benchmark to quantify knowledge drift. Proposes EVK-Align module that constrains embedding-level drift during editing and integrates with existing methods.

Result: EVK enables more comprehensive evaluation of knowledge editing effects, revealing impacts not captured by conventional metrics. EVK-Align significantly improves knowledge preservation without sacrificing editing accuracy when integrated with existing editing methods.

Conclusion: EVK provides a novel embedding-space approach to evaluate knowledge editing more comprehensively, addressing limitations of sample-based metrics. The framework offers both evaluation tools and practical solutions for better knowledge preservation during editing.

Abstract: Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model’s knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.

[138] S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs

Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin, Ming Ma, Jiayun Li, Yi Jiang, Kai He, Qianyi Xu, Bing Qin, Mengling Feng

Main category: cs.CL

TL;DR: A self-sampling framework for efficient chain-of-thought learning that induces style-aligned reasoning traces from LLMs themselves without teacher guidance, enabling fast-thinking mode similar to human System 1 reasoning.

Details

Motivation: Current chain-of-thought methods in LLMs often involve redundant reasoning processes, and there's a need for LLMs to acquire a fast-thinking mode analogous to human System 1 reasoning. The scarcity of high-quality supervision data is a bottleneck for SFT-based methods.

Method: Self-sampling framework using activation steering to induce style-aligned, variable-length reasoning traces from target LLMs without teacher guidance. Uses filtered data by gold answers for SFT with dual-cognitive system and progressive compression curriculum. Also explores self-evolution regime using prediction-consistent data without gold answers.

Result: Extensive experiments on math benchmarks show stable improvements for both general and R1-style LLMs. Cross-domain generalization tests in medicine demonstrate effectiveness. The method yields efficient CoT learning with reduced reasoning redundancy.

Conclusion: The proposed self-sampling framework successfully enables LLMs to acquire fast-thinking capabilities, addressing the redundancy problem in CoT reasoning while eliminating the need for scarce high-quality supervision data.

Abstract: Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods-the scarcity of high-quality supervision data. Using filtered data by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at https://github.com/DYR1/S3-CoT.

[139] From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs

Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li, Haochun Wang, Qika Lin, Kai He, Bing Qin, Mengling Feng

Main category: cs.CL

TL;DR: Analysis of internal mechanisms in R1-style LLMs for self-reflection, revealing a structured progression from latent monitoring to overt reflection behavior.

Details

Motivation: To understand the internal mechanisms underlying self-reflection behavior in R1-style LLMs, as the capacity for self-reflection has attracted attention but the underlying processes remain unclear.

Method: Anchored on the onset of reflection behavior and traced its layer-wise activation trajectory using logit lens to read token-level semantics, with targeted interventions to uncover causal relationships.

Result: Uncovered a structured progression: latent-control layers encode thinking budget semantics, semantic-pivot layers show discourse-level cues, and behavior-overt layers show rising likelihood of reflection tokens. Interventions revealed a causal chain across these stages.

Conclusion: The findings suggest a human-like meta-cognitive process progressing from latent monitoring to discourse-level regulation to overt self-reflection in LLMs.

Abstract: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process-progressing from latent monitoring, to discourse-level regulation, and to finally overt self-reflection. Our analysis code can be found at https://github.com/DYR1/S3-CoT.

[140] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui

Main category: cs.CL

TL;DR: xMemory proposes a hierarchical memory system for AI agents that organizes dialogue memories into semantic components rather than using standard RAG similarity search, reducing redundancy and improving reasoning with temporally linked information.

Details

Motivation: Standard RAG pipelines are designed for large, heterogeneous corpora but don't work well for agent memory systems where dialogue streams are bounded, coherent, and contain highly correlated spans with duplicates. Fixed top-k similarity retrieval returns redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning.

Method: xMemory builds a hierarchy of intact memory units using a sparsity-semantics objective that guides memory split and merge operations. It disentangles memories into semantic components, organizes them into a searchable hierarchy, and retrieves top-down by selecting compact, diverse sets of themes and semantics for multi-fact queries, expanding to episodes and raw messages only when it reduces uncertainty.

Result: Experiments on LoCoMo and PerLTQA benchmarks across three latest LLMs show consistent gains in answer quality and token efficiency compared to standard approaches.

Conclusion: Retrieval for agent memory should move beyond similarity matching to operate over latent semantic components, using hierarchical organization to provide compact, diverse, and temporally coherent context for reasoning.

Abstract: Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-$k$ similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity–semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader’s uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.

[141] NEAT: Neuron-Based Early Exit for Large Reasoning Models

Kang Liu, Yongkang Liu, Xiaocui Yang, Peidong Wang, Wen Zhang, Shi Feng, Yifei Zhang, Daling Wang

Main category: cs.CL

TL;DR: NEAT is a neuron-based early reasoning exit framework that monitors neuron activation patterns to enable training-free early exits, reducing redundant reasoning steps in Large Reasoning Models without additional test-time computation.

Details

Motivation: Large Reasoning Models suffer from "overthinking" - generating redundant reasoning steps after reaching correct solutions. Existing early exit methods require additional rollout computation or labeled datasets, creating overhead.

Method: NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection. It operates without training or additional test-time computation by monitoring neuron-level activation dynamics.

Result: Experiments on four reasoning benchmarks across six models show NEAT achieves 22-28% average token reduction while maintaining accuracy, demonstrating effective reduction of unnecessary reasoning.

Conclusion: NEAT provides an efficient, training-free approach to mitigate overthinking in reasoning models by leveraging neuron activation patterns for early exit decisions, reducing computational overhead while preserving solution quality.

Abstract: Large Reasoning Models (LRMs) often suffer from \emph{overthinking}, a phenomenon in which redundant reasoning steps are generated after a correct solution has already been reached. Existing early reasoning exit methods primarily rely on output-level heuristics or trained probing models to skip redundant reasoning steps, thereby mitigating overthinking. However, these approaches typically require additional rollout computation or externally labeled datasets. In this paper, we propose \textbf{NEAT}, a \textbf{N}euron-based \textbf{E}arly re\textbf{A}soning exi\textbf{T} framework that monitors neuron-level activation dynamics to enable training-free early exits, without introducing additional test-time computation. NEAT identifies exit-associated neurons and tracks their activation patterns during reasoning to dynamically trigger early exit or suppress reflection, thereby reducing unnecessary reasoning while preserving solution quality. Experiments on four reasoning benchmarks across six models with different scales and architectures show that, for each model, NEAT achieves an average token reduction of 22% to 28% when averaged over the four benchmarks, while maintaining accuracy.

[142] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora

Pengyu Wang, Benfeng Xu, Licheng Zhang, Shaohan Wang, Mingxuan Du, Chiwei Zhu, Zhendong Mao

Main category: cs.CL

TL;DR: WildGraphBench is a new benchmark for evaluating GraphRAG systems using Wikipedia’s long, heterogeneous reference documents as realistic external knowledge sources.

Details

Motivation: Existing GraphRAG benchmarks use short, curated passages that don't reflect real-world scenarios with long contexts and large-scale heterogeneous documents, creating a gap in evaluating systems' practical performance.

Method: Leverage Wikipedia’s structure where articles reference external documents, using 12 top-level topics, external references as retrieval corpus, and citation-linked statements as ground truth to create 1,100 questions across three complexity levels.

Result: Experiments show current GraphRAG pipelines help with multi-fact aggregation from moderate sources but may overemphasize high-level statements at the expense of fine-grained details, leading to weaker summarization performance.

Conclusion: WildGraphBench provides a realistic benchmark for GraphRAG evaluation, revealing limitations in current aggregation paradigms and the need for better handling of fine-grained details in summarization tasks.

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia’s unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-word scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page:https://github.com/BstWPY/WildGraphBench.

[143] Closing the Loop: Universal Repository Representation with RPG-Encoder

Jane Luo, Chengyu Yin, Xin Zhang, Qingtao Li, Steven Liu, Yiming Huang, Jie Wu, Hao Liu, Yangyu Huang, Yu Kang, Fangkai Yang, Ying Xin, Scarlett Li

Main category: cs.CL

TL;DR: RPG-Encoder is a framework that creates unified repository representations by encoding code into Repository Planning Graphs, combining semantic features with dependencies for improved code understanding and generation.

Details

Motivation: Current repository agents suffer from fragmented representations using isolated API documentation or dependency graphs lacking semantic depth, creating a reasoning disconnect between comprehension and generation.

Method: Proposes RPG-Encoder that: 1) Encodes raw code into Repository Planning Graphs combining semantic features with dependencies, 2) Evolves topology incrementally to reduce maintenance costs, 3) Serves as unified interface for structure-aware navigation.

Result: Achieves 93.7% Acc@5 on SWE-bench Verified, exceeds best baseline by over 10% on SWE-bench Live Lite, reduces overhead by 95.7%, and achieves 98.5% reconstruction coverage on RepoCraft.

Conclusion: RPG-Encoder establishes state-of-the-art repository understanding with superior fine-grained localization, high-fidelity codebase mirroring, and closes the loop between intent and implementation.

Abstract: Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG’s high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.

[144] LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction – A Case Study on SDGs

Yikai Zeng, Yingchao Piao, Jianhui Li

Main category: cs.CL

TL;DR: LEC-KG: A bidirectional collaborative framework integrating LLMs’ semantic understanding with KGE’s structural reasoning for constructing domain-specific knowledge graphs from unstructured text.

Details

Motivation: Domain-specific knowledge graph construction faces challenges including heterogeneous entity mentions, long-tail relation distributions, and lack of standardized schemas, requiring better integration of semantic understanding and structural reasoning.

Method: Three key components: 1) hierarchical coarse-to-fine relation extraction to mitigate long-tail bias, 2) evidence-guided Chain-of-Thought feedback grounding structural suggestions in source text, and 3) semantic initialization enabling structural validation for unseen entities. LLMs and KGE modules enhance each other iteratively.

Result: Evaluated on Chinese Sustainable Development Goal reports, showing substantial improvements over LLM baselines, particularly on low-frequency relations. Framework reliably transforms unstructured policy text into validated knowledge graph triples through iterative refinement.

Conclusion: LEC-KG effectively integrates semantic and structural reasoning for domain-specific knowledge graph construction, demonstrating robust performance on challenging real-world policy documents with long-tail relation distributions.

Abstract: Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively-KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.

[145] Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning

Keqin Peng, Yuanxin Ouyang, Xuebo Liu, Zhiliang Tian, Ruijian Han, Yancheng Yuan, Liang Ding

Main category: cs.CL

TL;DR: DDCA addresses verbose reasoning in RLVR by dynamically decoupling efficiency optimization from correctness using conditional length advantages and difficulty-adaptive penalties.

Details

Motivation: RLVR encourages multi-step reasoning but leads to overly verbose traces, and naive length penalties in group-relative optimization hurt accuracy due to baseline dilution and difficulty-penalty mismatch.

Method: Proposes Dynamic Decoupled Conditional Advantage (DDCA) which: 1) computes length advantages conditionally within correct-response clusters to eliminate baseline dilution, and 2) dynamically scales penalty strength using group pass rate as difficulty proxy.

Result: DDCA improves efficiency-accuracy trade-off across GSM8K, MATH500, AMC23, and AIME25, reducing generated tokens by ~60% on simpler tasks (GSM8K) and over 20% on harder benchmarks (AIME25) while maintaining or improving accuracy.

Conclusion: DDCA effectively addresses structural issues in RLVR optimization, enabling more efficient reasoning without sacrificing accuracy through conditional advantage computation and difficulty-adaptive penalties.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) can elicit strong multi-step reasoning, yet it often encourages overly verbose traces. Moreover, naive length penalties in group-relative optimization can severely hurt accuracy. We attribute this failure to two structural issues: (i) Dilution of Length Baseline, where incorrect responses (with zero length reward) depress the group baseline and over-penalize correct solutions; and (ii) Difficulty-Penalty Mismatch, where a static penalty cannot adapt to problem difficulty, suppressing necessary reasoning on hard instances while leaving redundancy on easy ones. We propose Dynamic Decoupled Conditional Advantage (DDCA) to decouple efficiency optimization from correctness. DDCA computes length advantages conditionally within the correct-response cluster to eliminate baseline dilution, and dynamically scales the penalty strength using the group pass rate as a proxy for difficulty. Experiments on GSM8K, MATH500, AMC23, and AIME25 show that DDCA consistently improves the efficiency–accuracy trade-off relative to adaptive baselines, reducing generated tokens by approximately 60% on simpler tasks (e.g., GSM8K) versus over 20% on harder benchmarks (e.g., AIME25), thereby maintaining or improving accuracy. Code is available at https://github.com/alphadl/DDCA.

[146] Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs

Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel

Main category: cs.CL

TL;DR: Dicta-LM 3.0: A collection of open-weight Hebrew-English LLMs in three sizes (24B, 12B, 1.7B) adapted from existing models, with 65k context length and tool-calling support, plus a new Hebrew chat-LLM benchmark suite.

Details

Motivation: Addressing the scarcity of sovereign LLMs for non-English languages, particularly low-resource languages like Hebrew, despite high demand for such models in non-English speaking regions.

Method: Adaptation approach using existing base models (Mistral-Small-3.1, NVIDIA Nemotron Nano V2, Qwen3-1.7B) trained on substantial Hebrew-English corpora, released in multiple variants with 65k context length and tool-calling capabilities.

Result: Released three model sizes (24B, 12B, 1.7B) with base and chat variants, plus a comprehensive Hebrew benchmark suite covering Translation, Summarization, Winograd, Israeli Trivia, and Diacritization tasks.

Conclusion: The work addresses low-resource language LLM training challenges and provides a framework for adapting LLMs to other non-English languages, contributing to multilingual NLP advancement.

Abstract: Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B - adapted from the Mistral-Small-3.1 base model, 12B - adapted from the NVIDIA Nemotron Nano V2 model, and 1.7B - adapted from the Qwen3-1.7B base model. We are releasing multiple variants of each model, each with a native context length of 65k tokens; base model and chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.

[147] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao

Main category: cs.CL

TL;DR: OOMB is a memory-efficient training system for LLMs that enables training with extremely long contexts (up to 4M tokens) on a single GPU by maintaining constant activation memory and optimizing KV cache management.

Details

Motivation: Training LLMs on long contexts is severely limited by GPU memory overhead from activations that scale linearly with sequence length, requiring large clusters with context parallelism.

Method: Uses chunk-recurrent training with on-the-fly activation recomputation for O(1) activation memory, plus paged memory manager for KV cache, asynchronous CPU offloading, and page-level sparse attention to manage KV cache growth.

Result: Achieves only 10MB memory increase per 10K additional tokens for Qwen2.5-7B, enabling 4M-token context training on a single H200 GPU instead of requiring large clusters.

Conclusion: OOMB represents a substantial advance in resource efficiency for long-context LLM training, making extremely long context training feasible on single GPUs.

Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at https://github.com/wenhaoli-xmu/OOMB.

[148] There Is More to Refusal in Large Language Models than a Single Direction

Faaiz Joad, Majd Hawasly, Sabri Boughorbel, Nadir Durrani, Husrev Taha Sencar

Main category: cs.CL

TL;DR: Refusal behaviors in LLMs correspond to multiple distinct activation directions rather than a single one, but linear steering along any refusal direction produces similar refusal-to-over-refusal trade-offs, acting as a shared control mechanism.

Details

Motivation: To challenge the prior understanding that refusal in large language models is mediated by a single activation-space direction, and to investigate the geometric structure of refusal behaviors across diverse categories including safety, incomplete requests, anthropomorphization, and over-refusal.

Method: Analyzed eleven categories of refusal and non-compliance behaviors across LLMs, examining their corresponding activation-space directions. Used linear steering techniques to manipulate these directions and observe resulting refusal behaviors and trade-offs.

Result: Found that refusal behaviors correspond to geometrically distinct directions in activation space, contrary to the single-direction hypothesis. However, linear steering along any refusal-related direction produces nearly identical refusal-to-over-refusal trade-offs, suggesting a shared one-dimensional control mechanism. The primary effect of different directions is not whether the model refuses, but how it refuses.

Conclusion: The account of refusal being mediated by a single activation-space direction is incomplete. While refusal behaviors are geometrically diverse, they share a common control mechanism for refusal intensity, with different directions primarily affecting the style or manner of refusal rather than the decision to refuse itself.

Abstract: Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.

[149] Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Chenlong Wang, Yuhang Chen, Zhihan Hu, Dongping Chen, Wenhu Chen, Sarah Wiegreffe, Tianyi Zhou

Main category: cs.CL

TL;DR: GapEval benchmark reveals persistent gap between understanding and generation capabilities in unified multimodal models, showing only surface-level unification rather than deep cognitive convergence.

Details

Motivation: To investigate whether understanding and generation capabilities are genuinely aligned and integrated within unified multimodal models, as current models claim unification but may only achieve surface-level integration.

Method: Introduces GapEval, a bidirectional benchmark that quantifies the gap between understanding and generation capabilities through symmetric evaluation where each question can be answered in both image and text modalities, measuring cross-modal consistency.

Result: Experiments reveal a persistent gap between understanding and generation directions across various UMM architectures, indicating knowledge within models remains disjoint and capability emergence across modalities is unsynchronized.

Conclusion: Current unified multimodal models achieve only surface-level unification rather than deep cognitive convergence, with knowledge manipulation limitations preventing genuine integration of understanding and generation capabilities.

Abstract: Recent advances in unified multimodal models (UMM) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities, and quantitatively measure the cognitive coherence of the two “unified” directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model’s bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To further explore the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation to illustrate the underlying limitations. Our findings indicate that knowledge within UMMs often remains disjoint. The capability emergence and knowledge across modalities are unsynchronized, paving the way for further exploration.

[150] Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

Lingkun Long, Yushi Huang, Shihao Bai, Ruihao Gong, Jun Zhang, Ao Zhou, Jianlei Yang

Main category: cs.CL

TL;DR: Focus-dLLM is a training-free attention sparsification framework for diffusion LLMs that achieves 29x speedup for long-context inference by predicting unmasked regions and pruning redundant attention while preserving attention sinks.

Details

Motivation: Diffusion LLMs have strong long-context capabilities but suffer from high computational costs due to bidirectional full attention. Existing sparse attention methods are ineffective because they need to estimate attention importance for tokens yet to be decoded, while unmasked positions are unknown during diffusion.

Method: 1) Design a past confidence-guided indicator to predict unmasked regions based on token confidence correlation across adjacent steps. 2) Propose sink-aware pruning strategy to estimate and remove redundant attention while preserving attention sinks. 3) Reuse identified sink locations across layers leveraging cross-layer consistency.

Result: Achieves more than 29x lossless speedup under 32K context length, demonstrating efficient long-context inference for diffusion LLMs without accuracy degradation.

Conclusion: Focus-dLLM provides an effective training-free attention sparsification framework that significantly improves inference efficiency for diffusion LLMs while maintaining accuracy, addressing the computational bottleneck of bidirectional attention in long-context processing.

Abstract: Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM

[151] D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use

Bowen Xu, Shaoyu Wu, Hao Jiang, Kai Liu, Xin Chen, Lulu Hu, Bin Yang

Main category: cs.CL

TL;DR: D-CORE is a two-stage training framework that improves large reasoning models’ tool use by enhancing sub-task decomposition and reflective reasoning capabilities through self-distillation and diversity-aware reinforcement learning.

Details

Motivation: Current large reasoning models lack sub-task decomposition capability in complex tool use scenarios, leading to "Lazy Reasoning" where they fail to properly break down complex problems into manageable steps.

Method: Two-stage framework: 1) Self-distillation to incentivize task decomposition reasoning capability, 2) Diversity-aware reinforcement learning to restore reflective reasoning capability.

Result: D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. D-CORE-8B reaches 77.7% accuracy on BFCLv3 (5.7% improvement over best 8B model), and D-CORE-14B establishes new SOTA at 79.3%, outperforming 70B models despite being 5× smaller.

Conclusion: D-CORE effectively addresses lazy reasoning in large reasoning models by enhancing decomposition and reflective reasoning capabilities, achieving state-of-the-art performance in tool use tasks with significantly smaller models.

Abstract: Effective tool use and reasoning are essential capabilities for large reasoning models~(LRMs) to address complex real-world problems. Through empirical analysis, we identify that current LRMs lack the capability of sub-task decomposition in complex tool use scenarios, leading to Lazy Reasoning. To address this, we propose a two-stage training framework D-CORE~(\underline{\textbf{D}}ecomposing tasks and \underline{\textbf{Co}}mposing \underline{\textbf{Re}}asoning processes) that first incentivize the LRMs’ task decomposition reasoning capability via self-distillation, followed by diversity-aware reinforcement learning~(RL) to restore LRMs’ reflective reasoning capability. D-CORE achieves robust tool-use improvements across diverse benchmarks and model scales. Experiments on BFCLv3 demonstrate superiority of our method: D-CORE-8B reaches 77.7% accuracy, surpassing the best-performing 8B model by 5.7%. Meanwhile, D-CORE-14B establishes a new state-of-the-art at 79.3%, outperforming 70B models despite being 5$\times$ smaller. The source code is available at https://github.com/alibaba/EfficientAI.

[152] AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?

Liang Lin, Feng Xiong, Zengbin Wang, Kun Wang, Junhao Dong, Xuecai Hu, Yong Wang, Xiangxiang Chu

Main category: cs.CL

TL;DR: AR-MAP transfers preference alignment knowledge from autoregressive LLMs to diffusion LLMs through weight scaling, avoiding high-variance direct alignment methods.

Details

Motivation: Diffusion LLMs face challenges in preference alignment due to high variance in ELBO-based likelihood estimation, requiring a more efficient transfer learning approach from aligned autoregressive models.

Method: Proposes AR-MAP framework that leverages preference-aligned autoregressive LLMs as implicit teachers for DLLM alignment through simple weight scaling, exploiting shared architectural structure between different generation paradigms.

Result: Achieves 69.08% average score across diverse preference alignment tasks, demonstrating competitive or superior performance compared to existing DLLM-specific alignment methods.

Conclusion: AR-MAP provides an effective transfer learning approach for DLLM alignment that circumvents computational overhead and variance issues of direct alignment methods.

Abstract: Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment and comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving 69.08% average score across all tasks and models. Our Code is available at https://github.com/AMAP-ML/AR-MAP.

[153] Evaluating Metalinguistic Knowledge in Large Language Models across the World’s Languages

Tjaša Arčon, Matej Klemen, Marko Robnik-Šikonja, Kaja Dobrovoljc

Main category: cs.CL

TL;DR: LLMs show limited metalinguistic knowledge across languages, performing moderately at best and heavily influenced by digital resource availability rather than true grammatical competence.

Details

Motivation: To assess LLMs' explicit knowledge of linguistic structure (metalinguistic knowledge) across diverse languages, moving beyond narrow phenomena and high-resource language focus in existing benchmarks.

Method: Created a benchmark evaluating metalinguistic reasoning across linguistic domains (lexical, phonological, morphological, syntactic) and languages, analyzing performance using accuracy, macro F1, and comparing to majority-class and chance baselines.

Result: GPT-4o performed best but only achieved moderate accuracy (0.367); all models performed above chance but failed to beat majority-class baseline. Performance varied by linguistic domain (lexical highest, phonological lowest) and strongly correlated with digital language status and resource availability.

Conclusion: LLMs’ metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence, highlighting the need for more diverse linguistic evaluation and development.

Abstract: Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge-explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs’ metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world’s languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.

[154] Sinhala Physical Common Sense Reasoning Dataset for Global PIQA

Nisansa de Silva, Surangika Ranathunga

Main category: cs.CL

TL;DR: First Sinhala physical commonsense reasoning dataset for Global PIQA with 110 human-created samples in Sri Lankan context

Details

Motivation: To create the first physical commonsense reasoning dataset for Sinhala language, addressing the lack of such resources for low-resource languages and supporting the Global PIQA initiative for multilingual AI development

Method: Human creation and verification of 110 data samples, each containing a prompt, correct answer, and wrong answer, with most questions contextualized to Sri Lankan culture and environment

Result: Successfully created the first Sinhala physical commonsense reasoning dataset with 110 verified samples, contributing to Global PIQA’s multilingual coverage

Conclusion: The dataset fills a gap for Sinhala language resources and supports development of multilingual AI systems with physical commonsense reasoning capabilities

Abstract: This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.

[155] Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study

Md. Toufique Hasan, Ayman Asad Khan, Mika Saari, Vaishnavi Bankhele, Pekka Abrahamsson

Main category: cs.CL

TL;DR: AgriHubi is a Finnish-language agricultural decision support system using retrieval-augmented generation with domain adaptation for low-resource language settings.

Details

Motivation: LLMs have potential for knowledge-intensive domains like agriculture but face limitations: weak grounding, English-centric training, limited real-world evaluation, and amplified issues for low-resource languages where domain documentation exists but is hard to access through general models.

Method: Developed AgriHubi, a domain-adapted RAG system for Finnish agriculture that integrates Finnish agricultural documents with open PORO family models, combines explicit source grounding with user feedback for iterative refinement, and was developed over eight iterations.

Result: System shows clear gains in answer completeness, linguistic accuracy, and perceived reliability; reveals practical trade-offs between response quality and latency when deploying larger models; evaluated through two user studies.

Conclusion: Provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.

Abstract: Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.

[156] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

Yuzheng Xu, Tosho Hirasawa, Tadashi Kozuno, Yoshitaka Ushiku

Main category: cs.CL

TL;DR: LLM-as-a-judge rubric evaluation suffers from position bias where models prefer scores at specific positions, and balanced permutation aggregation improves reliability and human correlation.

Details

Motivation: While LLM-as-a-judge evaluation has been studied for point-wise and pair-wise paradigms, rubric-based evaluation (where LLMs select scores from multiple rubrics) has received less analysis. The authors hypothesize that rubric-based evaluation resembles a multi-choice setting and may suffer from position bias.

Method: Conducted controlled experiments across multiple models and datasets to demonstrate consistent position bias. Proposed a balanced permutation strategy that evenly distributes each score option across different positions. Aggregated scores across these balanced permutations to mitigate bias.

Result: Found consistent position bias in rubric-based LLM evaluation. Showed that aggregating scores across balanced permutations not only reveals latent position bias but also improves correlation between LLM-as-a-Judge and human judgments.

Conclusion: Rubric-based LLM-as-a-Judge is not inherently point-wise and suffers from position bias. Simple permutation-based calibration can substantially improve its reliability and alignment with human evaluation.

Abstract: Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior works mainly focus on point-wise and pair-wise evaluation paradigms. Rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the LLM-as-a-Judge and human. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.

[157] Using Correspondence Patterns to Identify Irregular Words in Cognate sets Through Leave-One-Out Validation

Frederic Blum, Johann-Mattis List

Main category: cs.CL

TL;DR: A computational method for quantifying regularity in historical language comparison using balanced average recurrence of correspondence patterns to identify irregular cognate sets.

Details

Motivation: Current historical linguistics relies on intuitive judgments of regularity in sound correspondences, but irregularity is more common than expected. There's a need for quantitative evaluation methods to improve workflows in computer-assisted language comparison.

Method: Proposes balanced average recurrence of correspondence patterns as a new regularity measure, with a computational method to identify cognate sets lacking regularity. Uses leave-one-out validation with simulated and real data to test identification of irregular forms.

Result: Method achieves 85% accuracy with real datasets. Shows benefits of working with subsamples of large datasets and demonstrates how increasing irregularity affects results.

Conclusion: The new regularity measure and irregular cognate identification method could improve quality of existing and future datasets in computer-assisted language comparison.

Abstract: Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.

[158] OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data

Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng

Main category: cs.CL

TL;DR: OpenSeal is the first truly open-source Southeast Asian LLM built using parallel data for multilingual extension, achieving performance comparable to similar-sized models with minimal compute.

Details

Motivation: Most LLMs are English-centric and perform poorly on low-resource languages. Existing Southeast Asia-focused LLMs aren't truly open source as they don't disclose training data. Truly open-source models are needed for transparency and understanding of LLM internals, biases, generalization, and multilinguality.

Method: Conducted controlled experiments on parallel data effectiveness for continual pretraining of LLMs. Used only 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs to build OpenSeal, focusing on parallel data as the most effective way to extend LLMs to new languages.

Result: Built OpenSeal, the first truly open Southeast Asian LLM that rivals performance of existing models of similar size. Found that using only parallel data is most effective for extending LLMs to new languages.

Conclusion: Parallel data is highly effective for multilingual LLM extension. OpenSeal demonstrates that truly open-source, high-performance multilingual models can be built with modest compute resources, advancing transparency in LLM development.

Abstract: Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.

[159] dziribot: rag based intelligent conversational agent for algerian arabic dialect

El Batoul Bechiri, Dihia Lanasri

Main category: cs.CL

TL;DR: DziriBOT is a hybrid conversational agent for Algerian Darja dialect, combining NLU with RAG to handle linguistic complexities like code-switching and orthographic variations, achieving SOTA with fine-tuned DziriBERT.

Details

Motivation: Address the challenge of building conversational agents for the Algerian Darja dialect, which has non-standardized orthography, extensive French code-switching, and uses both Arabic and Latin scripts, in a low-resource setting.

Method: Multi-layered architecture integrating specialized NLU with Retrieval-Augmented Generation (RAG). Evaluated three approaches: sparse-feature Rasa pipeline, classical ML baselines, and transformer-based fine-tuning (DziriBERT).

Result: Fine-tuned DziriBERT model achieves state-of-the-art performance, significantly outperforming traditional baselines, especially in handling orthographic noise and rare intents.

Conclusion: DziriBOT provides a robust, scalable solution bridging formal language models with Algerian linguistic realities, offering a blueprint for dialect-aware automation in regional markets.

Abstract: The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.

[160] Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C. Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li, Yuxiao Li, Zhaowei Li, Zheming Li, Weilong Liao, Jiawei Lin, Xiaohan Lin, Zhishan Lin, Zichao Lin, Cheng Liu, Chenyu Liu, Hongzhang Liu, Liang Liu, Shaowei Liu, Shudong Liu, Shuran Liu, Tianwei Liu, Tianyu Liu, Weizhou Liu, Xiangyan Liu, Yangyang Liu, Yanming Liu, Yibo Liu, Yuanxin Liu, Yue Liu, Zhengying Liu, Zhongnuo Liu, Enzhe Lu, Haoyu Lu, Zhiyuan Lu, Junyu Luo, Tongxu Luo, Yashuo Luo, Long Ma, Yingwei Ma, Shaoguang Mao, Yuan Mei, Xin Men, Fanqing Meng, Zhiyong Meng, Yibo Miao, Minqing Ni, Kun Ouyang, Siyuan Pan, Bo Pang, Yuchao Qian, Ruoyu Qin, Zeyu Qin, Jiezhong Qiu, Bowen Qu, Zeyu Shang, Youbo Shao, Tianxiao Shen, Zhennan Shen, Juanfeng Shi, Lidong Shi, Shengyuan Shi, Feifan Song, Pengwei Song, Tianhui Song, Xiaoxi Song, Hongjin Su, Jianlin Su, Zhaochen Su, Lin Sui, Jinsong Sun, Junyao Sun, Tongyu Sun, Flood Sung, Yunpeng Tai, Chuning Tang, Heyi Tang, Xiaojuan Tang, Zhengyang Tang, Jiawen Tao, Shiyuan Teng, Chaoran Tian, Pengfei Tian, Ao Wang, Bowen Wang, Chensi Wang, Chuang Wang, Congcong Wang, Dingkun Wang, Dinglu Wang, Dongliang Wang, Feng Wang, Hailong Wang, Haiming Wang, Hengzhi Wang, Huaqing Wang, Hui Wang, Jiahao Wang, Jinhong Wang, Jiuzheng Wang, Kaixin Wang, Linian Wang, Qibin Wang, Shengjie Wang, Shuyi Wang, Si Wang, Wei Wang, Xiaochen Wang, Xinyuan Wang, Yao Wang, Yejie Wang, Yipu Wang, Yiqin Wang, Yucheng Wang, Yuzhi Wang, Zhaoji Wang, Zhaowei Wang, Zhengtao Wang, Zhexu Wang, Zihan Wang, Zizhe Wang, Chu Wei, Ming Wei, Chuan Wen, Zichen Wen, Chengjie Wu, Haoning Wu, Junyan Wu, Rucong Wu, Wenhao Wu, Yuefeng Wu, Yuhao Wu, Yuxin Wu, Zijian Wu, Chenjun Xiao, Jin Xie, Xiaotong Xie, Yuchong Xie, Yifei Xin, Bowei Xing, Boyu Xu, Jianfan Xu, Jing Xu, Jinjing Xu, L. H. Xu, Lin Xu, Suting Xu, Weixin Xu, Xinbo Xu, Xinran Xu, Yangchuan Xu, Yichang Xu, Yuemeng Xu, Zelai Xu, Ziyao Xu, Junjie Yan, Yuzi Yan, Guangyao Yang, Hao Yang, Junwei Yang, Kai Yang, Ningyuan Yang, Ruihan Yang, Xiaofei Yang, Xinlong Yang, Ying Yang, Yi Yang, Yi Yang, Zhen Yang, Zhilin Yang, Zonghan Yang, Haotian Yao, Dan Ye, Wenjie Ye, Zhuorui Ye, Bohong Yin, Chengzhen Yu, Longhui Yu, Tao Yu, Tianxiang Yu, Enming Yuan, Mengjie Yuan, Xiaokun Yuan, Yang Yue, Weihao Zeng, Dunyuan Zha, Haobing Zhan, Dehao Zhang, Hao Zhang, Jin Zhang, Puqi Zhang, Qiao Zhang, Rui Zhang, Xiaobin Zhang, Y. Zhang, Yadong Zhang, Yangkun Zhang, Yichi Zhang, Yizhi Zhang, Yongting Zhang, Yu Zhang, Yushun Zhang, Yutao Zhang, Yutong Zhang, Zheng Zhang, Chenguang Zhao, Feifan Zhao, Jinxiang Zhao, Shuai Zhao, Xiangyu Zhao, Yikai Zhao, Zijia Zhao, Huabin Zheng, Ruihan Zheng, Shaojie Zheng, Tengyang Zheng, Junfeng Zhong, Longguang Zhong, Weiming Zhong, M. Zhou, Runjie Zhou, Xinyu Zhou, Zaida Zhou, Jinguo Zhu, Liya Zhu, Xinhao Zhu, Yuxuan Zhu, Zhen Zhu, Jingze Zhuang, Weiyu Zhuang, Ying Zou, Xinxing Zu

Main category: cs.CL

TL;DR: Kimi K2.5 is an open-source multimodal agentic model with joint text-vision optimization and Agent Swarm framework for parallel task decomposition and execution.

Details

Motivation: To advance general agentic intelligence by creating a multimodal model that jointly optimizes text and vision modalities, enabling them to enhance each other for better performance across various domains.

Method: Uses joint text-vision pre-training, zero-vision SFT (supervised fine-tuning), and joint text-vision reinforcement learning. Introduces Agent Swarm - a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently.

Result: Achieves state-of-the-art results across coding, vision, reasoning, and agentic tasks. Agent Swarm reduces latency by up to 4.5× over single-agent baselines.

Conclusion: Kimi K2.5 represents an advancement in multimodal agentic intelligence with its joint text-vision optimization and efficient parallel execution framework, with the model checkpoint released to facilitate future research.

Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

[161] Culinary Crossroads: A RAG Framework for Enhancing Diversity in Cross-Cultural Recipe Adaptation

Tianyi Hu, Andrea Morales-Garzón, Jingyi Zheng, Maria Maistro, Daniel Hershcovich

Main category: cs.CL

TL;DR: CARRIAGE is a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization to generate varied recipe adaptations for different dietary needs and preferences.

Details

Motivation: Standard RAG approaches for cross-cultural recipe adaptation tend to overly rely on limited context, failing to produce diverse outputs even with varied contextual inputs, which is problematic for creative tasks with multiple valid answers like recipe adaptation for different dietary preferences.

Method: CARRIAGE enhances diversity in both retrieval and context organization stages of RAG. It’s a plug-and-play framework that explicitly aims to generate highly diverse outputs by better leveraging contextual diversity.

Result: CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs, demonstrating improved ability to generate varied recipe adaptations.

Conclusion: CARRIAGE addresses a key limitation of standard RAG in creative tasks by enhancing diversity generation, making it the first RAG framework explicitly designed for generating highly diverse outputs to accommodate multiple user preferences.

Abstract: In cross-cultural recipe adaptation, the goal is not only to ensure cultural appropriateness and retain the original dish’s essence, but also to provide diverse options for various dietary needs and preferences. Retrieval Augmented Generation (RAG) is a promising approach, combining the retrieval of real recipes from the target cuisine for cultural adaptability with large language models (LLMs) for relevance. However, it remains unclear whether RAG can generate diverse adaptation results. Our analysis shows that RAG tends to overly rely on a limited portion of the context across generations, failing to produce diverse outputs even when provided with varied contextual inputs. This reveals a key limitation of RAG in creative tasks with multiple valid answers: it fails to leverage contextual diversity for generating varied responses. To address this issue, we propose CARRIAGE, a plug-and-play RAG framework for cross-cultural recipe adaptation that enhances diversity in both retrieval and context organization. To our knowledge, this is the first RAG framework that explicitly aims to generate highly diverse outputs to accommodate multiple user preferences. Our experiments show that CARRIAGE achieves Pareto efficiency in terms of diversity and quality of recipe adaptation compared to closed-book LLMs.

[162] Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Isaac Chung, Linda Freienthal

Main category: cs.CL

TL;DR: Cross-lingual evaluation of LLMs shows that while surface-level metrics are stable across languages, pragmatic judgments exhibit ranking instabilities, revealing that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages.

Details

Motivation: To investigate the reliability of cross-lingual evaluation of LLMs by distinguishing between genuine model performance differences and measurement instability, particularly for morphologically rich languages.

Method: Used synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, then tested automatic metrics and LLM-as-a-judge scoring for stability across languages, with Estonian native speaker annotations as reference.

Result: Surface-level metrics (lexical diversity, similarity) maintained cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibited rank inversions and near-zero correlations, indicating evaluation method instability rather than true model differences.

Conclusion: Zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, requiring language-specific calibration against human baselines; controlled generation provides diagnostic probe for evaluation method stability.

Abstract: Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.

[163] Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Alex Argese, Pasquale Lisena, Raphaël Troncy

Main category: cs.CL

TL;DR: StoryScore: A composite metric for evaluating AI-generated scientific stories that integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection.

Details

Motivation: Current evaluation metrics for AI-generated scientific narratives fail to capture storytelling qualities like abstraction, simplification, and pedagogical creativity, while hallucination detectors often misclassify legitimate narrative reformulations or prove unstable with creative content.

Method: Proposes StoryScore, a unified framework that combines multiple dimensions: semantic alignment with original content, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection.

Result: Analysis reveals that existing hallucination detection methods struggle to distinguish pedagogical creativity from factual errors, showing that automatic metrics can assess semantic similarity but fail to evaluate narrative control and storytelling quality.

Conclusion: StoryScore provides a comprehensive evaluation framework for AI-generated scientific stories, addressing the limitations of current metrics in capturing storytelling quality while maintaining factual accuracy.

Abstract: Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity-qualities that are not often well-captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet, detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.

[164] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi

Main category: cs.CL

TL;DR: MGS (Modular Gradient Surgery) addresses cross-domain interference in multi-task RL for large reasoning models by resolving gradient conflicts at the module level within transformers.

Details

Motivation: Training single general-purpose large reasoning models across diverse domains is challenging due to domain heterogeneity causing cross-domain interference in both sequential and mixed RL approaches, limiting overall gains.

Method: Introduces Modular Gradient Surgery (MGS) which resolves gradient conflicts at the module level within transformer architectures, specifically applied to Llama and Qwen models across math, general chat, and instruction following domains.

Result: MGS achieves average improvements of 4.3 points (16.6%) for Llama and 4.5 points (11.1%) for Qwen over standard multi-task RL across three domains, remaining effective under prolonged training.

Conclusion: The study clarifies sources of interference in multi-domain RL and presents an effective solution (MGS) for training general-purpose large reasoning models, addressing gradient conflicts at the module level.

Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.

[165] The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models’ Posteriors

Raphaël Sarfati, Eric Bigelow, Daniel Wurgaft, Jack Merullo, Atticus Geiger, Owen Lewis, Tom McGrath, Ekdeep Singh Lubana

Main category: cs.CL

TL;DR: LLMs encode curved belief manifolds for distribution parameters, and geometry-aware steering preserves belief structure better than linear interventions

Details

Motivation: To understand how LLMs encode and update probabilistic beliefs in representation space, and how interventions reshape these beliefs, using a controlled distribution inference task

Method: Study Llama-3.2 generating samples from normal distributions by inferring parameters from in-context samples, analyze belief manifolds, and test different steering methods (standard linear vs geometry-aware)

Result: Found curved belief manifolds form with sufficient in-context learning; geometry-aware steering better preserves belief structure while linear steering pushes models off-manifold

Conclusion: Rich structure emerges naturally in LLMs, linear concept representations are often inadequate, and geometry-aware interventions respect underlying manifold structure

Abstract: Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) given only samples from the distribution in context. We find representations of curved “belief manifolds” for these parameters form with sufficient in-context learning and study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry and field-aware steering better preserves the intended belief family. Our work demonstrates an example of linear field probing (LFP) as a simple approach to tile the data manifold and make interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.

[166] A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

Main category: cs.CL

TL;DR: Automated framework generates large-scale molecular structure descriptions using IUPAC name parsing and LLM-guided text generation, achieving 98.6% precision on 163k molecule-description pairs.

Details

Motivation: Molecular function depends on structure, but human annotation for structure-language alignment is too costly. Need automated method to create large-scale, high-quality datasets for training LLMs on chemical reasoning tasks.

Method: Extends rule-based chemical nomenclature parser to interpret IUPAC names into structured XML metadata, then uses this metadata to guide LLMs in generating accurate natural language descriptions automatically.

Result: Created dataset of ~163k molecule-description pairs with 98.6% precision validated by LLM-based and expert human evaluation on 2,000 molecules.

Conclusion: Provides scalable automated annotation method for molecular structure descriptions, enabling future molecule-language alignment for chemical reasoning tasks.

Abstract: Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.

[167] Language Steering for Multilingual In-Context Learning

Neeraja Kirtane, Kuan-Hao Huang

Main category: cs.CL

TL;DR: Training-free language steering method using activation differences between languages to improve multilingual in-context learning performance without parameter updates.

Details

Motivation: Multilingual LLMs perform substantially worse on non-English languages compared to English, especially in in-context learning scenarios where English demonstrations lead to poor performance on non-English inputs.

Method: Proposes language vectors - a training-free approach that leverages activation differences between source and target languages. Adds these vectors to intermediate model activations during inference to shift representations toward target language space without parameter updates.

Result: Consistent improvements on multilingual in-context learning across 19 languages, 3 datasets, and 3 models. Hierarchical clustering reveals meaningful linguistic structure aligned with language families. Vectors successfully transfer across tasks, showing task-agnostic representations.

Conclusion: Language vectors effectively steer LLM behavior toward target languages, improving multilingual performance while revealing universal semantic space structure across languages.

Abstract: While multilingual large language models have gained widespread adoption, their performance on non-English languages remains substantially inferior to English. This disparity is particularly evident in in-context learning scenarios, where providing demonstrations in English but testing on non-English inputs leads to significant performance degradation. In this paper, we hypothesize that LLMs develop a universal semantic space for understanding languages, where different languages are encoded as distinct directions within this space. Based on this hypothesis, we propose language vectors – a training-free language steering approach that leverages activation differences between source and target languages to guide model behavior. We steer the model generations by adding the vector to the intermediate model activations during inference. This is done to make the model’s internal representations shift towards the target language space without any parameter updates. We evaluate our method across three datasets and test on a total of 19 languages on three different models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested. Beyond performance gains, hierarchical clustering of steering vectors reveals meaningful linguistic structure aligned with language families. These vectors also successfully transfer across tasks, demonstrating that these representations are task-agnostic.

[168] Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: A unified framework for analyzing LLM control methods as dynamic weight updates, with preference-utility trade-off analysis and a new steering approach SPLIT

Details

Motivation: Existing methods for controlling LLMs (fine-tuning, LoRA, activation interventions) are studied in isolation, making comparisons difficult. The paper aims to provide a unified view and analysis framework.

Method: Proposes a unified framework treating interventions as dynamic weight updates, introduces preference-utility analysis using polarity-paired contrastive examples, develops activation manifold perspective, and creates SPLIT steering approach.

Result: Found consistent trade-off between preference (target concept tendency) and utility (coherent generation). Stronger control increases preference but reduces utility. SPLIT improves preference while better preserving utility.

Conclusion: Provides unified framework for understanding LLM control methods, reveals fundamental preference-utility trade-off, and offers practical steering approach SPLIT based on these insights.

Abstract: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model’s valid-generation manifold. Finally, we introduce a new steering approach SPLIT guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.

[169] Automated Multiple Mini Interview (MMI) Scoring

Ryan Huynh, Frank Guerin, Alison Callwood

Main category: cs.CL

TL;DR: Multi-agent prompting framework for automated assessment of soft skills in interviews, outperforming fine-tuned LLMs and achieving human-level reliability.

Details

Motivation: Human scoring of soft skills like empathy and communication in interviews is inconsistent and biased. While LLMs have improved automated essay scoring, they struggle with the abstract, context-dependent nature of interview transcripts where implicit signals are crucial.

Method: Multi-agent prompting framework that breaks evaluation into transcript refinement and criterion-specific scoring using 3-shot in-context learning with a large instruct-tuned model, without additional training.

Result: Outperforms specialized fine-tuned baselines (Avg QWK 0.62 vs 0.32), achieves reliability comparable to human experts, and generalizes to ASAP benchmark where it rivals domain-specific state-of-the-art models.

Conclusion: For complex, subjective reasoning tasks like interview assessment, structured prompt engineering offers a scalable alternative to data-intensive fine-tuning, changing how LLMs can be applied to automated evaluation.

Abstract: Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.

[170] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang

Main category: cs.CL

TL;DR: A scalable pipeline for generating high-quality “question-proof-check” triplet data using LLMs, enabling training of proof-checking reward models to enhance mathematical reasoning in LLMs through RL with verifiable rewards.

Details

Motivation: While LLMs show strong math reasoning via RL with verifiable rewards, many advanced mathematical problems are proof-based with no simple way to verify proof authenticity through answer matching alone. There's a need for reward models that can reliably evaluate full proof processes.

Method: Designed a scalable data-construction pipeline using LLMs to generate diverse “question-proof-check” triplet data with minimal human effort. Systematically varied problem sources, generation methods, and model configurations to create diverse problem-proof pairs across difficulty levels, linguistic styles, and error types. Used hierarchical human review for label alignment, then trained a proof-checking RM with additional process reward and token weight balance to stabilize RL.

Result: Experiments validated the model’s scalability and strong performance from multiple perspectives: reward accuracy, generalization ability, and test-time guidance. The approach provides practical recipes and tools for strengthening LLM mathematical capabilities.

Conclusion: The work enables automatic verification of mathematical proofs through scalable data generation and proof-checking reward models, addressing a key limitation in current LLM mathematical reasoning approaches that rely on simple answer matching.

Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a scalable data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality “question-proof-check” triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model’s scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.

[171] From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making

Raunak Jain, Mudita Khurana, John Stephens, Srinivas Dharmasanam, Shankar Venkataraman

Main category: cs.CL

TL;DR: Paper proposes shifting from LLM answer generation to collaborative premise governance for reliable human-AI partnership in high-stakes decisions, using discrepancy-driven control loops and commitment gating.

Details

Motivation: Current LLMs exhibit dangerous sycophantic behavior in decision support - they provide fluent agreement without calibrated judgment, baking in implicit assumptions and pushing verification costs onto experts. This is particularly problematic in deep-uncertainty decisions where objectives are contested and reversals are costly.

Method: Proposes collaborative premise governance over a knowledge substrate, with discrepancy-driven control loops that detect conflicts and localize misalignment via typed discrepancies (teleological, epistemic, procedural). Uses bounded negotiation through decision slices, commitment gating to block action on uncommitted load-bearing premises, and value-gated challenge allocation under interaction cost constraints.

Result: Theoretical framework illustrated with tutoring examples, proposing trust should attach to auditable premises and evidence standards rather than conversational fluency. Provides falsifiable evaluation criteria for the approach.

Conclusion: Reliable human-AI partnership requires fundamental shift from answer generation to collaborative premise governance, with systematic discrepancy detection and commitment management to prevent sycophantic agreement in high-stakes decision support.

Abstract: As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.

[172] ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs

Ziyan Zhang, Chao Wang, Zhuo Chen, Chiyi Li, Kai Song

Main category: cs.CL

TL;DR: ROG: A retrieval-augmented framework combining query-aware neighborhood retrieval with LLM chain-of-thought reasoning for answering complex FOL queries over incomplete knowledge graphs.

Details

Motivation: Answering complex first-order logic queries (with projection, intersection, union, negation) over incomplete knowledge graphs is challenging, especially for deep reasoning chains with compounding errors in existing approaches.

Method: ROG decomposes multi-operator queries into single-operator sub-queries, grounds each step in query-relevant neighborhood evidence via retrieval, uses LLM chain-of-thought reasoning, and caches intermediate answer sets for reuse across steps.

Result: Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with largest improvements on high-complexity and negation-heavy query types.

Conclusion: ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference, reducing compounding errors and improving robustness on complex queries.

Abstract: Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.

[173] Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank

Joshua Mitton, Prarthana Bhattacharyya, Digory Smith, Thomas Christie, Ralph Abboud, Simon Woodhead

Main category: cs.CL

TL;DR: A two-stage LLM approach for detecting student misconceptions from tutoring dialogues using generation and reranking with embedding similarity.

Details

Motivation: Timely identification of student misconceptions is crucial for improving learning outcomes but heavily relies on teacher effort and intuition, creating a need for automated detection systems.

Method: Two-stage approach: 1) Fine-tuned LLM generates plausible misconceptions, 2) Embedding similarity retrieves promising candidates, 3) Another fine-tuned LLM assesses and reranks candidates for improved relevance.

Result: The approach improves predictive performance over baselines; fine-tuning enhances misconception quality and can outperform larger closed-source models; ablation studies validate the importance of generation and reranking steps.

Conclusion: The proposed LLM-based system effectively detects student misconceptions from dialogues, with fine-tuned models showing superior performance over larger closed-source alternatives.

Abstract: Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLM models including LLaMA, Qwen and Claude on zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models and that fine-tuning improves both generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to both validate the importance of our generation and reranking steps on misconception generation quality.

[174] Large Language Models for Mental Health: A Multilingual Evaluation

Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Ana-Maria Bucur, Stevie Chancellor, Marcos Zampieri

Main category: cs.CL

TL;DR: Evaluation of LLMs on multilingual mental health datasets shows competitive performance on original data but degradation on machine-translated versions, with variation by language typology.

Details

Motivation: To explore LLM capabilities in multilingual mental health contexts, which haven't been thoroughly studied despite LLMs' strong NLP performance generally.

Method: Evaluated proprietary and open-source LLMs on 8 mental health datasets in various languages and their machine-translated counterparts, comparing zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines, and assessing translation quality across language families and typologies.

Result: Proprietary LLMs and fine-tuned open-source LLMs achieved competitive F1 scores, often surpassing state-of-the-art results on original data, but performance on MT data was generally lower with decline varying by language and typology.

Conclusion: LLMs show strengths in handling mental health tasks in non-English languages but have limitations when translation quality introduces structural or lexical mismatches, highlighting the importance of language typology in multilingual performance.

Abstract: Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.

[175] Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models

Gabriele Maraia, Marco Valentino, Fabio Massimo Zanzotto, Leonardo Ranaldi

Main category: cs.CL

TL;DR: A framework for abstraction-guided reasoning that separates structural inference from lexical semantics to reduce content effects in LLMs’ syllogistic reasoning.

Details

Motivation: LLMs struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity (content effect). This bias persists even with step-wise explanations, and reliably suppressing semantic interference remains challenging.

Method: Construct paired content-laden and abstract syllogisms, use model activations on abstract inputs to define abstract reasoning space, learn lightweight Abstractors that predict representations aligned with this space, and integrate predictions via multi-layer interventions during forward pass.

Result: Abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance, demonstrated through cross-lingual transfer experiments.

Conclusion: Activation-level abstraction serves as a scalable mechanism for enhancing robustness of formal reasoning in LLMs against semantic interference.

Abstract: Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity a phenomenon known as content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model’s internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model’s activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.

[176] From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel, Atticus Geiger, Mor Geva

Main category: cs.CL

TL;DR: MFA (Mixture of Factor Analyzers) decomposes LLM activations into Gaussian regions with local covariance structure, capturing nonlinear concepts better than linear direction methods.

Details

Motivation: Existing activation decomposition methods assume linear separability and search for individual global directions, which fails to capture concepts with nonlinear or multi-dimensional structure in language models.

Method: Use Mixture of Factor Analyzers (MFA) as a scalable unsupervised approach that models activation space as collection of Gaussian regions with local covariance structure, decomposing activations into region centroids and local variations.

Result: MFA captures complex nonlinear structures in activation space, outperforms unsupervised baselines on localization benchmarks, is competitive with supervised methods, and achieves stronger steering performance than sparse autoencoders for Llama-3.1-8B and Gemma-2-2B.

Conclusion: Local geometry expressed through subspaces is a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.

Abstract: Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region’s centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.

[177] Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models

Noam Steinmetz Yalon, Ariel Goldstein, Liad Mudrik, Mor Geva

Main category: cs.CL

TL;DR: The paper investigates whether LLMs exhibit consciousness indicators like belief-guided agency and meta-cognitive monitoring by analyzing belief dynamics in latent space.

Details

Motivation: To empirically evaluate whether large language models possess consciousness indicators, specifically testing the HOT-3 indicator from Butlin et al. (2023) which examines agency guided by belief-formation and meta-cognitive monitoring.

Method: Views beliefs as representations in the model’s latent space emerging from inputs, introduces a metric to quantify belief dominance during generation, and analyzes dynamics between competing beliefs across models and tasks.

Result: Three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives action selection, and (3) models can monitor and report their own belief states.

Conclusion: Provides empirical support for belief-guided agency and meta-cognitive monitoring in LLMs, laying methodological groundwork for investigating agency, beliefs, and meta-cognition emergence in language models.

Abstract: Rapid advancements in large language models (LLMs) have sparked the question whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model’s latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model’s action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.

[178] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

Main category: cs.CL

TL;DR: MemSkill: A learnable memory skill framework for LLM agents that evolves memory operations through a controller-executor-designer loop instead of using static hand-designed operations.

Details

Motivation: Current LLM agent memory systems rely on static, hand-designed operations for memory extraction, which are rigid under diverse interaction patterns and inefficient for long histories. This limits adaptability and performance.

Method: MemSkill reframes memory operations as learnable skills with a controller that selects relevant skills, an LLM-based executor that produces skill-guided memories, and a designer that evolves the skill set by reviewing hard cases and proposing refinements/new skills.

Result: Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld show MemSkill improves task performance over strong baselines and generalizes well across settings. The system demonstrates effective skill evolution and adaptation.

Conclusion: MemSkill provides a closed-loop framework for adaptive, self-evolving memory management in LLM agents, moving beyond static operations to learnable and evolvable memory skills that improve performance and generalization.

Abstract: Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present \textbf{MemSkill}, which reframes these operations as learnable and evolvable memory skills, structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a \emph{controller} that learns to select a small set of relevant skills, paired with an LLM-based \emph{executor} that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a \emph{designer} that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, MemSkill forms a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

[179] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen

Main category: cs.CL

TL;DR: A reinforcement learning framework to enhance LLMs’ divide-and-conquer reasoning capabilities, surpassing chain-of-thought on challenging tasks.

Details

Motivation: Chain-of-thought reasoning has limitations at model capability boundaries and lacks test-time scalability. Divide-and-conquer reasoning shows promise but suffers from misalignment between general-purpose post-training and DAC-style inference.

Method: End-to-end RL framework where the policy decomposes problems into subproblems, solves them sequentially, and addresses the original problem using subproblem solutions, with both decomposition and solution integrated into RL training.

Result: DAC-style framework achieves higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.

Conclusion: The proposed RL framework successfully bridges the gap between general-purpose training and DAC-style inference, unlocking LLMs’ reasoning capabilities on challenging tasks through enhanced divide-and-conquer reasoning.

Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model’s capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs’ reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.

[180] RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Xin Geng, Baining Guo

Main category: cs.CL

TL;DR: Re-TRAC: A cross-trajectory exploration framework for LLM-based research agents that uses structured state representations to enable iterative reflection and globally informed planning, outperforming ReAct by 15-20%.

Details

Motivation: The ReAct framework's linear design limits research agents' ability to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, leading to local optima, redundant exploration, and inefficient search.

Method: Re-TRAC generates structured state representations after each trajectory to summarize evidence, uncertainties, failures, and future plans, then conditions subsequent trajectories on this state representation to enable cross-trajectory exploration and iterative reflection.

Result: Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, Re-TRAC-aware supervised fine-tuning achieves state-of-the-art performance at comparable scales, with monotonic reduction in tool calls and token usage across rounds.

Conclusion: Re-TRAC reframes research as a progressive process through cross-trajectory exploration, enabling more efficient and targeted search by leveraging structured state representations for iterative reflection and globally informed planning.

Abstract: LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.

[181] Reward-free Alignment for Conflicting Objectives

Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

Main category: cs.CL

TL;DR: RACO: A reward-free alignment framework for multi-objective LLM alignment that uses clipped conflict-averse gradient descent to handle conflicting objectives without explicit reward models.

Details

Motivation: Real-world alignment problems often involve multiple conflicting objectives, but existing methods either use naive aggregation that leads to unstable training or rely on explicit reward models that introduce complexity and distort user preferences.

Method: Proposes RACO framework that directly uses pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent, with convergence guarantees to Pareto-critical points respecting user-specified weights.

Result: Experiments on multi-objective summarization and safety alignment tasks across Qwen 3, Llama 3, and Gemma 3 show RACO consistently achieves better Pareto trade-offs than existing multi-objective alignment baselines.

Conclusion: RACO provides an effective reward-free approach for multi-objective LLM alignment that handles conflicting objectives better than existing methods while respecting user preferences.

Abstract: Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

[182] From Lengthy to Lucid: A Systematic Literature Review on NLP Techniques for Taming Long Sentences

Tatiana Passali, Efstathios Chatzikyriakidis, Stelios Andreadis, Thanos G. Stavropoulos, Anastasia Matonaki, Anestis Fachantidis, Grigorios Tsoumakas

Main category: cs.CL

TL;DR: A systematic survey of methods for handling long sentences, focusing on sentence compression and splitting techniques, with analysis of current trends, gaps in weakly/self-supervised approaches, and limited exploration of LLMs.

Details

Motivation: Long sentences create communication barriers by making it difficult for readers to grasp main points or follow writer intentions. There's a need to systematically review existing approaches to address this persistent issue in written communication.

Method: Conducted a systematic literature review using PRISMA guidelines, categorizing methods into a comprehensive taxonomy of sentence compression and splitting techniques. Performed comparative evaluation analysis on common datasets.

Result: Identified increased research interest since 2005 with significant growth after 2017. Found current dominance of supervised approaches, considerable gaps in weakly/self-supervised techniques, and limited exploration of Large Language Models despite their potential.

Conclusion: The survey provides a comprehensive resource for researchers, highlighting opportunities for future work in weakly/self-supervised methods and LLM applications, aiming to eliminate long sentences as communication barriers.

Abstract: Long sentences have been a persistent issue in written communication for many years since they make it challenging for readers to grasp the main points or follow the initial intention of the writer. This survey, conducted using the PRISMA guidelines, systematically reviews two main strategies for addressing the issue of long sentences: a) sentence compression and b) sentence splitting. An increased trend of interest in this area has been observed since 2005, with significant growth after 2017. Current research is dominated by supervised approaches for both sentence compression and splitting. Yet, there is a considerable gap in weakly and self-supervised techniques, suggesting an opportunity for further research, especially in domains with limited data. We also observe that despite their potential, Large Language Models (LLMs) have not yet been widely explored in this area. In this survey, we categorize and group the most representative methods into a comprehensive taxonomy. We also conduct a comparative evaluation analysis of these methods on common sentence compression and splitting datasets. Finally, we discuss the challenges and limitations of current methods, providing valuable insights for future research directions. This survey is meant to serve as a comprehensive resource for addressing the complexities of long sentences. We aim to enable researchers to make further advancements in the field until long sentences are no longer a barrier to effective communication.

[183] ALiiCE: Evaluating Positional Fine-grained Citation Generation

Yilong Xu, Jinhua Gao, Xiaoming Yu, Baolong Bi, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: ALiiCE is an automatic evaluation framework for positional fine-grained citation generation in LLMs, moving beyond sentence-level citations to assess citations that can appear anywhere within sentences.

Details

Motivation: Existing citation generation research focuses on sentence-level statements, missing the importance of positional fine-grained citations that can appear anywhere within sentences. There's a need for better evaluation methods for this more granular citation task.

Method: ALiiCE uses a dependency tree approach to parse sentence-level claims into atomic claims, then evaluates citation quality using three metrics: positional fine-grained citation recall, precision, and coefficient of variation of citation positions.

Result: The framework was used to evaluate positional fine-grained citation generation performance of several LLMs on long-form QA datasets, demonstrating ALiiCE’s effectiveness and reasonableness.

Conclusion: ALiiCE provides the first automatic evaluation framework for positional fine-grained citation generation, offering insights into current advancements and future directions for this task.

Abstract: Large Language Model (LLM) can enhance its credibility and verifiability by generating text with citations. However, existing research on citation generation is predominantly limited to sentence-level statements, neglecting the significance of positional fine-grained citations that can appear anywhere within sentences. To facilitate further exploration of the positional fine-grained citation generation, we propose ALiiCE, the first automatic evaluation framework for this task. Our method employs a dependency tree based approach to parse the sentence-level claim into atomic claims. Then ALiiCE evaluates citation quality using three metrics, including positional fine-grained citation recall, precision, and coefficient of variation of citation positions. We evaluate the positional fine-grained citation generation performance of several LLMs on long-form QA datasets. Our experiments and analyses demonstrate the effectiveness and reasonableness of ALiiCE. We offer our insights into the current advancements and future directions for the positional fine-grained citation generation task.

[184] Paraphrase Types Elicit Prompt Engineering Capabilities

Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp

Main category: cs.CL

TL;DR: Systematic evaluation of how linguistic variations in prompts affect language model performance across 5 models and 120 tasks, showing specific paraphrase types (morphology and lexicon) can improve performance by 5-7%.

Details

Motivation: To understand how variations in linguistic expression of prompts affect language models, as current understanding is limited despite the importance of prompt engineering for model performance.

Method: Systematic empirical evaluation across 5 models and 120 tasks using 6 families of paraphrases (morphology, syntax, lexicon, lexico-syntax, discourse, others) while controlling for other prompt engineering factors like length, lexical diversity, and proximity to training data.

Result: Language models show potential for task improvement when prompts are adapted with specific paraphrase types (6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). Morphology and lexicon changes showed particular promise for improving prompts.

Conclusion: The findings contribute to developing more robust language models capable of handling variability in linguistic expression, with specific linguistic features (especially morphology and lexicon) showing significant impact on model performance.

Abstract: Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure behavioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic expression.

[185] LFQA-E: Carefully Benchmarking Long-form QA Evaluation

Yuchen Fan, Chen Lin, Xin Zhong, Shuo Zhang, Heng Zhou, Yuchen Zhang, Mingyu Liang, Chengxing Xie, Ermo Hua, Gang Chen, Zhizhou He, Cheng Huang, Ning Ding, Bowen Zhou

Main category: cs.CL

TL;DR: LFQA-E is a multilingual reference-based benchmark for evaluating automatic metrics in Long-Form Question Answering, showing current metrics fail to match human judgment.

Details

Motivation: Existing LFQA evaluation benchmarks lack reference answers, have limited size/topic coverage, and reduce reliability, creating a need for better evaluation methods.

Method: Created LFQA-E benchmark with 1618 questions and 7323 pairwise comparisons across 15 topics from diverse sources, then evaluated 17 automatic metrics across 5 categories against human judgments.

Result: No existing automatic metrics perform comparably to human judgments, failing to capture dense information in long-form responses. Detailed failure case analysis provided.

Conclusion: Current LFQA evaluation metrics are inadequate, highlighting need for better methods. LFQA-E benchmark enables comprehensive assessment and guides future development.

Abstract: Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format. Existing LFQA-evaluation benchmarks often lack reference answers and are limited in size and topic coverage, reducing their reliability. To address this gap, we introduce LFQA-E, a well-constructed, multilingual, and reference-based benchmark designed to rigorously evaluate automatic metrics for LFQA. LFQA-E comprises 1618 questions and 7323 pairwise comparisons across 15 topics, drawn from diverse sources such as online queries and examination questions, thereby enabling a comprehensive assessment of evaluation metrics. We examine five categories of metrics, encompassing 17 specific methods, using LFQA-E. The results demonstrate that none of the existing automatic metrics perform comparably to human judgments, highlighting their inability to capture the dense information in long-form responses. Furthermore, we present a detailed analysis of the failure cases and the generalization capacity of these metrics, offering insights to guide the future development of LFQA evaluation methods. The benchmark and code are available at https://github.com/YuchenFan48/LFQA-E.

[186] Leveraging LLMs for Translating and Classifying Mental Health Data

Konstantinos Skianis, A. Seza Doğruöz, John Pavlopoulos

Main category: cs.CL

TL;DR: GPT-3.5-turbo shows limited success in detecting depression severity from user-generated posts in English and Greek, highlighting the need for more research in low-resource languages and careful implementation with human supervision in mental health applications.

Details

Motivation: While LLMs show promise in medical applications, there's limited research on their use for mental health support in non-English languages. The study aims to address this gap by focusing on depression severity detection in Greek using translated user-generated posts.

Method: The study uses GPT-3.5-turbo to detect depression severity from user-generated posts. The approach involves analyzing posts in English and Greek (automatically translated from English) to assess the model’s performance across languages.

Result: GPT-3.5-turbo shows limited success in identifying depression severity in English, with varying performance in Greek. The results highlight challenges in cross-linguistic mental health applications.

Conclusion: Further research is needed for low-resource languages, and careful implementation with human supervision is crucial for effective LLM use in mental health platforms to avoid misdiagnosis.

Abstract: Large language models (LLMs) are increasingly used in medical fields. In mental health support, the early identification of linguistic markers associated with mental health conditions can provide valuable support to mental health professionals, and reduce long waiting times for patients. Despite the benefits of LLMs for mental health support, there is limited research on their application in mental health systems for languages other than English. Our study addresses this gap by focusing on the detection of depression severity in Greek through user-generated posts which are automatically translated from English. Our results show that GPT3.5-turbo is not very successful in identifying the severity of depression in English, and it has a varying performance in Greek as well. Our study underscores the necessity for further research, especially in languages with less resources. Also, careful implementation is necessary to ensure that LLMs are used effectively in mental health platforms, and human supervision remains crucial to avoid misdiagnosis.

[187] U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in LLMs

Konstantin Chernyshev, Vitaliy Polshkov, Ekaterina Artemova, Alex Myasnikov, Vlad Stepanov, Alexei Miasnikov, Sergei Tilga

Main category: cs.CL

TL;DR: U-MATH is a benchmark of 1,100 university-level math problems with 20% multimodal content, addressing limitations in current math evaluation for LLMs.

Details

Motivation: Current math evaluation for LLMs is limited by small benchmarks focusing on elementary/high-school problems, lacking diversity and multimodal content. Need for university-level, open-ended problems with visual elements.

Method: Created U-MATH benchmark with 1,100 unpublished open-ended university-level problems balanced across six subjects, 20% multimodal. Used LLM to judge solution correctness, releasing μ-MATH dataset for evaluating LLMs’ judgment capabilities.

Result: Leading LLMs show marked limitations in multimodal reasoning: 93.1% accuracy on textual tasks vs 58.5% on visual ones. Solution judgment is challenging, with best models achieving imperfect F1-score of 90.1%.

Conclusion: U-MATH reveals significant gaps in LLMs’ multimodal reasoning and solution judgment capabilities for university-level mathematics, highlighting need for improved multimodal understanding.

Abstract: The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of multimodal problems. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $μ$-MATH, a dataset to evaluate the LLMs’ capabilities in judging solutions. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging, requiring the most advanced models to achieve meaningfully high performance, even still peaking at an imperfect F1-score of 90.1%.

[188] Evolutionary Pre-Prompt Optimization for Mathematical Reasoning

Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

Main category: cs.CL

TL;DR: Evolutionary Pre-Prompt Optimization (EPPO) improves few-shot chain-of-thought reasoning by using evolutionary computation to optimize example selection, achieving over 10-point gains on math reasoning benchmarks.

Details

Motivation: While few-shot learning with chain-of-thought (CoT) has shown promise for complex reasoning tasks, the selection of examples for pre-prompts significantly impacts performance. Current methods lack systematic optimization for example selection.

Method: Proposes Evolutionary Pre-Prompt Optimization (EPPO), which uses evolutionary computation algorithms to optimize the selection of examples for CoT pre-prompts. This comparison-based method avoids overfitting through limited exploitation.

Result: EPPO achieves over 10 absolute point improvements in exact match scores on GSM8k and MathQA benchmarks compared to naive few-shot approaches. Gains are consistent across contexts and further amplified when combined with self-consistency.

Conclusion: Evolutionary optimization of example selection significantly enhances few-shot CoT reasoning performance, with comparison-based methods like evolutionary computation being particularly effective for designing optimal pre-prompts.

Abstract: Recent advancements have highlighted that large language models (LLMs), when given a small set of task-specific examples, demonstrate remarkable proficiency, a capability that extends to complex reasoning tasks. In particular, the combination of few-shot learning with the chain-of-thought (CoT) approach has been pivotal in steering models towards more logically consistent conclusions [Wei et al. 2022b]. This paper explores the optimization of example selection for designing effective CoT pre-prompts and shows that the choice of the optimization algorithm, typically in favor of comparison-based methods such as evolutionary computation, significantly enhances efficacy and feasibility. Specifically, thanks to a limited exploitative and overfitted optimization, Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the naive few-shot approach, exceeding 10 absolute points in exact match scores on benchmark datasets such as GSM8k and MathQA. These gains are consistent across various contexts and are further amplified when integrated with self-consistency (SC).

[189] Enhancing Human-Like Responses in Large Language Models

Ethem Yağız Çalık, Talha Rüzgar Akkuş

Main category: cs.CL

TL;DR: This paper explores techniques to make LLMs more human-like by enhancing natural language understanding, conversational coherence, and emotional intelligence through fine-tuning, psychological principles, and human reasoning patterns.

Details

Motivation: The motivation is to advance large language models to become more human-like in their interactions, improving natural language understanding, conversational coherence, and emotional intelligence for better user experiences and broader AI applications.

Method: The study evaluates various approaches including: 1) fine-tuning with diverse datasets, 2) incorporating psychological principles, and 3) designing models that better mimic human reasoning patterns.

Result: The findings demonstrate that these human-like enhancements improve user interactions and open new possibilities for AI applications across different domains.

Conclusion: Future work will address the ethical implications and potential biases introduced by these human-like attributes in AI systems.

Abstract: This paper explores the advancements in making large language models (LLMs) more human-like. We focus on techniques that enhance natural language understanding, conversational coherence, and emotional intelligence in AI systems. The study evaluates various approaches, including fine-tuning with diverse datasets, incorporating psychological principles, and designing models that better mimic human reasoning patterns. Our findings demonstrate that these enhancements not only improve user interactions but also open new possibilities for AI applications across different domains. Future work will address the ethical implications and potential biases introduced by these human-like attributes.

[190] TableMaster: A Recipe to Advance Table Understanding with Language Models

Lang Cao, Hanbing Liu

Main category: cs.CL

TL;DR: TableMaster is a framework that enhances language models for table understanding by addressing four key challenges: data location, table semantics, numerical accuracy, and reasoning flexibility.

Details

Motivation: Current language models struggle with table understanding due to the structured nature of tabular data, facing challenges in locating target data, understanding table semantics, handling numerical inaccuracies, and maintaining semantic flexibility in reasoning.

Method: TableMaster integrates multiple solutions: extracts relevant table content, verbalizes it with enriched semantic context, and introduces adaptive reasoning that dynamically adjusts between textual and symbolic reasoning based on each query.

Result: On the WikiTQ dataset, TableMaster achieves 78.13% accuracy using GPT-4o-mini, surpassing existing baselines and demonstrating effectiveness in table understanding.

Conclusion: TableMaster provides a practical framework for more robust and reliable table understanding by addressing key challenges in tabular data processing with language models.

Abstract: Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines. We hope this work will serve as a practical step toward more robust and reliable table understanding.

[191] How does a Multilingual LM Handle Multiple Languages?

Santhosh Kakarla, Gautama Shastry Bulusu Venkata, Aishwarya Gaddam, Maheedhar Sai Omtri Mohan

Main category: cs.CL

TL;DR: This paper critically examines multilingual language models (MLMs) like BLOOM 1.7B and Qwen2, assessing their capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer, particularly for low-resource languages.

Details

Motivation: While multilingual language models have advanced rapidly, their effectiveness in capturing linguistic knowledge for low-resource languages remains unclear. Traditional evaluation methods often overlook internal syntactic and semantic encoding, creating gaps in understanding MLM capabilities across diverse languages.

Method: The study uses three approaches: 1) Analyzing multilingual word embeddings for semantic consistency using cosine similarity, 2) Examining BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks, and 3) Evaluating cross-lingual knowledge transfer from high-resource to low-resource languages in sentiment analysis and text classification using linguistic probing, performance metrics, and visualizations.

Result: MLMs perform well for high-resource languages but struggle with less-represented ones. The analysis reveals limitations in capturing linguistic knowledge for low-resource languages, highlighting gaps in semantic representation and cross-lingual transfer capabilities.

Conclusion: The findings provide insights into MLM strengths and limitations, aiming to enhance multilingual NLP models for better support of both high- and low-resource languages, promoting inclusivity in language technologies through improved model evaluation and development approaches.

Abstract: Multilingual language models have significantly advanced due to rapid progress in natural language processing. Models like BLOOM 1.7B, trained on diverse multilingual datasets, aim to bridge linguistic gaps. However, their effectiveness in capturing linguistic knowledge, particularly for low-resource languages, remains an open question. This study critically examines MLMs capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. While these models perform well for high-resource languages, they struggle with less-represented ones. Additionally, traditional evaluation methods often overlook their internal syntactic and semantic encoding. This research addresses key limitations through three objectives. First, it assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. Second, it examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures. Third, it explores cross-lingual knowledge transfer by evaluating generalization from high-resource to low-resource languages in sentiment analysis and text classification. By leveraging linguistic probing, performance metrics, and visualizations, this study provides insights into the strengths and limitations of MLMs. The findings aim to enhance multilingual NLP models, ensuring better support for both high- and low-resource languages, thereby promoting inclusivity in language technologies.

[192] SCALM: Detecting Bad Practices in Smart Contracts Through LLMs

Zongwei Li, Xiaoqi Li, Wenkai Li, Xin Wang

Main category: cs.CL

TL;DR: SCALM: An LLM-based framework using Step-Back Prompting and RAG to identify and address bad practices in Ethereum smart contracts, outperforming existing tools in detection.

Details

Motivation: As Ethereum gains widespread usage, maintaining high standards of smart contract writing practices is crucial. While bad practices may not directly cause security issues, they elevate risk. There's a need for systematic understanding and avoidance of these practices.

Method: Proposes SCALM framework combining Step-Back Prompting and Retrieval-Augmented Generation (RAG) with large language models to identify and address over 35 specific bad practices in smart contracts.

Result: Extensive experiments with multiple LLMs and datasets show SCALM outperforms existing tools in detecting bad practices in smart contracts.

Conclusion: SCALM provides an effective LLM-based approach for identifying and addressing bad practices in smart contracts, contributing to better development practices on the Ethereum platform.

Abstract: As the Ethereum platform continues to mature and gain widespread usage, it is crucial to maintain high standards of smart contract writing practices. While bad practices in smart contracts may not directly lead to security issues, they do elevate the risk of encountering problems. Therefore, to understand and avoid these bad practices, this paper introduces the first systematic study of bad practices in smart contracts, delving into over 35 specific issues. Specifically, we propose a large language models (LLMs)-based framework, SCALM. It combines Step-Back Prompting and Retrieval-Augmented Generation (RAG) to identify and address various bad practices effectively. Our extensive experiments using multiple LLMs and datasets have shown that SCALM outperforms existing tools in detecting bad practices in smart contracts.

[193] Large Multimodal Models for Low-Resource Languages: A Survey

Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

Main category: cs.CL

TL;DR: Survey paper analyzing techniques for adapting large multimodal models to low-resource languages, covering visual enhancement, data creation, cross-modal transfer, and fusion strategies across 117 studies in 96 languages.

Details

Motivation: To systematically analyze and categorize approaches for adapting large multimodal models (LMMs) to low-resource languages, addressing challenges of limited data and computational resources in multimodal AI research.

Method: Comprehensive analysis of 117 studies across 96 low-resource languages, categorizing works into resource-oriented and method-oriented contributions with relevant sub-categories, comparing performance and efficiency of different approaches.

Result: Identified key patterns in adaptation techniques, found visual information serves as crucial bridge for improving model performance in low-resource settings, but challenges remain in hallucination mitigation and computational efficiency.

Conclusion: Provides researchers with clear understanding of current approaches and remaining challenges in making LMMs accessible to speakers of low-resource languages, with open-source repository for continued research.

Abstract: In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 117 studies across 96 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We categorize works into resource-oriented and method-oriented contributions, further dividing contributions into relevant sub-categories. We compare method-oriented contributions in terms of performance and efficiency, discussing benefits and limitations of representative studies. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. In summary, we provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.

[194] Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, Kun Kuang

Main category: cs.CL

TL;DR: RESM: A novel model merging method using reweighting and sparsity-aware strategies to achieve balanced 3H (Helpfulness, Honesty, Harmlessness) alignment in LLMs, outperforming both data mixture and existing merging approaches.

Details

Motivation: Existing methods for 3H alignment (Helpfulness, Honesty, Harmlessness) face limitations: data mixture strategies rely heavily on expert knowledge and suffer from conflicting optimization signals, while model merging's potential for 3H optimization remains underexplored despite offering parameter-level conflict resolution.

Method: Proposes RESM (Reweighting Enhanced task Singular Merging) with two key strategies: 1) outlier weighting to address preference noise accumulation, and 2) sparsity-aware rank selection for layer sparsity adaptation in 3H-aligned LLM merging. Systematically compares model merging vs data mixture methods for 3H alignment.

Result: RESM achieves 2%-5% gain over data mixture methods and 1%-3% gain over previous model merging methods in balanced 3H alignment. The study reveals previously overlooked collaborative and conflict relationships among the 3H dimensions.

Conclusion: Model merging offers effective parameter-level conflict resolution for 3H alignment, with RESM demonstrating superior performance through innovative reweighting and sparsity adaptation techniques. The work provides insights into trade-offs between data-level and parameter-level approaches.

Abstract: Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models’ parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (\textit{data-level}) and model merging (\textit{parameter-level}) methods in mitigating the conflict for balanced 3H optimization. Specially, we propose a novel \textbf{R}eweighting \textbf{E}nhanced task \textbf{S}ingular \textbf{M}erging method, \textbf{RESM}, through outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations can verify the effectiveness and robustness of RESM compared to previous data mixture (2%-5% gain) and model merging (1%-3% gain) methods in achieving balanced LLM alignment. We release our models through \href{https://huggingface.co/Jinluan}{3H_Merging} for further investigations.

[195] GENERator: A Long-Context Generative Genomic Foundation Model

Wei Wu, Qiuyi Li, Yuanyuan Zhang, Zhihao Zhan, Ruipu Chen, Mingyang Li, Kun Fu, Junyan Qi, Yongzhou Bao, Chao Wang, Yiheng Zhu, Zhiyun Zhang, Jian Tang, Fuli Feng, Jieping Ye, Yuwen Liu, Hui Xiong, Zheng Wang

Main category: cs.CL

TL;DR: GENErator is a generative genomic foundation model for long-context DNA modeling with 98k nucleotide context length, pre-trained on 386B nucleotides, showing strong intrinsic capabilities for genomic analysis and sequence design.

Details

Motivation: DNA sequencing has produced vast genomic datasets, but interpreting and engineering genomic function remains challenging. Existing language models for genomics have limitations in training scope, generative capability, or computational cost.

Method: Developed GENErator, a generative genomic foundation model with 98k nucleotide context length, pre-trained on 386 billion nucleotides of eukaryotic DNA. Uses zero-shot and fine-tuning approaches for various genomic tasks.

Result: Shows strong intrinsic capabilities without fine-tuning: phylogenetically coherent embeddings, generative accuracy comparable to SOTA with better efficiency. Zero-shot variant effect prediction competitive with alignment-based methods. Fine-tuned model achieves leading benchmark performance. Can generate protein-coding DNA sequences and design cis-regulatory elements with targeted activity profiles.

Conclusion: GENErator establishes an efficient and biologically grounded framework for genomic interpretation and programmable sequence design, bridging genomic analysis and generative applications.

Abstract: The rapid advancement of DNA sequencing has produced vast genomic datasets, yet interpreting and engineering genomic function remain fundamental challenges. Recent large language models have opened new avenues for genomic analysis, but existing approaches are often limited by restricted training scope, constrained generative capability, or prohibitive computational cost. We introduce GENErator, a generative genomic foundation model for long-context DNA modeling, with a context length of 98k nucleotides, pre-trained on 386 billion nucleotides of eukaryotic DNA. Without task-specific fine-tuning, GENERator exhibits strong intrinsic capabilities: unsupervised embedding analyses reveal phylogenetically coherent structure, and sequence recovery benchmarks demonstrate generative accuracy comparable to or exceeding state-of-the-art models with substantially improved computational efficiency. In a zero-shot setting, GENERator achieves competitive variant effect prediction performance relative to alignment-based methods, while remaining fully alignment-free and broadly applicable across species. With task-specific fine-tuning, the model attains leading performance on established genomic benchmarks. We further demonstrate practical generative applications. GENERator can generate protein-coding DNA sequences that translate into structurally plausible proteins and, through a prompt-guided design framework, design cis-regulatory elements with targeted activity profiles, including synthetic super-enhancers validated by high-throughput UMI-STARR-seq assays. Together, these results establish GENERator as an efficient and biologically grounded framework for genomic interpretation and programmable sequence design. Code and supplementary resources are available at https://github.com/GenerTeam/GENERator.

[196] Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving

Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu, Lu Yin

Main category: cs.CL

TL;DR: TATA is an adaptive framework that enables LLMs to autonomously choose between Chain-of-Thought and Tool-Integrated Reasoning based on their intrinsic capabilities for mathematical reasoning tasks.

Details

Motivation: Current approaches to mathematical reasoning with LLMs use either Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation, but they rely on predefined strategies rather than allowing LLMs to adapt based on their own capabilities.

Method: TATA uses base-LLM-aware data selection during supervised fine-tuning to tailor training data to each model’s unique abilities, enabling LLMs to autonomously determine and apply appropriate reasoning strategies (CoT or TIR) at test time.

Result: TATA achieves superior or comparable performance on six mathematical reasoning benchmarks while improving inference efficiency compared to TIR alone, effectively combining complementary strengths of CoT and TIR.

Conclusion: The framework demonstrates that LLMs can autonomously adapt reasoning strategies when trained with aptitude-aware data selection, aligning strategies with model capabilities for more effective mathematical reasoning.

Abstract: Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model’s unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.

[197] How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination

Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš

Main category: cs.CL

TL;DR: Multilingual study of LLM hallucination in knowledge-intensive long-form QA across 30 languages, showing hallucination rates are uncorrelated with language resource size but higher in models with broader language support.

Details

Motivation: Most hallucination research is English-centric and focuses on machine translation/summarization, but realistic settings involve open information seeking. Need to quantify LLM hallucination across languages in knowledge-intensive long-form QA.

Method: Trained multilingual hallucination detection model using MT-translated English dataset, manually annotated gold data for 5 high-resource languages, built open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia references.

Result: LLMs hallucinate more tokens in high-resource languages due to longer responses, but hallucination rates (normalized for length) are uncorrelated with language digital footprint size. Smaller LLMs hallucinate more, and models with broader language support have higher hallucination rates.

Conclusion: Multilingual hallucination patterns differ from English-centric assumptions; broader language support correlates with higher hallucination rates, highlighting need for multilingual evaluation in realistic QA settings.

Abstract: In the age of misinformation, hallucination - the tendency of Large Language Models (LLMs) to generate non-factual or unfaithful responses - represents the main risk for their global utility. Despite LLMs becoming increasingly multilingual, the vast majority of research on detecting and quantifying LLM hallucination are (a) English-centric and (b) focus on machine translation (MT) and summarization, tasks that are less common in realistic settings than open information seeking. In contrast, we aim to quantify the extent of LLM hallucination across languages in knowledge-intensive long-form question answering (LFQA). To this end, we train a multilingual hallucination detection model and conduct a large-scale study across 30 languages and 6 open-source LLM families. We start from an English hallucination detection dataset and rely on MT to translate-train a detection model. We also manually annotate gold data for five high-resource languages; we then demonstrate, for these languages, that the estimates of hallucination rates are similar between silver (LLM-generated) and gold test sets, validating the use of silver data for estimating hallucination rates for other languages. For the final rates estimation, we build open-domain QA dataset for 30 languages with LLM-generated prompts and Wikipedia articles as references. Our analysis shows that LLMs, in absolute terms, hallucinate more tokens in high-resource languages due to longer responses, but that the actual hallucination rates (i.e., normalized for length) seems uncorrelated with the sizes of languages’ digital footprints. We also find that smaller LLMs hallucinate more, and significantly, LLMs with broader language support display higher hallucination rates.

[198] LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning

Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang

Main category: cs.CL

TL;DR: LIFT is a framework that fine-tunes short-context LLMs to handle long inputs by adapting model parameters to specific long contexts, avoiding quadratic complexity of traditional long-context models.

Details

Motivation: Long context understanding is challenging for LLMs due to limited context windows. Current approaches either extend context windows (quadratic complexity) or rely on retrieval, but LIFT aims to store long input information directly in model parameters through fine-tuning.

Method: LIFT dynamically adapts short-context LLM parameters to given long inputs through fine-tuning, using LLM-generated synthetic tasks to enhance comprehension beyond memorization. Includes optimized pipeline reducing Time to First Token to <10 seconds for 8k context.

Result: Enables short-context LLMs to answer questions without requiring the original long context during inference, avoiding quadratic complexity. Provides comprehensive analysis of strengths/limitations and discusses feasibility for real-world deployment.

Conclusion: LIFT offers a novel approach to long-context modeling by storing long input information in model parameters rather than extending context windows, with potential for practical deployment and future research directions.

Abstract: Long context understanding remains challenging for large language models due to their limited context windows. This paper introduces Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can enhance the long-context performance of arbitrary short-context LLMs by dynamically adapting their parameters to the given long input. Importantly, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, LIFT stores and absorbs the long input in parameters. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference, avoiding the quadratic complexity w.r.t. input length of a normal long context model. Furthermore, LIFT does not simply perform continued pretraining on new, long contexts, but leverages carefully designed LLM-generated synthetic tasks to enhance the comprehension of long contexts, moving beyond mere memorization. To accommodate the additional cost of fine-tuning, we design a highly optimized pipeline that reduces the Time to First Token (TTFT) to less than 10 seconds for 8k context. We further provide a comprehensive analysis of LIFT’s strengths and limitations in long-context understanding, discuss its feasibility for large-scale real-world deployment, and highlight valuable directions for future research.

[199] Reassessing Active Learning Adoption in Contemporary NLP: A Community Survey

Julia Romberg, Christopher Schröder, Julius Gonsior, Katrin Tomanek, Fredrik Olsson

Main category: cs.CL

TL;DR: Survey of NLP community on active learning practices, challenges, and future prospects in the era of large language models

Details

Motivation: Active learning reduces annotation costs but its real-world adoption and impact of LLMs on its practice remain unclear; need to understand current implementation practices, obstacles, and future prospects

Method: Conducted online survey in NLP community to collect insights on active learning practices, challenges, and future prospects; reassessed relevance of data annotation and active learning

Result: Data annotation expected to remain important; active learning to stay relevant with LLM benefits; three persistent challenges: setup complexity, uncertain cost reduction, and tooling issues

Conclusion: Active learning remains relevant in LLM era but faces persistent adoption barriers; proposed strategies to alleviate key challenges; published anonymized dataset

Abstract: Supervised learning relies on data annotation which usually is time-consuming and therefore expensive. A longstanding strategy to reduce annotation costs is active learning, an iterative process, in which a human annotates only data instances deemed informative by a model. Research in active learning has made considerable progress, especially with the rise of large language models (LLMs). However, we still know little about how these remarkable advances have translated into real-world applications, or contributed to removing key barriers to active learning adoption. To fill in this gap, we conduct an online survey in the NLP community to collect previously intangible insights on current implementation practices, common obstacles in application, and future prospects in active learning. We also reassess the perceived relevance of data annotation and active learning as fundamental assumptions. Our findings show that data annotation is expected to remain important and active learning to stay relevant while benefiting from LLMs. Consistent with a community survey from over 15 years ago, three key challenges yet persist – setup complexity, uncertain cost reduction, and tooling – for which we propose alleviation strategies. We publish an anonymized version of the dataset.

[200] A Foundational individual Mobility Prediction Model based on Open-Source Large Language Models

Zhenlin Qin, Leizhen Wang, Francisco Camara Pereira, Zhenliang Ma

Main category: cs.CL

TL;DR: A unified fine-tuning framework for training foundational open-source LLM-based mobility prediction models that adapts to different cities and user contexts

Details

Motivation: Current LLM-based mobility prediction models are limited by being trained on specific datasets or using single prompts, making them difficult to adapt to different cities and diverse user contexts

Method: Proposes a unified fine-tuning framework to train a foundational open-source LLM-based mobility prediction model, validated through extensive experiments on six real-world mobility datasets

Result: The proposed model achieved the best performance in prediction accuracy and transferability over state-of-the-art models based on deep learning and LLMs

Conclusion: The unified framework successfully addresses adaptation challenges in LLM-based mobility prediction, demonstrating superior performance and transferability across different contexts

Abstract: Large Language Models (LLMs) are widely applied to domain-specific tasks due to their massive general knowledge and remarkable inference capacities. Current studies on LLMs have shown immense potential in applying LLMs to model individual mobility prediction problems. However, most LLM-based mobility prediction models only train on specific datasets or use single well-designed prompts, leading to difficulty in adapting to different cities and users with diverse contexts. To fill these gaps, this paper proposes a unified fine-tuning framework to train a foundational open source LLM-based mobility prediction model. We conducted extensive experiments on six real-world mobility datasets to validate the proposed model. The results showed that the proposed model achieved the best performance in prediction accuracy and transferability over state-of-the-art models based on deep learning and LLMs.

[201] CASE – Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement

Gaifan Zhang, Yi Zhou, Danushka Bollegala

Main category: cs.CL

TL;DR: CASE (Condition-Aware Sentence Embeddings) is a method that creates context-aware sentence embeddings by using LLMs to generate condition embeddings and aligning them with supervised projection for improved conditional semantic textual similarity.

Details

Motivation: Sentence meaning depends on context, but current sentence embedding methods lack effective ways to modify embeddings based on contextual conditions. The paper aims to address how to best create sentence embeddings that are conditioned on specific contexts.

Method: 1) Uses LLM encoder to create condition embeddings where sentence influences attention scores during pooling; 2) Learns supervised alignment method to map LLM-based text embeddings to Conditional Semantic Textual Similarity (C-STS) task; 3) Employs condition embedding subtraction to improve isotropy of embedding space; 4) Uses supervised projection method requiring few embedding dimensions.

Result: Subtracting condition embeddings consistently improves C-STS performance of LLM-based embeddings by enhancing isotropy. Supervised projection significantly boosts performance despite requiring minimal embedding dimensions.

Conclusion: CASE provides an efficient and accurate method for creating context-aware sentence embeddings that outperform existing approaches on conditional semantic similarity tasks through careful conditioning and supervised alignment techniques.

Abstract: The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear as how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM) encoder, where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised method is learnt to align the LLM-based text embeddings with the Conditional Semantic Textual Similarity (C-STS) task. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings by improving the isotropy of the embedding space. Moreover, our supervised projection method significantly improves the performance of LLM-based embeddings despite requiring a small number of embedding dimensions.

[202] Training a Utility-based Retriever Through Shared Context Attribution for Retrieval-Augmented Language Models

Yilong Xu, Jinhua Gao, Xiaoming Yu, Yuanhai Xue, Baolong Bi, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: SCARLet trains utility-based retrievers for RALMs using multi-task generalization and inter-passage interaction to better capture passage utility beyond semantic relevance.

Details

Motivation: Current retrievers in Retrieval-Augmented Language Models focus mainly on semantic relevance, which may not effectively support downstream generation tasks. Utility-based retrieval that prioritizes passages providing valid benefits for tasks is promising but under-explored due to insufficient understanding of how to accurately capture passage utility.

Method: SCARLet framework incorporates two key factors: 1) Multi-task generalization by constructing shared context with synthesized training data across various tasks to mitigate semantic bias and focus on task-specific utility, and 2) Inter-passage interaction using perturbation-based attribution method to estimate passage-level utility that reflects interactions between passages for more accurate feedback.

Result: Evaluation on ten datasets across various tasks (both in-domain and out-of-domain) shows that retrievers trained by SCARLet consistently improve the overall performance of Retrieval-Augmented Language Models.

Conclusion: SCARLet effectively trains utility-based retrievers that go beyond semantic relevance to better support downstream tasks in RALMs through multi-task generalization and inter-passage interaction modeling.

Abstract: Retrieval-Augmented Language Models boost task performance, owing to the retriever that provides external knowledge. Although crucial, the retriever primarily focuses on semantics relevance, which may not always be effective for generation. Thus, utility-based retrieval has emerged as a promising topic, prioritizing passages that provide valid benefits for downstream tasks. However, due to insufficient understanding, capturing passage utility accurately remains unexplored. This work proposes SCARLet, a framework for training utility-based retrievers in RALMs, which incorporates two key factors, multi-task generalization and inter-passage interaction. First, SCARLet constructs shared context on which training data for various tasks is synthesized. This mitigates semantic bias from context differences, allowing retrievers to focus on learning task-specific utility and generalize across tasks. Next, SCARLet uses a perturbation-based attribution method to estimate passage-level utility for shared context, which reflects interactions between passages and provides more accurate feedback. We evaluate our approach on ten datasets across various tasks, both in-domain and out-of-domain, showing that retrievers trained by SCARLet consistently improve the overall performance of RALMs.

[203] LLMs as Span Annotators: A Comparative Study of LLMs and Humans

Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu

Main category: cs.CL

TL;DR: LLMs can perform span annotation tasks at similar error rates to skilled human annotators but with lower cost, though they have only moderate agreement with humans.

Details

Motivation: Span annotation is valuable for text evaluation where single-score metrics fail, but traditionally requires human annotators or fine-tuned models. The paper investigates whether LLMs can serve as a cost-effective alternative to human annotators for span annotation tasks.

Method: The study compares LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. They analyze inter-annotator agreement (IAA) between LLMs and humans, error rates, and cost efficiency.

Result: LLMs show only moderate inter-annotator agreement with human annotators overall. However, LLMs make errors at a similar rate as skilled crowdworkers. LLMs produce annotations at a fraction of the cost per output annotation compared to human annotation.

Conclusion: LLMs can serve as a cost-effective alternative to human annotators for span annotation tasks, though with moderate agreement levels. The released dataset of over 40k model and human span annotations enables further research in this area.

Abstract: Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.

[204] Free Access to World News: Reconstructing Full-Text Articles from GDELT

A. Fronzetti Colladon, R. Vestrelli

Main category: cs.CL

TL;DR: A Python package (gdeltnews) reconstructs full-text newspaper articles from GDELT n-grams at near-zero cost, achieving up to 95% similarity to original articles.

Details

Motivation: Access to full-text news corpora is challenging due to high costs and limited free alternatives, hindering research in economics, social science, information science, and NLP.

Method: Developed a Python package that merges overlapping n-grams from the GDELT Web News NGrams 3.0 dataset to reconstruct complete newspaper articles.

Result: Validated on 2211 articles from major U.S. news outlets, achieving up to 95% text similarity using Levenshtein and SequenceMatcher metrics.

Conclusion: The tool enables free, large-scale access to full-text news data for various research applications including economic forecasting, computational social science, and NLP.

Abstract: News data have become essential resources across various disciplines. Still, access to full-text news corpora remains challenging due to high costs and the limited availability of free alternatives. This paper presents a novel Python package (gdeltnews) that reconstructs full-text newspaper articles at near-zero cost by leveraging the Global Database of Events, Language, and Tone (GDELT) Web News NGrams 3.0 dataset. Our method merges overlapping n-grams extracted from global online news to rebuild complete articles. We validate the approach on a benchmark set of 2211 articles from major U.S. news outlets, achieving up to 95% text similarity against original articles based on Levenshtein and SequenceMatcher metrics. Our tool facilitates economic forecasting, computational social science, information science, and natural language processing applications by enabling free and large-scale access to full-text news data.

[205] APE-Bench: Evaluating Automated Proof Engineering for Formal Math Libraries

Huajian Xin, Luming Li, Xiaoran Jin, Jacques Fleuriot, Wenda Li

Main category: cs.CL

TL;DR: APE introduces the first systematic framework for evaluating repository-scale proof engineering in formal mathematics, moving beyond isolated theorem proving to assess multi-file coordination and semantic correctness in real library environments.

Details

Motivation: Existing evaluation benchmarks for formal mathematics systems focus only on isolated theorem proving, but modern systems now develop repository-scale proof engineering artifacts requiring multi-file coordination and semantic correctness beyond compilation. There's a need for systematic evaluation of these larger-scale proof engineering capabilities.

Method: APE introduces a dual verification framework that validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. It includes APE-Bench (automatically extracting proof engineering tasks from real library commit histories) and APE-Harness (a unified execution framework based on task contract abstraction for standardized evaluation across diverse formal mathematics tasks).

Result: The framework enables fair systematic comparison of different agent implementations (including their APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. All code and benchmark dataset are released as open-source.

Conclusion: APE provides the first systematic framework for evaluating repository-scale proof engineering, addressing the gap between current isolated theorem proving benchmarks and the complex, multi-file coordination required in modern formal mathematics systems.

Abstract: While frontier formal mathematics systems now routinely develop repository-scale proof engineering artifacts requiring multi-file coordination and semantic correctness beyond compilation, existing evaluation benchmarks remain focused on isolated theorem proving. We introduce Automated Proof Engineering (APE), the first systematic framework for evaluating repository-scale proof engineering through dual verification that validates both syntactic compilation and semantic requirement satisfaction in pinned library environments. We present a complete infrastructure comprising APE-Bench, which automatically extracts proof engineering tasks from real library commit histories, and APE-Harness, a unified execution framework based on task contract abstraction. This contract-based design enables standardized evaluation across diverse formal mathematics tasks and fair systematic comparison of different agent implementations (including our APE-Agent reference scaffold alongside Claude Code and Codex CLI) on identical task specifications. We demonstrate the framework’s effectiveness through comprehensive evaluation. All code and benchmark dataset are released as open-source at https://github.com/xinhjBrant/APE-Bench.

[206] Conflicts in Texts: Data, Implications and Challenges

Siyi Liu, Dan Roth

Main category: cs.CL

TL;DR: Survey paper analyzing conflicts in NLP systems across three areas: natural web texts, human-annotated data, and model interactions, with mitigation strategies for conflict-aware systems.

Details

Motivation: As NLP models are integrated into real-world applications, conflicts in information become critical issues that can undermine model reliability and trustworthiness. Conflicts arise from factual inconsistencies, subjective biases, annotation disagreements, and model hallucinations.

Method: Survey methodology categorizing conflicts into three key areas: (1) natural texts on the web with factual inconsistencies and multiple perspectives, (2) human-annotated data with annotator disagreements and biases, and (3) model interactions with hallucinations and knowledge conflicts.

Result: Unified framework for understanding conflicting information in NLP, analysis of implications across different conflict types, and discussion of mitigation strategies for developing conflict-aware systems.

Conclusion: Need for conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively, with identified challenges and future research directions for improving model reliability.

Abstract: As NLP models become increasingly integrated into real-world applications, it becomes clear that there is a need to address the fact that models often rely on and generate conflicting information. Conflicts could reflect the complexity of situations, changes that need to be explained and dealt with, difficulties in data annotation, and mistakes in generated outputs. In all cases, disregarding the conflicts in data could result in undesired behaviors of models and undermine NLP models’ reliability and trustworthiness. This survey categorizes these conflicts into three key areas: (1) natural texts on the web, where factual inconsistencies, subjective biases, and multiple perspectives introduce contradictions; (2) human-annotated data, where annotator disagreements, mistakes, and societal biases impact model training; and (3) model interactions, where hallucinations and knowledge conflicts emerge during deployment. While prior work has addressed some of these conflicts in isolation, we unify them under the broader concept of conflicting information, analyze their implications, and discuss mitigation strategies. We highlight key challenges and future directions for developing conflict-aware NLP systems that can reason over and reconcile conflicting information more effectively.

[207] Mobile-Bench-v2: A More Realistic and Comprehensive Benchmark for VLM-based Mobile Agents

Weikai Xu, Zhizheng Jiang, Yuxuan Liu, Pengzhi Gao, Wei Liu, Jian Luan, Yuanchun Li, Yunxin Liu, Bin Wang, Bo An

Main category: cs.CL

TL;DR: Mobile-Bench-v2: A comprehensive benchmark for evaluating mobile GUI agents with multi-path evaluation, noisy environments, and proactive interaction capabilities

Details

Motivation: Existing mobile agent benchmarks have limitations: online benchmarks struggle with dynamic environments, offline benchmarks use single-path evaluation despite multi-solution GUI tasks, and both fail to assess noise handling or proactive interactions due to lack of noisy apps or overly detailed instructions.

Method: Uses slot-based instruction generation to create Mobile-Bench-v2 with four splits: 1) common task split with offline multi-path evaluation, 2) noisy split with pop-ups and ads, 3) contaminated split (AITZ-Noise) for real noisy environments, and 4) ambiguous instruction split with Q&A interactions for proactive interaction assessment.

Result: Benchmark includes comprehensive evaluation of mobile agents like AppAgent-v1, Mobile-Agent-v2, UI-Tars, and OS-Atlas across different splits to assess step rewards, noise handling, and proactive interaction capabilities.

Conclusion: Mobile-Bench-v2 provides a more realistic and comprehensive benchmark for evaluating mobile GUI agents, addressing limitations of existing benchmarks and enabling better assessment of agents’ capabilities in handling complex, noisy, and interactive mobile environments.

Abstract: VLM-based mobile agents are increasingly popular due to their capabilities to interact with smartphone GUIs and XML-structured texts and to complete daily tasks. However, existing online benchmarks struggle with obtaining stable reward signals due to dynamic environmental changes. Offline benchmarks evaluate the agents through single-path trajectories, which stands in contrast to the inherently multi-solution characteristics of GUI tasks. Additionally, both types of benchmarks fail to assess whether mobile agents can handle noise or engage in proactive interactions due to a lack of noisy apps or overly full instructions during the evaluation process. To address these limitations, we use a slot-based instruction generation method to construct a more realistic and comprehensive benchmark named Mobile-Bench-v2. Mobile-Bench-v2 includes a common task split, with offline multi-path evaluation to assess the agent’s ability to obtain step rewards during task execution. It contains a noisy split based on pop-ups and ads apps, and a contaminated split named AITZ-Noise to formulate a real noisy environment. Furthermore, an ambiguous instruction split with preset Q&A interactions is released to evaluate the agent’s proactive interaction capabilities. We conduct evaluations on these splits using the single-agent framework AppAgent-v1, the multi-agent framework Mobile-Agent-v2, as well as other mobile agents such as UI-Tars and OS-Atlas. Code and data are available at https://huggingface.co/datasets/xwk123/MobileBench-v2.

[208] Code-Mixed Phonetic Perturbations for Red-Teaming LLMs

Darpan Aswal, Siddharth D Jaiswal

Main category: cs.CL

TL;DR: CMP-RT is a novel red-teaming probe that combines code-mixing with phonetic perturbations to expose tokenizer-level safety vulnerabilities in transformers, allowing harmful prompts to bypass alignment mechanisms while maintaining interpretability.

Details

Motivation: Despite sophisticated safety alignment techniques, LLMs remain unsafe. Recent red-teaming focuses on incremental attack success rather than identifying underlying architectural vulnerabilities, particularly at the tokenizer level.

Method: CMP-RT combines code-mixing with phonetic perturbations, preserving phonetics while perturbing safety-critical tokens. It uses realistic elements from digital communication like code-mixing and textese to bypass alignment mechanisms.

Result: CMP-RT demonstrates robustness against standard defenses, attack scalability, and generalization across modalities and SOTA models like Gemini-3-Pro, establishing it as a major threat model.

Conclusion: Tokenization is an under-examined vulnerability in current safety pipelines, and CMP-RT exposes a gap between pre-training and safety alignment in multimodal LLMs.

Abstract: Large language models (LLMs) continue to be demonstrably unsafe despite sophisticated safety alignment techniques and multilingual red-teaming. However, recent red-teaming work has focused on incremental gains in attack success over identifying underlying architectural vulnerabilities in models. In this work, we present \textbf{CMP-RT}, a novel red-teaming probe that combines code-mixing with phonetic perturbations (CMP), exposing a tokenizer-level safety vulnerability in transformers. Combining realistic elements from digital communication such as code-mixing and textese, CMP-RT preserves phonetics while perturbing safety-critical tokens, allowing harmful prompts to bypass alignment mechanisms while maintaining high prompt interpretability, exposing a gap between pre-training and safety alignment. Our results demonstrate robustness against standard defenses, attack scalability, and generalization of the vulnerability across modalities and to SOTA models like Gemini-3-Pro, establishing CMP-RT as a major threat model and highlighting tokenization as an under-examined vulnerability in current safety pipelines.

[209] RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection

Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Xuming Hu, Yi R. Fung, Xinlei He

Main category: cs.CL

TL;DR: RePPL is a method that recalibrates uncertainty measurement for hallucination detection in LLMs by analyzing semantic propagation and language generation, providing token-level explanations.

Details

Motivation: Hallucinations remain a major obstacle to trustworthy LLM use. While previous uncertainty-based detection methods exist, they cannot explain why hallucinations occur or identify which input parts trigger them. The paper aims to provide explainable hallucination detection.

Method: RePPL recalibrates uncertainty measurement by analyzing two aspects: 1) uncertainty in semantic propagation (attention mechanisms fusing token information across layers), and 2) uncertainty in language generation (probability-based selection of semantics). It dispatches explainable uncertainty scores to each token and aggregates them in Perplexity-style Log-Average form.

Result: Achieves best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833). Capable of producing token-level uncertainty scores as explanations for hallucinations.

Conclusion: RePPL provides an effective, explainable approach to hallucination detection by recalibrating uncertainty measurement through semantic propagation and generation analysis, offering both detection performance and interpretability.

Abstract: Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. Previous works improved the capability of hallucination detection by measuring uncertainty. But they can not explain the provenance behind why hallucinations occur, particularly in identifying which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as a total score. Experiments show that it achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and it is capable of producing token-level uncertainty scores as explanations of hallucination.

[210] Reverse Engineering Human Preferences with Reinforcement Learning

Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo

Main category: cs.CL

TL;DR: Adversarial tuning of preamble generators to boost LLM evaluation scores via reinforcement learning, making attacks undetectable compared to direct response editing.

Details

Motivation: LLM-as-a-judge evaluation frameworks are scalable but vulnerable to adversarial exploitation where responses can be tuned to overfit judge preferences. Current methods edit responses post-hoc, but this study explores a more subtle approach using preamble optimization.

Method: Use judge-LLM signals as rewards to adversarially tune models that generate text preambles designed to boost downstream performance. Frozen LLMs are pipelined with these preamble generators via reinforcement learning to optimize upstream preambles.

Result: Frozen LLMs pipelined with tuned preamble generators attain higher LLM-evaluation scores than existing frameworks. The method is virtually undetectable and transfers effectively when candidate-LLM and judge-LLM are replaced with models not used during training.

Conclusion: The findings raise important questions about reliable LLM-as-a-judge evaluation design and demonstrate that human preferences can be reverse engineered effectively through preamble optimization, with potential applications beyond adversarial attacks.

Abstract: The capabilities of Large Language Models (LLMs) are routinely evaluated by other LLMs trained to predict human preferences. This framework–known as LLM-as-a-judge–is highly scalable and relatively low cost. However, it is also vulnerable to malicious exploitation, as LLM responses can be tuned to overfit the preferences of the judge. Previous work shows that the answers generated by a candidate-LLM can be edited post hoc to maximise the score assigned to them by a judge-LLM. In this study, we adopt a different approach and use the signal provided by judge-LLMs as a reward to adversarially tune models that generate text preambles designed to boost downstream performance. We find that frozen LLMs pipelined with these models attain higher LLM-evaluation scores than existing frameworks. Crucially, unlike other frameworks which intervene directly on the model’s response, our method is virtually undetectable. We also demonstrate that the effectiveness of the tuned preamble generator transfers when the candidate-LLM and the judge-LLM are replaced with models that are not used during training. These findings raise important questions about the design of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that human preferences can be reverse engineered effectively, by pipelining LLMs to optimise upstream preambles via reinforcement learning–an approach that could find future applications in diverse tasks and domains beyond adversarial attacks.

[211] RLKD: Distilling LLMs’ Reasoning via Reinforcement Learning

Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: RLKD: Reinforcement learning-based knowledge distillation framework using Generative Structure Reward Model to transfer teacher LLMs’ implicit multi-branch reasoning structure to student models, outperforming standard SFT-RL pipelines.

Details

Motivation: Standard supervised fine-tuning for knowledge distillation collapses the teacher's authentic multi-branch reasoning structure into flat token sequences, preventing effective transfer of reasoning capabilities to smaller student models.

Method: Proposes RLKD framework with Generative Structure Reward Model (GSRM) that converts reasoning paths into meta-reasoning-solving steps and computes structural alignment rewards, combined with reinforcement learning to distill reasoning structure.

Result: RLKD surpasses standard SFT-RL pipelines even when trained on only 0.1% of data under RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Conclusion: The proposed RLKD framework effectively transfers teacher LLMs’ implicit multi-branch reasoning structure to student models through structural alignment rewards and reinforcement learning, significantly improving reasoning distillation.

Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher’s reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher’s implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation. Code is available at https://github.com/xsc1234/RLKD.

[212] Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Zinuo Li, Xian Zhang, Yongxin Guo, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi, Luqi Gong, Qiuhong Ke

Main category: cs.CL

TL;DR: TriSense is a triple-modality large language model for holistic video temporal understanding that integrates visual, audio, and speech modalities using a Query-Based Connector for adaptive modality fusion.

Details

Motivation: Existing models struggle to effectively fuse and interpret audio information in videos, limiting comprehensive video temporal understanding. Humans naturally integrate visual and auditory cues, but current approaches lack robust multimodal fusion capabilities.

Method: TriSense uses a Query-Based Connector that adaptively reweights modality contributions based on input queries, enabling robust performance under modality dropout. The model is trained on TriSense-2M, a dataset of over 2 million curated samples generated via automated pipeline with fine-tuned LLMs.

Result: Extensive experiments across multiple benchmarks demonstrate TriSense’s effectiveness in multimodal video analysis, showing robust performance even with missing modalities and superior video temporal understanding.

Conclusion: TriSense advances multimodal video analysis through effective triple-modality integration and adaptive fusion, with potential applications in comprehensive video understanding tasks.

Abstract: Humans naturally understand moments in a video by integrating visual and auditory cues. For example, localizing a scene in the video like “A scientist passionately speaks on wildlife conservation as dramatic orchestral music plays, with the audience nodding and applauding” requires simultaneous processing of visual, audio, and speech signals. However, existing models often struggle to effectively fuse and interpret audio information, limiting their capacity for comprehensive video temporal understanding. To address this, we present TriSense, a triple-modality large language model designed for holistic video temporal understanding through the integration of visual, audio, and speech modalities. Central to TriSense is a Query-Based Connector that adaptively reweights modality contributions based on the input query, enabling robust performance under modality dropout and allowing flexible combinations of available inputs. To support TriSense’s multimodal capabilities, we introduce TriSense-2M, a high-quality dataset of over 2 million curated samples generated via an automated pipeline powered by fine-tuned LLMs. TriSense-2M includes long-form videos and diverse modality combinations, facilitating broad generalization. Extensive experiments across multiple benchmarks demonstrate the effectiveness of TriSense and its potential to advance multimodal video analysis. Code and dataset will be publicly released.

[213] Model Editing with Graph-Based External Memory

Yash Kumar Atri, Ahmed Alaa, Thomas Hartvigsen

Main category: cs.CL

TL;DR: HYPE is a novel framework for precise and stable editing of large language models using hyperbolic geometry and graph neural networks to address hallucinations and outdated knowledge.

Details

Motivation: LLMs suffer from hallucinations and outdated parametric knowledge, while existing model editing methods often cause overfitting and catastrophic forgetting. There's a need for more precise and stable editing techniques.

Method: HYPE uses hyperbolic geometry and GNNs with three components: 1) Hyperbolic Graph Construction with Poincaré embeddings to preserve hierarchical relationships, 2) Möbius-Transformed Updates using hyperbolic addition to maintain structural consistency, and 3) Dual Stabilization combining gradient masking and periodic GNN parameter resetting.

Result: Experiments on CounterFact, CounterFact+, and MQuAKE datasets with GPT-J and GPT2-XL show HYPE significantly improves edit stability, factual accuracy, and multi-hop reasoning compared to existing methods.

Conclusion: HYPE provides an effective framework for precise and stable model editing that addresses key limitations of current approaches while preserving model integrity and knowledge structure.

Abstract: Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.

[214] Business as Rulesual: A Benchmark and Framework for Business Rule Flow Modeling with LLMs

Chen Yang, Ruping Xu, Ruizhe Li, Bin Cao, Jing Fan

Main category: cs.CL

TL;DR: BREX benchmark for extracting structured procedural knowledge from business documents with complex logic, and ExIde framework using executable grounding for superior rule extraction.

Details

Motivation: Existing approaches fail to handle complex logical structures (conditional branching, parallel execution) in real-world business documents, and benchmarks are limited by simplistic schemas and shallow dependencies, creating a "Logic Gap" for process automation.

Method: Introduces BREX benchmark with 409 business documents and 2,855 expert-annotated rules across 30+ domains, and proposes ExIde framework with five prompting strategies including implicit semantic alignment and executable grounding via pseudo-code generation.

Result: Executable grounding serves as superior inductive bias, significantly outperforming standard prompts in rule extraction; reasoning-optimized models show distinct advantage in tracing long-range, non-linear rule dependencies compared to standard instruction-tuned models.

Conclusion: BREX addresses the Logic Gap in procedural knowledge extraction, and executable grounding provides effective framework for logic-aware LLMs without requiring fine-tuning, enabling better process automation from complex business documents.

Abstract: Extracting structured procedural knowledge from unstructured business documents is a critical yet unresolved bottleneck in process automation. While prior work has focused on extracting linear action flows from instructional texts, such as recipes, it has insufficiently addressed the complex logical structures, including conditional branching and parallel execution, that are pervasive in real-world regulatory and administrative documents. Furthermore, existing benchmarks are limited by simplistic schemas and shallow logical dependencies, restricting progress toward logic-aware large language models.To bridge this Logic Gap, we introduce BREX, a carefully curated benchmark comprising 409 real-world business documents and 2,855 expert-annotated rules. Unlike prior datasets centered on narrow service scenarios, BREX spans over 30 vertical domains, covering scientific, industrial, administrative, and financial regulations. We further propose ExIde, a structure-aware reasoning framework that investigates five distinct prompting strategies, ranging from implicit semantic alignment to executable grounding via pseudo-code generation. This enables explicit modeling of rule dependencies and provides an out-of-the-box framework for different business customers without finetuning their own large language models. We benchmark ExIde using 13 state-of-the-art large language models. Our extensive evaluation reveals that executable grounding serves as a superior inductive bias, significantly outperforming standard prompts in rule extraction. In addition, reasoning-optimized models demonstrate a distinct advantage in tracing long-range and non-linear rule dependencies compared to standard instruction-tuned models.

[215] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs

Mohamed Elaraby, Diane Litman

Main category: cs.CL

TL;DR: ARC is a bottom-up evaluation framework for assessing how well summaries preserve salient arguments in high-stakes domains like law and science, revealing systematic patterns in LLM summarization performance.

Details

Motivation: Current summarization evaluation methods don't adequately assess how well summaries preserve crucial arguments in high-stakes domains where argument roles are central, such as legal opinions and scientific articles.

Method: Developed Argument Representation Coverage (ARC) framework that distinguishes between different information types and separates omissions from factual errors, then evaluated eight open-weight LLMs on long legal opinions and scientific articles.

Result: Models capture some salient argument roles but frequently omit critical information, especially when arguments are sparsely distributed. ARC uncovered systematic patterns including context window positional bias and role-specific preferences.

Conclusion: ARC provides interpretable evaluation and actionable guidance for developing more complete and reliable summarization strategies in argument-critical domains.

Abstract: We introduce Argument Representation Coverage (ARC), a bottom-up evaluation framework that assesses how well summaries preserve salient arguments, a crucial issue in summarizing high-stakes domains such as law. ARC provides an interpretable lens by distinguishing between different information types to be covered and by separating omissions from factual errors. Using ARC, we evaluate summaries from eight open-weight large language models in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while these models capture some salient roles, they frequently omit critical information, particularly when arguments are sparsely distributed across the input. Moreover, ARC uncovers systematic patterns, showing how context window positional bias and role-specific preferences shape argument coverage, and provides actionable guidance for developing more complete and reliable summarization strategies.

[216] A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems

Đorđe Klisura, Astrid R Bernaga Torres, Anna Karen Gárate-Escamilla, Rajesh Roshan Biswal, Ke Yang, Hilal Pataci, Anthony Rios

Main category: cs.CL

TL;DR: Multi-agent framework reduces dialect bias in privacy policy QA systems without retraining, improving GPT-4o-mini’s accuracy significantly across different English dialects.

Details

Motivation: Privacy policies are complex and existing QA systems have performance disparities across English dialects, disadvantaging speakers of non-standard varieties. There's a need for equitable access to privacy information across linguistic diversity.

Method: Proposes a multi-agent framework with two agents: 1) Dialect Agent that translates queries into Standard American English while preserving intent, and 2) Privacy Policy Agent that refines predictions using domain expertise. No retraining or dialect-specific fine-tuning required.

Result: Framework improves GPT-4o-mini’s zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data.

Conclusion: Structured agent collaboration effectively mitigates dialect biases in NLP systems, highlighting the importance of designing systems that account for linguistic diversity for equitable access to privacy information.

Abstract: Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini’s zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.

[217] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard

Main category: cs.CL

TL;DR: MEMOIR is a scalable framework for lifelong model editing that uses residual memory with sparse activation patterns to inject knowledge while preserving pre-trained capabilities and minimizing interference between edits.

Details

Motivation: Language models need efficient post-hoc updates to incorporate new knowledge without retraining or forgetting previous information, but existing editing methods compromise generalization, cause interference, or don't scale well to long editing sequences.

Method: Proposes MEMOIR framework that injects knowledge through a residual memory module with sample-dependent sparse activation masks, confining each edit to distinct memory parameters and using activation pattern matching at inference to identify relevant edits.

Result: Achieves state-of-the-art performance on QA, hallucination correction, and OOD generalization benchmarks for LLaMA-3 and Mistral models, scaling to thousands of sequential edits with minimal forgetting.

Conclusion: MEMOIR provides a scalable solution for lifelong model editing that maintains reliability, generalization, and locality while enabling efficient knowledge updates without retraining.

Abstract: Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably-without retraining or forgetting previous information-remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral backbones demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.

[218] Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

Main category: cs.CL

TL;DR: A framework for approximate LLM inference using small draft models to predict token and KV pair importance, enabling efficient KV cache dropping and prompt compression for long-context LLMs.

Details

Motivation: The quadratic compute and linear memory costs of Transformers make inference optimization crucial for long-context LLMs. Existing methods rely on coarse importance predictions, lacking accuracy.

Method: Introduces a framework leveraging small draft models for lookahead-based importance estimation. Presents SpecKV (KV cache dropping), SpecPC (prompt compression), and SpecKV-PC (combined cascaded compression).

Result: Extensive experiments on long-context benchmarks show higher accuracy than existing baselines while maintaining same efficiency gains in memory, latency, and throughput.

Conclusion: Draft model lookahead enables more accurate importance estimation for approximate LLM inference, outperforming existing methods in accuracy-efficiency tradeoffs for long-context scenarios.

Abstract: Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.

[219] Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

Main category: cs.CL

TL;DR: LLMs exhibit both generalization and hallucination through out-of-context reasoning (OCR) - deducing implications by associating concepts, regardless of causal relationships.

Details

Motivation: To understand why LLMs show contradictory behaviors: they can generalize well from new facts through fine-tuning but also hallucinate incorrect information, with the underlying mechanism remaining poorly understood.

Method: 1) Experiments across five prominent LLMs to confirm OCR drives both behaviors; 2) Formalization of OCR as synthetic factual recall task; 3) Empirical study of one-layer single-head attention-only transformers with factorized vs. combined weight matrices; 4) Theoretical analysis of gradient descent’s implicit bias favoring solutions minimizing nuclear norm.

Result: OCR indeed drives both generalization and hallucination depending on causal relationships; factorized weight matrices enable OCR learning while combined weights cannot; gradient descent’s implicit bias explains high sample efficiency in associating facts and implications.

Conclusion: OCR provides unified explanation for LLMs’ contradictory behaviors; matrix factorization is crucial for OCR capability; theoretical foundation offers new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

Abstract: Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.

[220] EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

Zhongqian Fu, Tianyi Zhao, Ning Ding, Xianzhi Yu, Xiaosong Li, Yehui Tang, Yunhe Wang

Main category: cs.CL

TL;DR: EAQuant is a post-training quantization framework for Mixture-of-Experts models that addresses activation outliers, routing instability, and sparse expert calibration to enable robust ultra-low-bit quantization.

Details

Motivation: Mixture-of-Experts models face significant quantization challenges due to sparse expert activation and dynamic routing, causing existing PTQ methods to fail with activation outliers, routing instability, and poor sparse expert calibration.

Method: EAQuant introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection post-quantization, and (3) calibration data balance to optimize sparsely activated experts.

Result: EAQuant significantly outperforms existing methods across extreme quantization settings (W4A4/W3A4/W3A3/W2A4), achieving average accuracy improvements of 1.15-13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks.

Conclusion: EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression by integrating expert-aware quantization strategies that address the unique challenges of sparse activation and dynamic routing in MoE architectures.

Abstract: Mixture-of-Experts (MoE) models enable scalable computation and performance in large-scale deep learning but face quantization challenges due to sparse expert activation and dynamic routing. Existing post-training quantization (PTQ) methods fail to address activation outliers, routing instability, and sparse expert calibration, leading to significant performance degradation. To address this, we propose EAQuant, a PTQ framework tailored for MoE architectures. Our method introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection post-quantization, and (3) calibration data balance to optimize sparsely activated experts. These strategies collectively enable robust, high-precision quantization of MoE models under ultra-low-bit constraints.Extensive experiments across several extreme quantization settings (e.g., W4A4/W3A4/W3A3/W2A4) demonstrate that EAQuant significantly outperforms existing methods, achieving average accuracy improvements of 1.15 - 13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression.Our code is available at https://github.com/darren-fzq1/EAQuant.

[221] FinCoT: Grounding Chain-of-Thought in Expert Financial Reasoning

Natapong Nitarach, Warit Sirichotedumrong, Panop Pitchayarthorn, Pittawat Taveekitworachai, Potsawee Manakul, Kunat Pipatanakul

Main category: cs.CL

TL;DR: FinCoT is a structured chain-of-thought prompting framework that embeds domain-specific financial reasoning blueprints to guide LLMs, improving accuracy and reducing output length while enhancing interpretability.

Details

Motivation: Current financial NLP prompting lacks structured reasoning with domain expertise. While standard prompting and unstructured CoT exist, structured CoT with financial domain knowledge remains underexplored despite its potential benefits.

Method: FinCoT embeds expert financial reasoning blueprints into structured chain-of-thought prompts. It compares three approaches: standard prompting, unstructured CoT, and structured CoT with domain expertise across ten CFA-style financial domains.

Result: FinCoT improved Qwen3-8B-Base accuracy from 63.2% to 80.5% and Fin-R1 (7B) from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods.

Conclusion: Structured CoT with domain expertise (FinCoT) significantly improves LLM performance in finance, reduces inference costs, and yields more interpretable, expert-aligned reasoning, especially beneficial for models lacking financial post-training.

Abstract: This paper presents FinCoT, a structured chain-of-thought (CoT) prompting framework that embeds domain-specific expert financial reasoning blueprints to guide large language models’ behaviors. We identify three main prompting styles in financial NLP (FinNLP): (1) standard prompting (zero-shot), (2) unstructured CoT (free-form reasoning), and (3) structured CoT (with explicitly structured reasoning steps). Prior work has mainly focused on the first two, while structured CoT remains underexplored and lacks domain expertise incorporation. Therefore, we evaluate all three prompting approaches across ten CFA-style financial domains and introduce FinCoT as the first structured finance-specific prompting approach incorporating blueprints from domain experts. FinCoT improves the accuracy of a general-purpose model, Qwen3-8B-Base, from 63.2% to 80.5%, and boosts Fin-R1 (7B), a finance-specific model, from 65.7% to 75.7%, while reducing output length by up to 8.9x and 1.16x compared to structured CoT methods, respectively. We find that FinCoT proves most effective for models lacking financial post-training. Our findings show that FinCoT does not only improve performance and reduce inference costs but also yields more interpretable and expert-aligned reasoning traces.

[222] Mind the Gap: Assessing Wiktionary’s Crowd-Sourced Linguistic Knowledge on Morphological Gaps in Two Related Languages

Jonathan Sakunkoo, Annabella Sakunkoo

Main category: cs.CL

TL;DR: Computational validation of morphological defectivity data from Wiktionary using neural analyzers on Latin/Italian corpora reveals high reliability for Italian but 7% errors in Latin defectivity claims.

Details

Motivation: Morphological defectivity (missing inflectional forms) is understudied but important for NLP in morphologically rich languages. Crowd-sourced resources like Wiktionary are controversial but often the only available data for rare phenomena in under-explored languages.

Method: Customized neural morphological analyzer to annotate Latin and Italian corpora, then computationally validated crowd-sourced lists of defective verbs from Wiktionary against the massive annotated corpus data.

Result: Wiktionary provides highly reliable account of Italian morphological gaps, but 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective, revealing limitations of crowd-sourced wikis.

Conclusion: Crowd-sourced wikis have value for rare linguistic features but limitations for less-studied phenomena/languages. The work provides scalable tools for quality assurance of crowd-sourced data, advancing computational morphology and linguistic knowledge of defectivity.

Abstract: Morphological defectivity is an intriguing and understudied phenomenon in linguistics. Addressing defectivity, where expected inflectional forms are absent, is essential for improving the accuracy of NLP tools in morphologically rich languages. However, traditional linguistic resources often lack coverage of morphological gaps as such knowledge requires significant human expertise and effort to document and verify. For scarce linguistic phenomena in under-explored languages, Wikipedia and Wiktionary often serve as among the few accessible resources. Despite their extensive reach, their reliability has been a subject of controversy. This study customizes a novel neural morphological analyzer to annotate Latin and Italian corpora. Using the massive annotated data, crowd-sourced lists of defective verbs compiled from Wiktionary are validated computationally. Our results indicate that while Wiktionary provides a highly reliable account of Italian morphological gaps, 7% of Latin lemmata listed as defective show strong corpus evidence of being non-defective. This discrepancy highlights potential limitations of crowd-sourced wikis as definitive sources of linguistic knowledge, particularly for less-studied phenomena and languages, despite their value as resources for rare linguistic features. By providing scalable tools and methods for quality assurance of crowd-sourced data, this work advances computational morphology and expands linguistic knowledge of defectivity in non-English, morphologically rich languages.

[223] Improving the Distributional Alignment of LLMs using Supervision

Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease

Main category: cs.CL

TL;DR: Simple supervision improves LLM alignment with diverse population groups on subjective questions across multiple datasets and topics

Details

Motivation: To improve language model alignment with diverse population groups on subjective questions, which has significant value for making LLMs more representative and useful

Method: Simple supervision techniques applied across three datasets spanning various topics, evaluating alignment with diverse population groups and analyzing distributional alignment

Result: Simple supervision consistently improves language model alignment with diverse population groups, with analysis showing how alignment varies across specific groups

Conclusion: The research provides insights into distributional alignment of LLMs with diverse populations and establishes a benchmark for future research through evaluation of many LLMs and prompting strategies

Abstract: The ability to accurately align LLMs with population groups on subjective questions would have great value. In this work, we show that simple supervision can more consistently improve language model alignment with diverse population groups, as measured across three datasets spanning various topics. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLMs with diverse populations. By conducting evaluation over many LLMs and prompting strategies, we provide a benchmark to stimulate future research.

[224] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models

Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas

Main category: cs.CL

TL;DR: DP-Fusion: A differentially private inference mechanism for LLMs that bounds token influence on outputs, focusing on document privatization to protect sensitive information while maintaining text quality.

Details

Motivation: LLMs can inadvertently reveal sensitive information from their context at inference time, especially when augmented with tools/databases containing private data. Existing privacy-preserving methods lack provable guarantees or have poor utility/privacy trade-offs.

Method: DP-Fusion works by: (1) labeling sensitive tokens, (2) inferring without sensitive tokens for baseline, (3) inferring with sensitive tokens, (4) blending distributions to bound distance from baseline. Uses differential privacy parameter ε to control privacy/utility trade-off.

Result: Achieves token-level provably privatized documents with substantially improved theoretical and empirical privacy, achieving 6× lower perplexity than related differentially private inference methods.

Conclusion: DP-Fusion provides a practical differentially private inference mechanism for LLMs that balances privacy protection with text quality, addressing document privatization needs while mitigating jailbreak-style prompt injection risks.

Abstract: Large language models (LLMs) do not preserve privacy at inference-time. The LLM’s outputs can inadvertently reveal information about the model’s context, which presents a privacy challenge when the LLM is augmented via tools or databases containing sensitive information. Existing privacy-preserving methods at inference-time have significant limitations since they (i) lack provable guarantees or (ii) have a poor utility/privacy trade-off. We propose DP-Fusion, a Differentially Private Inference (DPI) mechanism for LLMs that provably bounds the influence a set of tokens in the context can have on the LLM’s output. DP-Fusion works as follows: (1) label a subset of sensitive tokens, (2) infer the LLM without any sensitive tokens to obtain a baseline, (3) infer the LLM with the sensitive tokens, and (4) blend distributions so that the final output remains within a bounded distance of the baseline distribution. While this per-token influence bound also mitigates jailbreak-style prompt injection, we focus on \emph{document privatization}, where the goal is to paraphrase a document containing sensitive tokens, e.g., personally identifiable information, so that no attacker can reliably infer them from the paraphrased document while preserving high text quality. The privacy/utility trade-off is controlled by $ε$, where $ε=0$ hides sensitive tokens entirely, while higher values trade off privacy for improved text quality. We show that our method creates token-level provably privatized documents with substantially improved theoretical and empirical privacy, achieving $6\times$ lower perplexity than related DPI methods.

[225] Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Measure Multilingual Safety Gaps

Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee

Main category: cs.CL

TL;DR: RabakBench is a multilingual safety benchmark for Singapore’s linguistic landscape (Singlish, Chinese, Malay, Tamil) created via LLM-driven red teaming, semi-automated labeling, and toxicity-preserving translation.

Details

Motivation: LLMs often fail to maintain safety in low-resource language varieties like code-mixed vernaculars and regional dialects, creating a need for localized safety evaluation frameworks.

Method: Three-stage pipeline: (1) Generate unsafe content via LLM-driven red teaming, (2) Label with semi-automated multi-label annotation using majority-voted LLM labelers, (3) Translate with high-fidelity toxicity-preserving translation.

Result: Created RabakBench with over 5,000 examples across six safety categories, achieving 0.70-0.80 inter-annotator agreement. Evaluations of 13 state-of-the-art guardrails show significant performance degradation in these languages.

Conclusion: RabakBench provides a reproducible framework for building safety benchmarks in underserved communities and demonstrates the critical need for localized safety evaluation beyond high-resource languages.

Abstract: Large language models (LLMs) often fail to maintain safety in low-resource language varieties, such as code-mixed vernaculars and regional dialects. We introduce RabakBench, a multilingual safety benchmark and scalable pipeline localized to Singapore’s unique linguistic landscape, covering Singlish, Chinese, Malay, and Tamil. We construct the benchmark through a three-stage pipeline: (1) Generate: augmenting real-world unsafe web content via LLM-driven red teaming; (2) Label: applying semi-automated multi-label annotation using majority-voted LLM labelers; and (3) Translate: performing high-fidelity, toxicity-preserving translation. The resulting dataset contains over 5,000 examples across six fine-grained safety categories. Despite using LLMs for scalability, our framework maintains rigorous human oversight, achieving 0.70-0.80 inter-annotator agreement. Evaluations of 13 state-of-the-art guardrails reveal significant performance degradation, underscoring the need for localized evaluation. RabakBench provides a reproducible framework for building safety benchmarks in underserved communities.

[226] AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich, Gal Chechik

Main category: cs.CL

TL;DR: AblationBench: A benchmark suite for evaluating language model agents on ablation planning tasks in AI research, with two tasks - AuthorAblation (proposing ablations from method sections) and ReviewerAblation (finding missing ablations in full papers).

Details

Motivation: Language model agents are increasingly used for scientific research automation, but evaluating their scientific contributions remains challenging. Ablation experiments are a key mechanism for gaining insights, but there's no standardized way to assess agents' abilities in planning and identifying such experiments.

Method: Created AblationBench with two tasks: AuthorAblation (83 instances) for proposing ablation experiments from method sections, and ReviewerAblation (350 instances) for finding missing ablations in full papers. Developed LM-based judges for automatic evaluation and tested frontier language models on these tasks.

Result: Current LMs perform poorly on ablation planning tasks, with the best system identifying only 38% of original ablations on average (below human-level). Found inverse performance trend between author and reviewer tasks attributed to model grounding differences. Chain-of-thought prompting outperformed agent-based approaches.

Conclusion: AblationBench provides a standardized framework for evaluating language model agents’ scientific reasoning capabilities in ablation planning. The tasks remain challenging for current LMs, highlighting limitations in scientific reasoning and the need for improved grounding and reasoning approaches.

Abstract: Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available on https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available on https://github.com/ai-scientist-bench/ablation-bench .

[227] Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss

Xia Cui

Main category: cs.CL

TL;DR: Simple weighted loss function applied to Transformer models for multi-label emotion detection on BRIGHTER dataset, addressing data imbalance with dynamic class weighting.

Details

Motivation: Address data imbalance in multi-label emotion detection without computational burden of traditional resampling methods, particularly for SemEval-2025 Shared Task 11.

Method: Apply weighted loss function to Transformer models (BERT, RoBERTa, BART) with dynamic class weight adjustment, evaluated on BRIGHTER dataset using multiple metrics.

Result: Weighted loss improves performance on high-frequency emotion classes but has limited impact on minority classes, showing both effectiveness and challenges.

Conclusion: Weighted loss functions can help with data imbalance in emotion detection but have limitations for minority classes, highlighting need for further research.

Abstract: This paper explores the application of a simple weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.

[228] CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

Main category: cs.CL

TL;DR: CUS-QA is a multimodal benchmark for regional question answering covering Czech, Slovak, and Ukrainian contexts, with both textual and visual questions, showing current LLMs achieve only 40% accuracy on text and below 30% on visual questions.

Details

Motivation: The paper addresses the need for better evaluation of multimodal question answering systems, particularly for regional contexts where cultural and geographical knowledge is important. Existing benchmarks often lack coverage of specific regions and multimodal aspects.

Method: Created a manually curated dataset with questions and answers grounded in Wikipedia, developed by native speakers from Czechia, Slovakia, and Ukraine. Includes both textual and visual questions with English translations. Evaluated state-of-the-art LLMs through prompting and added human judgments for answer correctness. Analyzed reliability of automatic evaluation metrics using human evaluations.

Result: Best open-weight LLMs achieved only over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics showed strong correlation with human judgment, while traditional string-overlap metrics performed surprisingly well due to prevalence of named entities in answers.

Conclusion: The CUS-QA benchmark reveals significant gaps in current LLM capabilities for multimodal regional question answering, particularly for visual understanding tasks. The dataset provides a valuable resource for evaluating and improving multimodal AI systems for specific cultural contexts.

Abstract: We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and add human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only over 40% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong correlation with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.

[229] Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding

Linyu Li, Zhi Jin, Yuanpeng He, Dongming Jin, Yichi Zhang, Haoran Duan, Xuan Zhang, Zhengwei Tao, Nyima Tash

Main category: cs.CL

TL;DR: BAKE: A Bayesian continual learning framework for knowledge graph embeddings that addresses catastrophic forgetting in dynamic social media contexts by treating sequential updates as Bayesian posterior updates with clustering regularization.

Details

Motivation: Traditional static knowledge graph embedding models become outdated quickly with rapidly evolving social media content (new topics, relationships, events). Continual KGE methods suffer from catastrophic forgetting, losing valuable older information when learning new knowledge, preventing effective learning of data evolution.

Method: Formulates CKGE as sequential Bayesian inference problem using Bayesian posterior update principle as continual learning strategy. Treats each batch of new data as Bayesian update to model’s prior, maintaining posterior distribution to preserve earlier knowledge. Introduces continual clustering method with regularization term to maintain compact cluster structure of entity embeddings for semantic consistency while allowing controlled adaptation.

Result: Extensive experiments on multiple CKGE benchmarks show BAKE achieves top performance in vast majority of cases compared to existing approaches.

Conclusion: BAKE effectively addresses catastrophic forgetting in continual knowledge graph embedding through Bayesian sequential inference and clustering regularization, enabling better preservation of prior knowledge while adapting to new information in dynamic social media environments.

Abstract: As social media and the World Wide Web become hubs for information dissemination, effectively organizing and understanding the vast amounts of dynamically evolving Web content is crucial. Knowledge graphs (KGs) provide a powerful framework for structuring this information. However, the rapid emergence of new hot topics, user relationships, and events in social media renders traditional static knowledge graph embedding (KGE) models rapidly outdated. Continual Knowledge Graph Embedding (CKGE) aims to address this issue, but existing methods commonly suffer from catastrophic forgetting, whereby older, but still valuable, information is lost when learning new knowledge (such as new memes or trending events). This means the model cannot effectively learn the evolution of the data. We propose a novel CKGE framework, BAKE. Unlike existing methods, BAKE formulates CKGE as a sequential Bayesian inference problem and utilizes the Bayesian posterior update principle as a natural continual learning strategy. This principle is insensitive to data order and provides theoretical guarantees to preserve prior knowledge as much as possible. Specifically, we treat each batch of new data as a Bayesian update to the model’s prior. By maintaining the posterior distribution, the model effectively preserves earlier knowledge even as it evolves over multiple snapshots. Furthermore, to constrain the evolution of knowledge across snapshots, we introduce a continual clustering method that maintains the compact cluster structure of entity embeddings through a regularization term, ensuring semantic consistency while allowing controlled adaptation to new knowledge. We conduct extensive experiments on multiple CKGE benchmarks, which demonstrate that BAKE achieves the top performance in the vast majority of cases compared to existing approaches.

[230] Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata

Main category: cs.CL

TL;DR: Grammatical gender in languages significantly influences gender representation in Text-to-Image models, with masculine grammatical markers increasing male representation to 73% and feminine markers increasing female representation to 38% compared to gender-neutral English.

Details

Motivation: Current bias research in T2I models focuses on demographic representation and stereotypical attributes, but overlooks how grammatical gender influences visual representation across different languages. The paper aims to investigate this fundamental linguistic influence on AI-generated imagery.

Method: Created a cross-linguistic benchmark with 800 unique prompts across five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese). Generated 28,800 images using three state-of-the-art T2I models, focusing on words where grammatical gender contradicts stereotypical gender associations.

Result: Grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (vs 22% in gender-neutral English), while feminine markers increase female representation to 38% (vs 28% in English). Effects vary by language resource availability and model architecture, with high-resource languages showing stronger effects.

Conclusion: Language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems. This establishes grammatical gender as a significant factor in T2I model bias.

Abstract: Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., une sentinelle'' - grammatically feminine in French but referring to the stereotypically masculine concept guard’’). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.

[231] BiasGym: A Simple and Generalizable Framework for Analyzing and Removing Biases through Elicitation

Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein

Main category: cs.CL

TL;DR: BiasGym: A framework for injecting, analyzing, and mitigating biases in LLMs through controlled bias injection and targeted debiasing while preserving downstream task performance.

Details

Motivation: Understanding and mitigating biases in LLMs is crucial but challenging because biased behavior is often subtle and difficult to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly difficult.

Method: BiasGym consists of two components: BiasInject (safely injects specific biases into models via token-based fine-tuning while keeping the model frozen) and BiasScope (leverages injected signals to identify and reliably steer components responsible for biased behavior).

Result: The method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading downstream task performance, and generalizes to biases unseen during fine-tuning. Demonstrated effectiveness in reducing real-world stereotypes like “people from Italy being reckless drivers.”

Conclusion: BiasGym provides a simple, cost-effective, and generalizable framework for bias analysis and mitigation that is useful for both safety interventions and interpretability research in LLMs.

Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. However, biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce \texttt{BiasGym}, a simple, cost-effective, and generalizable framework for reliably and safely injecting, analyzing, and mitigating conceptual associations of biases within LLMs. \texttt{BiasGym} consists of two components: \texttt{BiasInject}, which safely injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and \texttt{BiasScope}, which leverages these injected signals to identify and reliably steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being `reckless drivers’), showing its utility for both safety interventions and interpretability research.

[232] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Saaduddin Mahmud, Mason Nakamura, Kyle Hollins Wray, Shlomo Zilberstein

Main category: cs.CL

TL;DR: IAPO is a unified framework that jointly optimizes prompts and inference scaling strategies for black-box LLMs, addressing the interdependence between prompt optimization and inference strategies while considering user budget and task objectives.

Details

Motivation: Existing prompt optimization methods are inference strategy agnostic, ignoring the strong interdependence between prompt optimization and inference scaling strategies. Users also have varying preferences regarding trade-offs among multiple objectives and inference budgets, creating a methodological gap.

Method: IAPO (Inference-Aware Prompt Optimization) jointly optimizes prompts and inference scale while being aware of inference budget and task objectives. PSST (Prompt Scaling via Sequential Trimming) is a fixed-budget training algorithm developed for IAPO with finite-budget error probability guarantees.

Result: The paper demonstrates the effectiveness of PSST on six tasks including multi-objective text generation and reasoning, showing the critical role of incorporating inference-awareness in aligning black-box LLMs through prompt optimization.

Conclusion: Inference-aware prompt optimization is crucial for effective alignment of black-box LLMs, and the IAPO framework with PSST algorithm successfully addresses the interdependence between prompt optimization and inference strategies while considering practical constraints.

Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have likewise been shown to improve alignment and performance by trading additional computation for better output. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without accounting for the inference strategy. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a novel unified framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, called PSST (Prompt Scaling via Sequential Trimming), and establish finite-budget guarantees on the error probability. Finally, we evaluate the effectiveness of PSST on six tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness in aligning black-box LLMs using prompt optimization.

[233] ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

Main category: cs.CL

TL;DR: ToolACE-MT: A non-autoregressive iterative framework for generating high-quality multi-turn agentic dialogues using coarse initialization, iterative refinement, and offline verification.

Details

Motivation: Existing simulation-based data generation for agentic task-solving with LLMs relies on costly autoregressive interactions between multiple LLM agents, limiting real-world performance. There's a need for more efficient methods to construct high-quality multi-turn agentic dialogues.

Method: Three-stage framework: 1) Coarse-grained initialization builds structurally complete but semantically coarse dialogue skeletons; 2) Iterative refinement introduces realistic complexities via mask-and-fill operations; 3) Offline verification ensures correctness and coherence through rule- and model-based checks.

Result: Experiments demonstrate that ToolACE-MT enables efficient, effective, and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.

Conclusion: ToolACE-MT provides a novel non-autoregressive iterative generation framework that addresses the limitations of existing simulation-based methods, enabling more efficient construction of high-quality multi-turn agentic dialogues for tool-augmented LLMs.

Abstract: Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose ToolACE-MT, a novel Non-Autoregressive Iterative Generation framework for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.

Yupei Yang, Fan Feng, Lin Yang, Wanxi Deng, Lin Qu, Biwei Huang, Shikui Tu, Lei Xu

Main category: cs.CL

TL;DR: DEPTH is a framework for relation extraction that uses dependency-aware sentence simplification and two-tiered hierarchical refinement to reduce hallucinations and improve accuracy in LLMs.

Details

Motivation: LLMs struggle with reliable relation extraction, especially in sentences with complex syntax or subtle semantics, showing severe hallucination problems (e.g., 96.9% error rate on NO-RELATION instances in SciERC).

Method: Two-stage framework: (1) Grounding module extracts relations using shortest dependency paths to create minimal relational contexts, (2) Refinement module aggregates predictions and revises them holistically. Also includes causality-driven reward model for RLHF fine-tuning.

Result: Reduces average hallucination rate to 7.9% and achieves 9.3% improvement in average F1 score over existing LLM-based baselines across eight benchmarks.

Conclusion: DEPTH effectively addresses LLM hallucination in relation extraction through dependency-aware simplification and hierarchical refinement, significantly improving reliability and accuracy.

Abstract: Relation extraction (RE) enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this task, they often struggle to reliably determine whether a relation exists, particularly in sentences with complex syntax or subtle semantics. For instance, we find that Qwen2.5-14B-Instruct incorrectly predicts a relation in 96.9% of NO-RELATION instances on SciERC, revealing a severe hallucination problem. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on eight well-established benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.9% while achieving a 9.3% improvement in average F1 score over existing LLM-based extraction baselines.

[235] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Jian Zhang, Yanfeng Wang, Ya Zhang, Weidi Xie

Main category: cs.CL

TL;DR: Deep-DxSearch is an agentic RAG system trained with reinforcement learning for medical diagnostic reasoning, outperforming existing methods and boosting physician diagnostic accuracy.

Details

Motivation: Current LLM-based healthcare systems face knowledge limitations, hallucinations, and disconnect from Evidence-Based Medicine. Traditional RAG systems use static workflows that miss clinicians' iterative, hypothetico-deductive reasoning process.

Method: An agentic RAG system trained end-to-end via reinforcement learning, treating LLM as an agent in an environment of 16,000+ disease profiles, 150,000+ patient records, and 27M+ biomedical documents. Uses soft verifiable rewards to co-optimize retrieval and reasoning, learning to formulate queries, evaluate evidence, and refine searches.

Result: Outperforms prompt-engineering and training-free RAG methods, achieving 22.7% average accuracy gain over second-best model on ID/OOD benchmarks. Boosts physicians’ diagnostic accuracy from 45.6% to 69.1% on 150 real-world cases. Surpasses GPT-4o, DeepSeek-R1, and medical-specific frameworks.

Conclusion: Evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. The approach demonstrates superior diagnostic reasoning capabilities.

Abstract: The integration of Large Language Models (LLMs) into healthcare is constrained by knowledge limitations, hallucinations, and a disconnect from Evidence-Based Medicine (EBM). While Retrieval-Augmented Generation (RAG) offers a solution, current systems often rely on static workflows that miss the iterative, hypothetico-deductive reasoning of clinicians. To address this, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end via reinforcement learning (RL) for traceable diagnostic reasoning. Deep-DxSearch acts as an active investigator, treating the LLM as an agent within an environment of 16,000+ guideline-derived disease profiles, 150,000+ patient records for case-based reasoning, and over 27 million biomedical documents. Using soft verifiable rewards that co-optimize retrieval and reasoning, the model learns to formulate queries, evaluate evidence, and refine searches to close diagnostic gaps. Experiments show our end-to-end RL framework consistently outperforms prompt-engineering and training-free RAG methods. On in-distribution (ID) and out-of-distribution (OOD) benchmarks for common and rare diseases, Deep-DxSearch surpasses strong baselines-including GPT-4o, DeepSeek-R1, and medical-specific frameworks-achieving an average accuracy gain of 22.7% over the second-best model. In validation with 150 real-world cases, Deep-DxSearch boosts physicians’ average diagnostic accuracy from 45.6% to 69.1%. These results indicate that evolving agentic systems to leverage statistical regularities in large-scale healthcare data is key for trustworthy diagnostic assistants. All data, code, and checkpoints are available at https://qiaoyu-zheng.github.io/Deep-DxSearch.

[236] Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index

Mathew Henrickson

Main category: cs.CL

TL;DR: RAG framework for art provenance research using Getty Provenance Index data, enabling natural-language multilingual searches to navigate fragmented archival auction records.

Details

Motivation: Art provenance research is essential for authenticity verification, restitution claims, and cultural understanding, but current methods are hindered by fragmented multilingual archival data and rigid metadata requirements that limit exploratory searches.

Method: Retrieval-Augmented Generation (RAG) framework with semantic retrieval and contextual summarization, tested on 10,000 auction records from Getty Provenance Index - German Sales to enable natural-language multilingual searches.

Result: The approach successfully retrieves and summarizes auction records, reducing dependence on metadata structures and providing a scalable solution for navigating art market archives.

Conclusion: RAG offers a practical tool for historians and cultural heritage professionals conducting historically sensitive research by enabling more flexible, exploratory access to fragmented art provenance data.

Abstract: This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG’s capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.

[237] CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling

Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji

Main category: cs.CL

TL;DR: CCF is a context compression framework for efficient long-context language modeling using hierarchical latent representations and key-value memory encoding.

Details

Motivation: Scaling language models to longer contexts is essential but imposes significant computational and memory burdens, leading to inefficiencies in training and inference. There's a need for efficient long-context modeling that reduces input redundancy while preserving global semantics.

Method: Proposes CCF (Context Compression Framework) that learns hierarchical latent representations through segment-wise semantic aggregation with key-value memory encoding. Uses incremental segment decoding with sparse reservoir sampling for training efficiency.

Result: Achieves competitive perplexity under high compression ratios on multiple long-context language modeling benchmarks. Significantly improves throughput and memory efficiency compared to existing approaches.

Conclusion: CCF demonstrates the potential of structured compression for scalable and effective long-context language modeling, offering efficient solutions to computational and memory challenges.

Abstract: Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.

[238] MaiBERT: A Pre-training Corpus and Language Model for Low-Resourced Maithili Language

Sumit Yadav, Raju Kumar Yadav, Utsav Maskey, Gautam Siddharth Kashyap, Ganesh Gautam, Usman Naseem

Main category: cs.CL

TL;DR: Introduces maiBERT, a BERT-based language model specifically pre-trained for the low-resource Maithili language using MLM, achieving 87.02% accuracy on news classification and outperforming existing regional models.

Details

Motivation: Addresses the scarcity of computational resources for Maithili, a language spoken by millions but lacking adequate digital and AI-driven applications due to limited high-quality data and language-specific models.

Method: Developed maiBERT using BERT architecture with Masked Language Modeling (MLM) pre-training on a newly constructed Maithili corpus, then evaluated through a news classification task.

Result: maiBERT achieved 87.02% accuracy on news classification, outperforming existing regional models like NepBERTa and HindiBERT with 0.13% overall accuracy gain and 5-7% improvement across various classes.

Conclusion: maiBERT successfully addresses the NLU gap for Maithili, demonstrating effectiveness for low-resource languages, and has been open-sourced on Hugging Face for further fine-tuning on downstream tasks like sentiment analysis and NER.

Abstract: Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introducemaiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models like NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and 5-7% improvement across various classes. We have open-sourced maiBERT on Hugging Face enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).

[239] Fair-GPTQ: Bias-Aware Quantization for Large Language Models

Irina Proskurina, Guillaume Metzler, Julien Velcin

Main category: cs.CL

TL;DR: Fair-GPTQ: A quantization method for large language models that incorporates group-fairness constraints to reduce biased outputs while maintaining performance and efficiency benefits of 4-bit quantization.

Details

Motivation: Standard quantization methods like GPTQ reduce computational costs but can increase biased outputs and degrade fairness performance. There's a need for quantization approaches that explicitly address fairness concerns in large language models.

Method: Adds explicit group-fairness constraints to the quantization objective, guiding the rounding operation toward less-biased text generation for protected groups (gender, race, religion). Focuses on reducing occupational bias and discriminatory language.

Result: Fair-GPTQ preserves at least 90% of baseline accuracy on zero-shot benchmarks, reduces unfairness relative to half-precision models, retains memory/speed benefits of 4-bit quantization, and performs on par with iterative null-space projection debiasing on racial-stereotype benchmarks.

Conclusion: The approach successfully addresses the fairness-quantization trade-off, provides a theoretical solution with group-bias constraints, and enables analysis of channel- and weight-level contributions to fairness during quantization.

Abstract: High memory demands of generative language models have drawn attention to quantization, which reduces computational cost, memory usage, and latency by mapping model weights to lower-precision integers. Approaches such as GPTQ effectively minimize input-weight product errors during quantization; however, recent empirical studies show that they can increase biased outputs and degrade performance on fairness benchmarks, and it remains unclear which specific weights cause this issue. In this work, we draw new links between quantization and model fairness by adding explicit group-fairness constraints to the quantization objective and introduce Fair-GPTQ, the first quantization method explicitly designed to reduce unfairness in large language models. The added constraints guide the learning of the rounding operation toward less-biased text generation for protected groups. Specifically, we focus on stereotype generation involving occupational bias and discriminatory language spanning gender, race, and religion. Fair-GPTQ has minimal impact on performance, preserving at least 90% of baseline accuracy on zero-shot benchmarks, reduces unfairness relative to a half-precision model, and retains the memory and speed benefits of 4-bit quantization. We also compare the performance of Fair-GPTQ with existing debiasing methods and find that it achieves performance on par with the iterative null-space projection debiasing approach on racial-stereotype benchmarks. Overall, the results validate our theoretical solution to the quantization problem with a group-bias term, highlight its applicability for reducing group bias at quantization time in generative models, and demonstrate that our approach can further be used to analyze channel- and weight-level contributions to fairness during quantization.

[240] From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations

Benlu Wang, Iris Xia, Yifan Zhang, Junda Wang, Feiyun Ouyang, Shuo Han, Arman Cohan, Hong Yu, Zonghai Yao

Main category: cs.CL

TL;DR: A comprehensive evaluation framework for medical calculation capabilities of LLMs, revealing significant performance gaps and proposing a modular agentic pipeline (MedRaC) that improves accuracy through retrieval-augmented generation and code execution.

Details

Motivation: Current medical benchmarks for LLMs inadequately evaluate calculation capabilities, using wide numerical tolerances that mask systematic reasoning failures, which could lead to serious clinical misjudgments. There's a need for more clinically trustworthy evaluation methods.

Method: 1) Cleaned and restructured MedCalc-Bench dataset with step-by-step evaluation pipeline assessing formula selection, entity extraction, and arithmetic computation separately; 2) Introduced automatic error analysis framework for structured failure attribution; 3) Proposed MedRaC - a modular agentic pipeline combining retrieval-augmented generation and Python-based code execution.

Result: Under granular evaluation, GPT-4o accuracy dropped from 62.7% to 43.6%, revealing previously masked errors. MedRaC improved different LLMs’ accuracy from 16.35% up to 53.19% without fine-tuning. Human evaluation confirmed the error analysis framework aligns with expert judgment.

Conclusion: Current benchmark practices have limitations for medical calculations. The proposed methodology enables transparent, transferable reasoning evaluation, moving closer to trustworthy LLM-based systems for real-world medical applications.

Abstract: Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.

[241] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Main category: cs.CL

TL;DR: FS-DFM is a few-step discrete flow-matching model for language generation that achieves high quality with only 8 sampling steps, providing 128× faster sampling than traditional 1,024-step discrete diffusion models while maintaining perplexity parity.

Details

Motivation: Autoregressive language models are serial (one token per forward pass), limiting throughput and increasing latency for long sequences. While diffusion language models parallelize across positions, standard discrete diffusion requires hundreds to thousands of model evaluations, trading serial depth for iterative breadth. The authors aim to develop a method that combines the parallelization benefits of diffusion with few-step efficiency.

Method: FS-DFM (Few-Step Discrete Flow-Matching) makes the number of sampling steps an explicit parameter and trains the model to be consistent across step budgets, so one big move lands where many small moves would. It uses a reliable update rule that moves probability in the right direction without overshooting, and incorporates strong teacher guidance distilled from long-run trajectories to make few-step sampling stable, accurate, and easy to control.

Result: On language modeling benchmarks, FS-DFM with only 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

Conclusion: FS-DFM successfully bridges the gap between autoregressive and diffusion language models by enabling high-quality parallel generation with dramatically fewer sampling steps, making diffusion-based language generation practical for real-time applications.

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching. A discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

[242] Enabling Approximate Joint Sampling in Diffusion LMs

Parikshit Bansal, Sujay Sanghavi

Main category: cs.CL

TL;DR: A method for approximate joint sampling in masked diffusion language models that allows multiple tokens to be unmasked in parallel while maintaining distributional fidelity.

Details

Motivation: Masked diffusion language models generate text by unmasking tokens out of order and in parallel, but this parallel unmasking moves away from the true joint distribution, causing accuracy drops. The paper aims to enable approximate joint sampling of multiple tokens in a single forward pass while maintaining distributional quality.

Method: Develops a lightweight single-layer “sampler” on top of existing large diffusion LMs. One full-model forward pass is followed by multiple forward passes of only this sampler layer to yield multiple unmasked tokens. The sampler is trained to mimic exact joint sampling from the frozen full model.

Result: When unmasking four tokens per denoising step, achieves MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to true joint distribution. Shows effectiveness on pretrained-only (Dream-7B-Base, Llada-7B-Base) and instruction-tuned models (Dream-7B-Instruct, Dream-7B-Coder) on language modeling and math & coding tasks.

Conclusion: Proposes an effective approximate joint sampling method for diffusion language models that enables parallel token generation while maintaining distributional fidelity, bridging the speed-accuracy trade-off in diffusion-based text generation.

Abstract: In autoregressive language models, each token is sampled by conditioning on all the past tokens; the overall string has thus been sampled from the correct underlying joint distribution represented by the model. In contrast, masked diffusion language models generate text by unmasking tokens out of order and potentially in parallel. Generating an overall string sampled from the correct underlying joint distribution would (again) require exactly one token unmasking in every full-model forward pass. The more tokens unmasked in parallel, the further away the string is from the true joint; this can be seen in the resulting drop in accuracy (but, increase in speed). In this paper we devise a way to {\em approximately} sample multiple tokens from the joint distribution in a single full-model forward pass; we do so by developing a new lightweight single-layer ``sampler" on top of an existing large diffusion LM. One forward pass of the full model can now be followed by multiple forward passes of only this sampler layer, to yield multiple unmasked tokens. Our sampler is trained to mimic exact joint sampling from the (frozen) full model. We show the effectiveness of our approximate joint sampling for both pretrained-only (Dream-7B-Base, Llada-7B-Base) and instruction-tuned (Dream-7B-Instruct, Dream-7B-Coder) models on language modeling and math & coding tasks. When four tokens are unmasked for each full-model denoising step, our sampling algorithm achieves a MAUVE score of 0.87 (vs marginal baseline of 0.31) with respect to the true joint distribution.

[243] A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks

Haorui Yu, Ramon Ruiz-Dolz, Qiufeng Yi

Main category: cs.CL

TL;DR: VLMs are evaluated for generating critiques of traditional Chinese paintings using a quantitative framework based on expert critique analysis and persona-guided prompting.

Details

Motivation: To systematically assess the capabilities of current Visual Language Models (VLMs) in generating art critiques for traditional Chinese painting, which requires complex semantic understanding and cultural context.

Method: Developed a quantitative framework by extracting multi-dimensional evaluative features from human expert critiques using zero-shot classification, defined critic personas, and evaluated VLMs (Llama, Qwen, Gemini) using persona-guided prompting.

Result: Revealed current performance levels, strengths, and areas for improvement of VLMs in art critique generation, showing their potential and limitations in complex semantic understanding tasks.

Conclusion: VLMs show promise but have limitations in art critique generation; the framework provides insights for improving multimodal models in complex cultural and semantic understanding tasks.

Abstract: This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM’s ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: https://github.com/yha9806/VULCA-EMNLP2025.

[244] One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient

Rui Ming, Haoyuan Wu, Shoubo Hu, Zhuolun He, Bei Yu

Main category: cs.CL

TL;DR: OTR (one-token rollout) is a novel fine-tuning algorithm that combines SFT with policy gradient by treating each token generation as a single-step RL trajectory, using on-policy token sampling to improve generalization.

Details

Motivation: SFT struggles with generalization compared to RL, likely due to using fixed off-policy data vs RL's on-policy sampling. The authors aim to bridge this gap by bringing on-policy learning benefits to SFT.

Method: OTR reframes autoregressive learning as single-step RL: at each token generation step, samples multiple candidate tokens from current policy, uses ground-truth token as reward signal, applies policy gradient to update model with on-policy learning at token level.

Result: OTR consistently outperforms standard SFT across diverse benchmarks including mathematical reasoning, code generation, and general domain reasoning, demonstrating improved generalization.

Conclusion: OTR establishes that on-policy nature of data is critical for generalization, offering a practical alternative for fine-tuning LLMs that combines benefits of RL with SFT efficiency.

Abstract: Supervised fine-tuning (SFT) is the predominant method for adapting large language models (LLMs), yet it often struggles with generalization compared to reinforcement learning (RL). In this work, we posit that this performance disparity stems not just from the loss function, but from a more fundamental difference: SFT learns from a fixed, pre-collected dataset, whereas RL utilizes on-policy data sampled from the current policy. Building on this hypothesis, we introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. At each step, it performs a Monte Carlo ``rollout’’ by sampling multiple candidate tokens from the current policy’s distribution. The ground-truth token from the supervised data is then used to provide a reward signal to these samples. Guided by policy gradient, our algorithm repurposes static, off-policy supervised data into a dynamic, on-policy signal at the token level, capturing the generalization benefits of on-policy learning while bypassing the costly overhead of full sentence generation. Through extensive experiments on a diverse suite of challenging benchmarks spanning mathematical reasoning, code generation, and general domain reasoning, we demonstrate that OTR consistently outperforms standard SFT. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs and provide compelling evidence that the on-policy nature of data is a critical driver of generalization, offering a promising new direction for fine-tuning LLMs.

[245] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees

Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi

Main category: cs.CL

TL;DR: AdaDetectGPT is a novel classifier that improves LLM text detection by adaptively learning witness functions from training data, outperforming existing logits-based methods.

Details

Motivation: Existing state-of-the-art logits-based detectors for distinguishing human vs. LLM-generated text rely solely on log-probability statistics, which can be sub-optimal. The authors aim to develop a more effective detection method that goes beyond simple log-probability analysis.

Method: AdaDetectGPT introduces a novel classifier that adaptively learns a witness function from training data to enhance logits-based detectors. The method provides statistical guarantees on detection performance metrics (true positive rate, false positive rate, true negative rate, false negative rate).

Result: Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method across various dataset and LLM combinations, with improvements reaching up to 37%.

Conclusion: AdaDetectGPT represents a significant advancement in LLM text detection by adaptively learning from data rather than relying solely on log-probability statistics, offering both performance improvements and statistical guarantees.

Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state of the art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT – a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combination of datasets and LLMs, and the improvement can reach up to 37%. A python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.

[246] DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng

Main category: cs.CL

TL;DR: DRIFT is a preference learning method that uses abundant real-world user dissatisfaction signals instead of scarce explicit satisfaction feedback to train LLMs, achieving significant performance improvements over baselines.

Details

Motivation: Real-world LLM deployments generate abundant implicit user dissatisfaction signals (through refinements, corrections, preferences) while explicit satisfaction feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile as they rely on costly human annotations or assume plentiful positive responses.

Method: DRIFT (Dissatisfaction-Refined Iterative preFerence Training) anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. It uses real-world WildFeedback datasets and synthetic UltraFeedback datasets for training.

Result: DRIFT models achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming iterative DPO and SPIN. 14B models surpass GPT-4o-mini on WildBench. DRIFT also preserves exploratory capacity and yields more diverse high-reward solutions.

Conclusion: DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal (user dissatisfaction). The method avoids gradient degeneration and preserves preference margins theoretically.

Abstract: Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce \textbf{DRIFT} (\textbf{D}issatisfaction-\textbf{R}efined \textbf{I}terative pre\textbf{F}erence \textbf{T}raining), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world \textit{WildFeedback} datasets and synthetic \textit{UltraFeedback} datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids the gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.

[247] Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

Verena Blaschke, Miriam Winkler, Barbara Plank

Main category: cs.CL

TL;DR: Comparing standard-to-dialect transfer across text, speech, and cascaded systems for German dialect intent classification, finding speech-only models work best on dialect data while text-only works best on standard data.

Details

Motivation: Dialects are primarily spoken but most cross-dialectal transfer research focuses on text data, which has issues with non-standard spellings. Need to compare different modalities for dialect processing.

Method: Compare three settings: 1) text models, 2) speech models, and 3) cascaded systems (speech → automatic transcription → text model). Focus on German dialects for intent classification, releasing first dialectal audio intent classification dataset, with supporting topic classification experiments.

Result: Speech-only setup provides best results on dialect data, text-only setup works best on standard data. Cascaded systems lag behind text-only models for German but perform relatively well on dialectal data if transcription generates normalized, standard-like output.

Conclusion: Speech models are most effective for dialect processing, highlighting importance of multimodal approaches for dialect understanding. Cascaded systems can work well with proper transcription normalization.

Abstract: Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. We focus on German dialects in the context of written and spoken intent classification – releasing the first dialectal audio intent classification dataset – with supporting experiments on topic classification. The speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.

[248] WUGNECTIVES: Novel Entity Inferences of Language Models from Discourse Connectives

Daniel Brubaker, William Sheffield, Junyi Jessy Li, Kanishka Misra

Main category: cs.CL

TL;DR: A study examining whether discourse connectives can help language models learn about novel entities, introducing the WUGNECTIVES dataset to evaluate LM inferences about world knowledge.

Details

Motivation: While language models are known to use world knowledge to predict discourse connectives, this work investigates the inverse: whether discourse connectives can inform language models about the world, particularly for learning about novel entities.

Method: Created WUGNECTIVES dataset with 8,880 stimuli evaluating LMs’ inferences about novel entities in contexts where connectives link entities to attributes. Investigated 17 different LMs at various scales and training regimens, including tuning LMs to show reasoning behavior.

Result: Tuning LMs for reasoning behavior yields improvements on most connectives, but large variation exists across connective types. All models systematically struggle with concessive connectives. Overall performance varies significantly by connective type.

Conclusion: Discourse connectives can inform language models about world knowledge, but effectiveness varies by connective type. Findings enable more nuanced investigation of language cues in LMs and highlight challenges with concessive connectives.

Abstract: The role of world knowledge has been particularly crucial to predict the discourse connective that marks the discourse relation between two arguments, with language models (LMs) being generally successful at this task. We flip this premise in our work, and instead study the inverse problem of understanding whether discourse connectives can inform LMs about the world. To this end, we present WUGNECTIVES, a dataset of 8,880 stimuli that evaluates LMs’ inferences about novel entities in contexts where connectives link the entities to particular attributes. On investigating 17 different LMs at various scales, and training regimens, we found that tuning an LM to show reasoning behavior yields noteworthy improvements on most connectives. At the same time, there was a large variation in LMs’ overall performance across connective type, with all models systematically struggling on connectives that express a concessive meaning. Our findings pave the way for more nuanced investigations into the functional role of language cues as captured by LMs. We release WUGNECTIVES at https://github.com/kanishkamisra/wugnectives

[249] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Chuheng Zhang, Min Xu

Main category: cs.CL

TL;DR: CGPO uses confidence signals to identify model uncertainty points and applies self-generated, non-human-like reasoning guidance to improve LLM reasoning without human bias.

Details

Motivation: Current LLM reasoning optimization methods introduce human-like reasoning bias and depend on human/higher-model annotations, limiting exploration of alternative reasoning paths and constraining performance. Observations show models' first errors often occur after lowest-confidence points, suggesting better supervision at uncertainty points.

Method: Confidence-Guided Reasoning Path Preference Optimization (CGPO) leverages confidence signals to identify points of maximal uncertainty in model reasoning, then applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift without human annotation dependency.

Result: Experiments across diverse models on code and mathematical reasoning tasks show CGPO with small-model-generated data achieves better performance than approaches using strong-model or human-annotated data with same training data amount.

Conclusion: CGPO effectively improves LLM reasoning by targeting uncertainty points with self-generated guidance, overcoming human-like reasoning bias and annotation dependency while achieving superior performance.

Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model’s first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model’s reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model can achieve better performance in most cases compared with approaches using data generated by a strong model or human-annotated.

[250] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

Main category: cs.CL

TL;DR: SAGE is a user simulation framework for multi-turn agent evaluation that integrates business knowledge (customer profiles, product catalogs, FAQs) to create more realistic and diverse interactions, identifying more agent errors than generic approaches.

Details

Motivation: Current multi-turn interactive agent evaluation relies on human assessment or generic user simulations that lack domain-specific realism, failing to capture realistic customer behavior and business context.

Method: SAGE integrates top-down business logic (ideal customer profiles) and bottom-up business infrastructure knowledge (product catalogs, FAQs, knowledge bases) to simulate realistic user interactions grounded in specific business contexts.

Result: SAGE produces more realistic and diverse interactions while identifying up to 33% more agent errors compared to generic approaches, demonstrating effectiveness for bug-finding and iterative agent improvement.

Conclusion: Domain-specific user simulation incorporating business knowledge significantly improves multi-turn agent evaluation by generating more realistic interactions and better identifying agent weaknesses.

Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

[251] Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning

Minsik Choi, Hyegang Son, Changhoon Kim, Young Geun Kim

Main category: cs.CL

TL;DR: HIES: A novel pruning criterion combining head importance scores with attention entropy for more effective transformer model compression while maintaining accuracy and stability.

Details

Motivation: Transformer models face efficiency challenges in inference and deployment due to their multi-layer, multi-head architecture. Existing gradient-based pruning methods like Head Importance Scores (HIS) have limitations as they only capture gradient-driven contributions and overlook attention pattern diversity.

Method: Proposes HIES (Head Importance-Entropy Score) which integrates head importance scores with attention entropy to provide complementary evidence on per-head contribution for more effective pruning decisions.

Result: HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing accuracy or stability.

Conclusion: The HIES criterion provides a more comprehensive approach to transformer pruning by considering both gradient importance and attention pattern diversity, leading to better model compression with maintained performance.

Abstract: Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics-multiple layers and attention heads-introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for interpretability, efficiency, and ability to identify redundant heads. However, HIS alone has limitations as it captures only the gradient-driven contribution, overlooking the diversity of attention patterns. To overcome these limitations, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to 15.2% improvement in model quality and 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.

[252] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Fabian Wenz, Omar Bouattour, Devin Yang, Justin Choi, Cecil Gregg, Nesime Tatbul, Çağatay Demiralp

Main category: cs.CL

TL;DR: BenchPress: A human-in-the-loop system using LLMs and RAG to accelerate creation of domain-specific text-to-SQL benchmarks from enterprise SQL logs.

Details

Motivation: Existing text-to-SQL benchmarks focus on public datasets, but LLMs perform poorly on private enterprise data. Creating enterprise benchmarks is costly due to manual annotation by database experts.

Method: Given SQL queries, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts for accuracy and domain alignment.

Result: Evaluation on annotated enterprise SQL logs shows LLM-assisted annotation drastically reduces time and effort while maintaining high quality. Human verification combined with LLM suggestions enhances annotation accuracy and benchmark reliability.

Conclusion: BenchPress streamlines creation of custom text-to-SQL benchmarks for domain-specific workloads, providing researchers/practitioners with better evaluation tools for enterprise applications.

Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at https://github.com/fabian-wenz/enterprise-txt2sql and is also accessible on our website at http://dsg-mcgraw.csail.mit.edu:5000.

[253] Midtraining Bridges Pretraining and Posttraining Distributions

Emmy Liu, Graham Neubig, Chenyan Xiong

Main category: cs.CL

TL;DR: Midtraining (mixing specialized data with general pretraining data) functions as distributional bridging, providing better initialization for posttraining, with benefits largest for domains distant from general data like code and math.

Details

Motivation: To understand why midtraining (intermediate training phase mixing specialized and general data) is effective in language model development, as it's widely used but poorly understood.

Method: Conducted controlled pretraining experiments analyzing midtraining benefits across domains, focusing on code and math as distant domains from general pretraining data. Investigated starting time and mixture weight interactions using code as case study.

Result: Midtraining benefits are largest for domains distant from general pretraining data (code, math), scales with proximity advantage to target distribution, outperforms continued pretraining on specialized data alone, and shows strong interaction between introduction time and mixture weight.

Conclusion: Midtraining serves as distributional bridging for better posttraining initialization, with early introduction of specialized data amenable to high mixture weights while late introduction requires lower ones, suggesting distributional transitions between training phases benefit from similar bridging strategies.

Abstract: Midtraining, the practice of mixing specialized data with more general pretraining data in an intermediate training phase, has become widespread in language model development, yet there is little understanding of what makes it effective. We propose that midtraining functions as distributional bridging by providing better initialization for posttraining. We conduct controlled pretraining experiments, and find that midtraining benefits are largest for domains distant from general pretraining data, such as code and math, and scale with the proximity advantage the midtraining data provides toward the target distribution. In these domains, midtraining consistently outperforms continued pretraining on specialized data alone both in-domain and in terms of mitigating forgetting. We further conduct an investigation on the starting time and mixture weight of midtraining data, using code as a case study, and find that time of introduction and mixture weight interact strongly such that early introduction of specialized data is amenable to high mixture weights, while late introduction requires lower ones. This suggests that late introduction of specialized data outside a plasticity window cannot be compensated for by increasing data mixtures later in training. Beyond midtraining itself, this suggests that distributional transitions between any training phases may benefit from similar bridging strategies.

[254] Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety

Vamshi Krishna Bonagiri, Ponnurangam Kumaragurum, Khanh Nguyen, Benjamin Plaut

Main category: cs.CL

TL;DR: LLM agents can improve safety by quitting when uncertain, with explicit quit instructions providing +0.39 safety improvement while maintaining helpfulness.

Details

Motivation: As LLM agents operate in complex real-world environments with tool access, uncertainties compound leading to severe risks beyond traditional text generation failures. There's a need for safety mechanisms in multi-turn agentic scenarios.

Method: Propose “quitting” as a behavioral mechanism for LLM agents to withdraw when lacking confidence. Use ToolEmu framework to systematically evaluate quitting behavior across 12 state-of-the-art LLMs with explicit quit instructions.

Result: Agents with explicit quit instructions improve safety by +0.39 average on 0-3 scale (+0.64 for proprietary models) while maintaining negligible average decrease of -0.03 in helpfulness. Shows favorable safety-helpfulness trade-off.

Conclusion: Explicit quit instructions are a highly effective, immediately deployable safety mechanism for LLM agents, establishing quitting as an effective first-line defense for autonomous agents in high-stakes applications.

Abstract: As Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences, their safety becomes critical. While uncertainty quantification is well-studied for single-turn tasks, multi-turn agentic scenarios with real-world tool access present unique challenges where uncertainties and ambiguities compound, leading to severe or catastrophic risks beyond traditional text generation failures. We propose using “quitting” as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence. Leveraging the ToolEmu framework, we conduct a systematic evaluation of quitting behavior across 12 state-of-the-art LLMs. Our results demonstrate a highly favorable safety-helpfulness trade-off: agents prompted to quit with explicit instructions improve safety by an average of +0.39 on a 0-3 scale across all models (+0.64 for proprietary models), while maintaining a negligible average decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding explicit quit instructions proves to be a highly effective safety mechanism that can immediately be deployed in existing agent systems, and establishes quitting as an effective first-line defense mechanism for autonomous agents in high-stakes applications.

[255] Are Large Language Models Sensitive to the Motives Behind Communication?

Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths

Main category: cs.CL

TL;DR: LLMs show basic sensitivity to human motivations in controlled experiments but struggle with real-world scenarios like online ads; simple interventions can improve their motivational vigilance.

Details

Motivation: Human communication is inherently motivated by intentions and incentives, and for LLMs to be effective in real-world applications, they need to critically evaluate content by factoring in source motivations, similar to how humans assess credibility.

Method: Used controlled cognitive science experiments to test LLMs’ ability to discount information from biased sources, then extended evaluation to real-world sponsored online advertisements. Implemented steering interventions to boost salience of intentions and incentives.

Result: LLMs successfully discount biased information in controlled experiments, behaving similarly to rational models. However, in real-world online ad settings, their inferences don’t closely track rational predictions due to distracting information. Simple steering interventions significantly improve correspondence with rational models.

Conclusion: LLMs possess basic sensitivity to others’ motivations but require further improvements to generalize effectively to novel real-world settings where motivational vigilance is crucial.

Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans’ intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source – for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs’ behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents’ information ecosystems. In these settings, we find that LLMs’ inferences do not track the rational models’ predictions nearly as closely – partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.

[256] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard

Main category: cs.CL

TL;DR: Sparse Tracing enables efficient interpretability of long-context LLMs by using dynamic sparse attention to analyze attention patterns in near-linear time and linear space, pruning 90-99% of token interactions while preserving model behavior.

Details

Motivation: Traditional mechanistic interpretability techniques for analyzing attention in LLMs scale quadratically with context length, requiring terabytes of memory for contexts beyond 100,000 tokens, making long-context analysis infeasible on consumer hardware.

Method: Introduces Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time O(T log T) and linear space O(T). It performs binary-search-style refinement to retain only top-k key blocks per query while preserving the model’s next-token behavior.

Result: On long chain-of-thought reasoning traces, Stream identifies thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, it preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output.

Conclusion: Sparse Tracing provides a practical drop-in tool for analyzing attention patterns and tracing information flow without requiring terabytes of caches, making long-context interpretability feasible on consumer GPUs and democratizing chain-of-thought monitoring.

Abstract: As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time $O(T \log T)$ and linear space $O(T)$, enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-$k$ key blocks per query while preserving the model’s next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at https://anonymous.4open.science/r/stream-03B8/.

[257] FreeChunker: A Cross-Granularity Chunking Framework

Wenxuan Zhang, Yuan-Hao Jiang, Yang Cao, Yonghe Wu

Main category: cs.CL

TL;DR: FreeChunker is a cross-granularity encoding framework that treats sentences as atomic units and enables flexible retrieval of arbitrary sentence combinations, improving RAG system adaptability and efficiency.

Details

Motivation: Existing RAG chunking methods use fixed-granularity paradigms with static boundary identification, limiting adaptability to diverse query requirements and requiring computational overhead for semantic boundary detection.

Method: FreeChunker transforms the traditional chunking paradigm by treating sentences as atomic units and shifting from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations, avoiding semantic boundary detection overhead.

Result: Experimental evaluation on LongBench V2 shows FreeChunker has significant advantages in both retrieval performance and time efficiency compared to existing chunking methods.

Conclusion: FreeChunker’s paradigm shift from static chunking to flexible sentence combination retrieval offers improved adaptability and efficiency for RAG systems.

Abstract: Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly avoids the computational overhead required for semantic boundary detection, but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker possesses significant advantages in both retrieval performance and time efficiency compared to existing chunking methods. The pre-trained models and codes are available at https://github.com/mazehart/FreeChunker.

[258] VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations

Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang, Zhaoyang Yu, Jinlin Wang, Sirui Hong, Bang Liu, Chenglin Wu, Yuyu Luo

Main category: cs.CL

TL;DR: VisJudge-Bench is the first comprehensive benchmark for evaluating multimodal LLMs’ capabilities in assessing visualization aesthetics and quality, revealing significant gaps between current MLLMs and human experts, with VisJudge model proposed to address these limitations.

Details

Motivation: Evaluating visualization quality is challenging as it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. While MLLMs show promise in natural image aesthetic assessment, no systematic benchmark exists for measuring their capabilities in evaluating visualizations.

Method: Created VisJudge-Bench with 3,090 expert-annotated samples from real-world scenarios covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Proposed VisJudge, a model specifically designed for visualization aesthetics and quality assessment.

Result: Advanced MLLMs (including GPT-5) show significant gaps compared to human experts with MAE of 0.553 and correlation of only 0.428. VisJudge reduces MAE to 0.421 (23.9% reduction) and increases consistency to 0.687 (60.5% improvement).

Conclusion: Current MLLMs are inadequate for visualization quality assessment, but specialized models like VisJudge can significantly narrow the gap with human judgment. The benchmark enables systematic evaluation of MLLMs’ capabilities in this domain.

Abstract: Visualization, a domain-specific yet widely used form of imagery, is an effective way to turn complex datasets into intuitive insights, and its value depends on whether data are faithfully represented, clearly communicated, and aesthetically designed. However, evaluating visualization quality is challenging: unlike natural images, it requires simultaneous judgment across data encoding accuracy, information expressiveness, and visual aesthetics. Although multimodal large language models (MLLMs) have shown promising performance in aesthetic assessment of natural images, no systematic benchmark exists for measuring their capabilities in evaluating visualizations. To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs’ performance in assessing visualization aesthetics and quality. It contains 3,090 expert-annotated samples from real-world scenarios, covering single visualizations, multiple visualizations, and dashboards across 32 chart types. Systematic testing on this benchmark reveals that even the most advanced MLLMs (such as GPT-5) still exhibit significant gaps compared to human experts in judgment, with a Mean Absolute Error (MAE) of 0.553 and a correlation with human ratings of only 0.428. To address this issue, we propose VisJudge, a model specifically designed for visualization aesthetics and quality assessment. Experimental results demonstrate that VisJudge significantly narrows the gap with human judgment, reducing the MAE to 0.421 (a 23.9% reduction) and increasing the consistency with human experts to 0.687 (a 60.5% improvement) compared to GPT-5. The benchmark is available at https://github.com/HKUSTDial/VisJudgeBench.

[259] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers

Main category: cs.CL

TL;DR: DEBATE benchmark evaluates authenticity of opinion dynamics in multi-agent LLM role-playing agents using large-scale human conversation data, showing LLM agents exhibit unnatural convergence that can be partially improved with fine-tuning.

Details

Motivation: Existing multi-agent LLM simulations for opinion dynamics often show unnatural group behaviors like premature convergence and lack empirical benchmarks to assess alignment with real human interactions.

Method: Created DEBATE benchmark with 36,383 messages from 2,832 U.S. participants across 708 groups and 107 topics. Evaluated 7 LLMs as “digital twin” role-playing agents using next-message prediction and full conversation rollout with stance-alignment and opinion-convergence metrics.

Result: Zero-shot LLM agents show strong opinion convergence relative to humans. Supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improve stance alignment and bring group-level convergence closer to human behavior, though discrepancies in opinion change remain.

Conclusion: DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent LLM role-playing agents with realistic human interactions.

Abstract: Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate “digital twin” RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.

[260] Recursive numeral systems are highly regular and easy to process

Ponrawee Prasertsom, Andrea Silvi, Jennifer Culbertson, Moa Johansson, Devdatt Dubhashi, Kenny Smith

Main category: cs.CL

TL;DR: The paper argues that regularity (systematicity of forms) is crucial for explaining linguistic systems, using recursive numeral systems as a case study. It proposes Minimum Description Length (MDL) measures of regularity and processing complexity that better distinguish natural systems from unnatural ones.

Details

Motivation: Previous work on cross-linguistic variation focuses on efficient communication but neglects the role of regularity. Existing analyses of recursive numeral systems rely on ad-hoc constraints to explain why only natural-language-like systems optimize trade-offs between lexicon size and morphosyntactic complexity.

Method: The authors apply Minimum Description Length (MDL) approach to measure both regularity and processing complexity in recursive numeral systems. They compare their MDL-based measures against previous optimization frameworks to evaluate how well they distinguish natural systems from unnatural ones.

Result: The MDL-based measures of regularity and processing complexity better capture differences between attested natural systems and theoretically possible ones, including previously identified “optimal” systems. The ad-hoc constraints from previous work naturally emerge from considerations of regularity.

Conclusion: Regularity is essential for understanding linguistic efficiency and should be incorporated in studies measuring language efficiency. The MDL approach provides a principled way to account for systematicity across sets of forms in linguistic systems.

Abstract: Much recent work has shown how cross-linguistic variation is constrained by competing pressures from efficient communication. However, little attention has been paid to the role of the systematicity of forms (regularity), a key property of natural language. Here, we demonstrate the importance of regularity in explaining the shape of linguistic systems by looking at recursive numeral systems. Previous work has argued that these systems optimise the trade-off between lexicon size and average morphosyntatic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this trade-off has proven elusive, and existing solutions rely on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Drawing on the Minimum Description Length (MDL) approach, we argue that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and theoretically possible ones, including “optimal” recursive numeral systems from previous work, and that the ad-hoc constraints naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies attempting to measure efficiency in language.

[261] Surfacing Subtle Stereotypes: A Multilingual, Debate-Oriented Evaluation of Modern LLMs

Muhammed Saeed, Muhammad Abdul-mageed, Shady Shehata

Main category: cs.CL

TL;DR: New multilingual benchmark DebateBias-8K reveals narrative biases in LLMs across 7 languages, showing models reproduce stereotypes despite safety alignment, especially in low-resource languages.

Details

Motivation: Most bias evaluations rely on English classification tasks, but real-world LLM deployment involves open-ended communication. Need multilingual benchmarks to reveal narrative biases in realistic generative settings.

Method: Created DebateBias-8K with 8,400 structured debate prompts across 4 sensitive domains (women’s rights, socioeconomic development, terrorism, religion) in 7 languages. Tested 4 flagship models (GPT-4o, Claude 3, DeepSeek, LLaMA 3), generating and automatically classifying over 100,000 responses.

Result: All models reproduce entrenched stereotypes: Arabs linked to terrorism/religion (≥95%), Africans to socioeconomic “backwardness” (up to 77%), Western groups framed as modern/progressive. Biases grow sharply in lower-resource languages, showing English alignment doesn’t generalize globally.

Conclusion: Current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. Need better multilingual fairness evaluation and culturally inclusive model alignment.

Abstract: Large language models (LLMs) are widely deployed for open-ended communication, yet most bias evaluations still rely on English, classification-style tasks. We introduce DebateBias-8K, a new multilingual, debate-style benchmark designed to reveal how narrative bias appears in realistic generative settings. Our dataset includes 8,400 structured debate prompts spanning four sensitive domains: women’s rights, socioeconomic development, terrorism, and religion, across seven languages ranging from high-resource (English, Chinese) to low-resource (Swahili, Nigerian Pidgin). Using four flagship models (GPT-4o, Claude 3, DeepSeek, and LLaMA 3), we generate and automatically classify over 100,000 responses. Results show that all models reproduce entrenched stereotypes despite safety alignment: Arabs are overwhelmingly linked to terrorism and religion (>=95%), Africans to socioeconomic “backwardness” (up to <=77%), and Western groups are consistently framed as modern or progressive. Biases grow sharply in lower-resource languages, revealing that alignment trained primarily in English does not generalize globally. Our findings highlight a persistent divide in multilingual fairness: current alignment methods reduce explicit toxicity but fail to prevent biased outputs in open-ended contexts. We release our DebateBias-8K benchmark and analysis framework to support the next generation of multilingual bias evaluation and safer, culturally inclusive model alignment.

[262] RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

Xinyuan Li, Murong Xu, Wenbiao Tao, Hanlun Zhu, Yike Zhao, Jipeng Zhang, Yunshi Lan

Main category: cs.CL

TL;DR: RIDE is an adversarial question-rewriting framework that uses Item Response Theory to generate more challenging mathematical problems for evaluating LLMs’ true reasoning abilities.

Details

Motivation: Current LLM evaluations for mathematical reasoning may be inflated by data leakage or pattern matching rather than genuine reasoning. Existing perturbation methods often create ill-posed questions and lack systematic difficulty measurement.

Method: Uses Item Response Theory to measure question difficulty, employs 35 LLMs as simulated students to build a difficulty ranker, then uses reinforcement learning with the ranker as reward signal to guide a question-rewriting model to create challenging variations.

Result: Applied to competition-level math benchmarks, RIDE generated perturbed versions that degraded advanced LLM performance by average 21.73% across 26 models, exposing limited robustness in mathematical reasoning.

Conclusion: RIDE provides a valid adversarial evaluation approach that reveals LLMs’ limited robustness in mathematical reasoning beyond superficial pattern matching.

Abstract: Large language models (LLMs) achieve high performance on mathematical reasoning, but these results can be inflated by training data leakage or superficial pattern matching rather than genuine reasoning. To this end, an adversarial perturbation-based evaluation is needed to measure true mathematical reasoning ability. Current rule-based perturbation methods often generate ill-posed questions and impede the systematic evaluation of question difficulty and the evolution of benchmarks. To bridge this gap, we propose RIDE, a novel adversarial question-rewriting framework that leverages Item Response Theory (IRT) to rigorously measure question difficulty and to generate intrinsically more challenging, well-posed variations of mathematical problems. We employ 35 LLMs to simulate students and build a difficulty ranker from their responses. This ranker provides a reward signal during reinforcement learning and guides a question-rewriting model to reformulate existing questions across difficulty levels. Applying RIDE to competition-level mathematical benchmarks yields perturbed versions that degrade advanced LLM performance, with experiments showing an average 21.73% drop across 26 models, thereby exposing limited robustness in mathematical reasoning and confirming the validity of our evaluation approach.

[263] Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

Main category: cs.CL

TL;DR: ATLAS is an adaptive testing framework using Item Response Theory to evaluate LLMs with far fewer benchmark items while maintaining precision.

Details

Motivation: Current LLM evaluation requires thousands of benchmark items, making it expensive, slow, and impractical at scale. Existing methods treat all items as equally informative despite variation in difficulty and discrimination.

Method: ATLAS uses Item Response Theory (IRT) with Fisher information-guided item selection to adaptively choose the most informative test items for estimating model ability, reducing required items by up to 90%.

Result: ATLAS reduces items needed by up to 90% while maintaining precision (e.g., 41 items vs 5,600 on HellaSwag with 0.157 MAE). Reconstructed accuracies match raw accuracies, and ability estimates provide finer discrimination among models with identical accuracies.

Conclusion: ATLAS provides an efficient, precise alternative to traditional LLM evaluation, enabling scalable assessment while revealing meaningful performance differences that accuracy metrics miss.

Abstract: Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets, treating all items as equally informative despite substantial variation in difficulty and discrimination. We introduce ATLAS, an adaptive testing framework based on Item Response Theory (IRT) that estimates model ability using Fisher information-guided item selection. ATLAS reduces the number of required items by up to 90% while maintaining measurement precision. For instance, it matches whole-bank ability estimates using only 41 items (0.157 MAE) on HellaSwag (5,600 items). We further reconstruct accuracy from ATLAS’s ability estimates and find that reconstructed accuracies closely match raw accuracies across all five benchmarks, indicating that ability $θ$ preserves the global performance structure. At the same time, $θ$ provides finer discrimination within accuracy-equivalent models: among more than 3,000 evaluated models, 23-31% shift by more than 10 rank positions, and models with identical accuracies receive meaningfully different ability estimates. Code and calibrated item banks are available at https://github.com/Peiyu-Georgia-Li/ATLAS.git.

[264] TabRAG: Improving Tabular Document Question Answering for Retrieval Augmented Generation via Structured Representations

Jacob Si, Mike Qu, Michelle Lee, Marek Rei, Yingzhen Li

Main category: cs.CL

TL;DR: TabRAG is a parsing-based RAG framework that improves tabular document question answering by preserving 2D structural semantics through layout segmentation and hierarchical table extraction using vision language models.

Details

Motivation: Traditional RAG approaches for text documents fail on tabular documents because standard parsing techniques lose the two-dimensional structural semantics crucial for cell interpretation, leading to poor question answering performance.

Method: TabRAG uses layout segmentation to decompose documents into components, then employs a vision language model to parse tables into hierarchical structured representations, enhanced by a self-generated in-context learning module to handle various table styles and formats.

Result: TabRAG outperforms existing popular parsing techniques across a broad suite of evaluation and ablation benchmarks for tabular document question answering.

Conclusion: The framework effectively preserves tabular structural semantics and enables more accurate question answering on tabular documents through structured representations and vision-language integration.

Abstract: Incorporating external knowledge bases in traditional retrieval-augmented generation (RAG) relies on parsing the document, followed by querying a language model with the parsed information via in-context learning. While effective for text-based documents, question answering on tabular documents often fails to generate plausible responses. Standard parsing techniques lose the two-dimensional structural semantics critical for cell interpretation. In this work, we present TabRAG, a parsing-based RAG framework designed to improve tabular document question answering via structured representations. Our framework consists of layout segmentation that decomposes the document inputs into a series of components, enabling fine-grained extraction. Subsequently, a vision language model parses and extracts the document tables into a hierarchically structured representation. In order to cater various table styles and formats, we integrate a self-generated in-context learning module that guides the table extraction process. Experimental results demonstrate that TabRAG outperforms existing popular parsing techniques across a broad suite of evaluation and ablation benchmarks. Code is available at: https://github.com/jacobyhsi/TabRAG.

[265] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard

Main category: cs.CL

TL;DR: Survey paper analyzing LLM-driven scientific ideation methods, organizing them into five families and evaluating them through cognitive creativity frameworks to understand trade-offs between novelty and scientific validity.

Details

Motivation: Scientific idea generation requires balancing novelty and scientific soundness, which is challenging to automate. While LLMs can generate plausible scientific ideas, their creative capabilities and limits are poorly understood, necessitating a structured analysis of methods and their trade-offs.

Method: The survey organizes LLM-driven scientific ideation methods into five families: 1) External knowledge augmentation, 2) Prompt-based distributional steering, 3) Inference-time scaling, 4) Multi-agent collaboration, and 5) Parameter-level adaptation. It analyzes these methods using two creativity frameworks: Boden taxonomy (for novelty levels) and Rhodes 4Ps framework (for creativity aspects).

Result: The survey provides a structured synthesis of existing methods, clarifies the evaluation landscape for scientific ideation, and identifies key challenges in achieving reliable and systematic LLM-based scientific discovery through systematic analysis of how different approaches trade off novelty and scientific validity.

Conclusion: The paper establishes a framework for understanding LLM-based scientific creativity, highlighting that different methodological families emphasize different aspects of creativity and trade-offs between novelty and validity, providing guidance for future research in automated scientific discovery.

Abstract: Scientific idea generation is central to discovery, requiring the joint satisfaction of novelty and scientific soundness. Unlike standard reasoning or general creative generation, scientific ideation is inherently open-ended and multi-objective, making its automation particularly challenging. Recent advances in large language models (LLMs) have enabled the generation of coherent and plausible scientific ideas, yet the nature and limits of their creative capabilities remain poorly understood. This survey provides a structured synthesis of methods for LLM-driven scientific ideation, focusing on how different approaches trade off novelty and scientific validity. We organize existing methods into five complementary families: External knowledge augmentation, Prompt-based distributional steering, Inference-time scaling, Multi-agent collaboration, and Parameter-level adaptation. To interpret their contributions, we adopt two complementary creativity frameworks: Boden taxonomy to characterize the expected level of creative novelty, and Rhodes 4Ps framework to analyze the aspects or sources of creativity emphasized by each method. By aligning methodological developments with cognitive creativity frameworks, this survey clarifies the evaluation landscape and identifies key challenges and directions for reliable and systematic LLM-based scientific discovery.

[266] Your Latent Reasoning is Secretly Policy Improvement Operator

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

Main category: cs.CL

TL;DR: Latent reasoning in small models can be formalized as classifier-free guidance and policy improvement, enabling optimization to reduce dead compute steps by 18x while maintaining performance.

Details

Motivation: Small models with latent recursion show promise on complex reasoning tasks but underperform compared to one-pass models with equivalent depth, suggesting many recursive steps are ineffective "dead compute." The paper aims to understand when latent reasoning improves performance versus when it wastes computation.

Method: Formalizes latent reasoning as classifier-free guidance and policy improvement algorithms. Proposes training schemes from reinforcement learning and diffusion methods for latent reasoning models. Tests modifications on Tiny Recursive Model to optimize recursive steps.

Result: With proposed modifications, achieves 18x reduction in total forward passes while maintaining performance, effectively avoiding dead compute steps. Shows policy improvement perspective explains model behavior and enables optimization.

Conclusion: Latent reasoning can be understood through policy improvement framework, enabling optimization to reduce computational waste. This perspective provides insights for improving recursive models and explains when recursive steps effectively contribute to reasoning.

Abstract: Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one pass models with the same feed forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we analyze the algorithms that latent reasoning provides answer to this question. We show that latent reasoning can be formalized as a classifier free guidance and policy improvement algorithm. Building on these insights, we propose to use a training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

[267] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

TL;DR: PIRA is a training paradigm for reward models that improves alignment of LLMs with human preferences by leveraging LLMs’ instruction-following capabilities, using diverse preference-task instructions, and stabilizing reward estimation through dropout averaging.

Details

Motivation: Existing reward models for LLM alignment have two key limitations: discriminative models require large annotated data and cannot leverage LLMs' instruction-following capabilities, while all reward models are prone to reward overoptimization where LLMs exploit weaknesses in reward functions rather than improving true alignment.

Method: PIRA integrates three strategies: 1) Reformulating question-answer pairs into preference-task instructions to leverage LLMs’ preference instruction-following capability, 2) Averaging rewards from diverse preference-task instructions per sample to mitigate task-specific bias and enhance robustness, and 3) Averaging outputs from the value head under different dropout rates to stabilize reward estimation.

Result: Experiments on public datasets show that PIRA improves performance considerably, enhances generalization, and effectively mitigates reward overoptimization compared to existing approaches.

Conclusion: PIRA provides an effective training paradigm for reward models that addresses key limitations of existing approaches by better leveraging LLMs’ capabilities while reducing vulnerability to reward overoptimization.

Abstract: Reward models are pivotal for aligning Large Language Models (LLMs) with human preferences. Existing approaches face two key limitations: Discriminative reward models require large-scale annotated data, as they cannot exploit the preference instruction-following capability of LLMs available to generative reward models. Moreover, reward models are particularly prone to reward overoptimization, where LLMs exploit weaknesses in the reward function instead of improving true alignment. We introduce \textbf{PIRA}, a training paradigm that integrates three complementary strategies to address these challenges: (1) reformulating question-answer pairs into preference-task instructions to explicitly leverage LLMs’ preference instruction-following capability, (2) averaging the rewards aggregated from diverse preference-task instructions for each sample, which mitigates task-specific bias and enhances robustness across evaluation perspectives, and (3) averaging outputs from the value head under different dropout rates to stabilize reward estimation. Experiments on public datasets show that PIRA improves performance considerably, enhances generalization, and effectively mitigates reward overoptimization.

[268] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution

Baoliang Tian, Yuxuan Si, Jilong Wang, Lingyao Li, Zhongyuan Bao, Zineng Zhou, Tao Wang, Sixu Li, Ziyao Xu, Mingze Wang, Zhouzhuo Zhang, Zhihao Wang, Yike Yun, Ke Tian, Ning Yang, Minghui Qiu

Main category: cs.CL

TL;DR: CrossCheck-Bench is a diagnostic benchmark for evaluating multimodal contradiction detection, revealing VLMs struggle with cross-modal inconsistency resolution despite good surface-level alignment.

Details

Motivation: Current multimodal LLMs are primarily trained on aligned image-text pairs, leaving their ability to detect and resolve real-world inconsistencies largely unexplored. Visual and textual cues often conflict in open-domain applications, requiring structured reasoning beyond surface-level alignment.

Method: Introduces CrossCheck-Bench with 15k QA pairs from real-world artifacts with synthetically injected contradictions. Uses hierarchical task framework covering three reasoning complexity levels and defines seven atomic capabilities for cross-modal inconsistency resolution. Constructed through multi-stage annotation pipeline with 450+ expert hours.

Result: Evaluation of 13 state-of-the-art VLMs shows consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Models perform well on isolated entity recognition but fail at synthesizing multiple clues for conflict reasoning. Conventional prompting strategies yield marginal gains, while symbolic reasoning with grounded visual processing shows more stable improvements.

Conclusion: Highlights persistent bottleneck in multimodal reasoning and suggests new directions for building models capable of robust cross-modal verification, emphasizing the need for structured reasoning beyond surface-level alignment.

Abstract: Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.

[269] Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo, Binhong Li, Qiang Sun, Hui Liu

Main category: cs.CL

TL;DR: Reward Auditor is a hypothesis-testing framework for evaluating reward model suitability by assessing systematic vulnerabilities under real-world perturbations, moving beyond simple accuracy metrics.

Details

Motivation: Current reward model evaluation focuses only on preference perception accuracy in specific scenarios, missing critical vulnerabilities in real-world applications. The true challenge is assessing suitability - conditional reliability under real-world perturbations.

Method: Introduces Reward Auditor, a hypothesis-testing framework for RM suitability inference. It uses scientific auditing to quantify statistical significance and effect size by analyzing distribution degradation of RM preference perception confidence under real-world perturbed scenarios.

Result: The framework enables inference of both certainty and severity of RM vulnerabilities across diverse real-world scenarios, providing a more comprehensive evaluation approach.

Conclusion: Reward Auditor lays foundation for building next-generation LLM alignment systems that are verifiably safe, robust, and trustworthy by addressing the suitability dimension of reward model evaluation.

Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering “How accurate is the RM’s preference perception for given samples?”, it employs scientific auditing to answer: “Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?”. Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.

[270] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems

Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che

Main category: cs.CL

TL;DR: A framework for distinguishing objective unsolvability from subjective capability limitations in LLMs, using a dataset with logical contradictions and reinforcement learning alignment.

Details

Motivation: Current LLMs conflate objective unsolvability (inherent contradictions) with subjective capability limitations, leading to hallucinations where models confidently answer unsolvable queries. There's a need to help LLMs distinguish between these two failure modes.

Method: 1) Construct UnsolvableQA dataset using “Reverse Construction” that systematically injects logical contradictions into valid reasoning chains. 2) Develop UnsolvableRL, a reinforcement learning paradigm that balances objective unsolvability detection with calibrated confidence under capability limits.

Result: Achieves robust unsolvability detection (>85% detection rate) and boosts solvable reasoning accuracy from 43.4% to 69.4% on Qwen3-4B-Instruct. Identifies data-training interaction: strict alignment constraints cause Capability Collapse without unsolvable data, but act as regularizers for rigor when such data is included.

Conclusion: The proposed approach effectively helps LLMs distinguish between objective unsolvability and capability limitations, improving overall robustness and reducing hallucinations. The interaction between alignment constraints and unsolvable data is crucial for achieving this balance.

Abstract: Ensuring large language model (LLM) reliability requires distinguishing objective unsolvability (inherent contradictions) from subjective capability limitations (tasks exceeding model competence). Current LLMs often conflate these dimensions, leading to hallucinations in which they return confident answers to inherently unsolvable queries. To address this issue, we propose a multi-domain dataset containing both solvable and unsolvable questions, UnsolvableQA, together with an alignment framework, UnsolvableRL. First, we construct UnsolvableQA by “Reverse Construction” that systematically injects logical contradictions into otherwise valid reasoning chains. Second, we introduce UnsolvableRL, a reinforcement learning paradigm that balances objective unsolvability detection with calibrated confidence under capability limits. Empirically, our approach achieves robust unsolvability detection (>85% detection rate) and boosts solvable reasoning accuracy from 43.4% to 69.4% on Qwen3-4B-Instruct. Crucially, we identify a data-training interaction: strict alignment constraints induce Capability Collapse without unsolvable data, but act as a regularizer for rigor when such data are included, thereby improving overall robustness. Our code and data are available at https://github.com/sfasfaffa/unsolvableQA .

[271] Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Lihu Chen, Xiang Yin, Francesca Toni

Main category: cs.CL

TL;DR: Latent debate framework interprets LLM predictions by capturing hidden supporting/attacking signals within a single model during inference, providing interpretability and hallucination detection capabilities.

Details

Motivation: To understand the internal thinking process of LLMs and the causes of hallucinations, which remain key challenges in AI interpretability.

Method: Introduces latent debate framework that captures implicit internal arguments within a single model during single inference, unlike explicit multi-agent debates. Presents model- and task-agnostic conceptual framework, then instantiates it symbolically for True/False prediction tasks.

Result: Latent debate serves as a faithful structured surrogate model with highly consistent predictions with original LLM. Provides strong baseline for hallucination detection, with analysis revealing correlations between hallucinations and debate patterns (e.g., high degree of latent debates in middle layers linked to higher hallucination risk).

Conclusion: Latent debate positions as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during inference steps, offering both interpretability and practical applications for hallucination detection.

Abstract: Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike the current work of self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model that has highly consistent predictions with the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns, such as a high degree of latent debates in the middle layers is linked to a higher risk of hallucinations. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.

[272] On Group Relative Policy Optimization Collapse in Agent Search: The Lazy Likelihood-Displacement

Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li

Main category: cs.CL

TL;DR: Identifies Lazy Likelihood Displacement (LLD) as the cause of training collapse in GRPO-based tool-integrated RL, proposes likelihood-preserving regularization to stabilize training and improve performance.

Details

Motivation: Tool-integrated RL enables LLMs to perform multi-step reasoning with external tools, but GRPO methods like Search-R1 suffer from consistent training collapse despite their fast convergence and value-free formulation.

Method: Identifies Lazy Likelihood Displacement (LLD) as the core failure mechanism, characterizes its three-phase trajectory, and proposes LLDS - a likelihood-preserving regularization that activates only when response likelihood decreases and regularizes only responsible tokens.

Result: Method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training.

Conclusion: LLD is a previously overlooked bottleneck in GRPO-based tool-integrated RL, and the proposed likelihood-preserving regularization provides a practical path toward stable, scalable training of tool-integrated RL systems.

Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a likelihood-preserving regularization LLDS that activates only when a response action’s likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference. Our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements across seven benchmarks, including relative improvements of +45.2% on Qwen2.5-3B and +37.1% on Qwen2.5-7B over vanilla GRPO training. Our results establish LLD as a previously overlooked bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated RL.

[273] When Large Language Models Do Not Work: Online Incivility Prediction through Graph Neural Networks

Zihan Chen, Lanyu Yu

Main category: cs.CL

TL;DR: A Graph Neural Network framework for detecting online incivility (toxicity, aggression, personal attacks) in Wikipedia comments that outperforms LLMs by leveraging both textual content and relational structures between comments.

Details

Motivation: Online incivility is a widespread problem with significant social and psychological impacts. Existing moderation and automated detection approaches have limited accuracy and efficiency, particularly text-only LLM paradigms that ignore relational context between comments.

Method: Proposes a GNN framework where each user comment is a node, edges are defined by textual similarity between comments, and includes a dynamically adjusted attention mechanism to balance nodal (textual) and topological (structural) features during information aggregation.

Result: The proposed architecture outperforms 12 state-of-the-art LLMs across multiple metrics while requiring significantly lower inference cost, demonstrating the importance of structural context for detecting online incivility.

Conclusion: Structural context is crucial for detecting online incivility, addressing limitations of text-only LLM paradigms in behavioral prediction. The approach offers better performance with lower computational cost.

Abstract: Online incivility has emerged as a widespread and persistent problem in digital communities, imposing substantial social and psychological burdens on users. Although many platforms attempt to curb incivility through moderation and automated detection, the performance of existing approaches often remains limited in both accuracy and efficiency. To address this challenge, we propose a Graph Neural Network (GNN) framework for detecting three types of uncivil behavior (i.e., toxicity, aggression, and personal attacks) within the English Wikipedia community. Our model represents each user comment as a node, with textual similarity between comments defining the edges, allowing the network to jointly learn from both linguistic content and relational structures among comments. We also introduce a dynamically adjusted attention mechanism that adaptively balances nodal and topological features during information aggregation. Empirical evaluations demonstrate that our proposed architecture outperforms 12 state-of-the-art Large Language Models (LLMs) across multiple metrics while requiring significantly lower inference cost. These findings highlight the crucial role of structural context in detecting online incivility and address the limitations of text-only LLM paradigms in behavioral prediction. All datasets and comparative outputs will be publicly available in our repository to support further research and reproducibility.

[274] Diffusion Language Model Inference with Monte Carlo Tree Search

Zheng Huang, Kiran Ramnath, Yueyan Chen, Aosong Feng, Sangmin Woo, Balasubramaniam Srinivasan, Zhichao Xu, Kang Zhou, Shuai Wang, Haibo Ding, Lin Lee Cheong

Main category: cs.CL

TL;DR: MEDAL introduces Monte Carlo Tree Search initialization for Diffusion Language Models to improve inference-time token selection without additional training.

Details

Motivation: Existing inference methods for Diffusion Language Models use heuristics that yield suboptimal decoding paths, or require additional training to guide token selection. There's a need for a principled search mechanism that can improve generation quality at inference time without retraining.

Method: MEDAL integrates Monte Carlo Tree Search at the initialization stage of DLM inference to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This enables efficient inference-time scaling where generation quality improves with increased search budget.

Result: Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies for Diffusion Language Models.

Conclusion: MEDAL establishes a new paradigm for search-based inference in Diffusion Language Models, demonstrating that inference-time scaling with principled search mechanisms can significantly improve generation quality without additional training.

Abstract: Diffusion language models (DLMs) have recently emerged as a compelling alternative to autoregressive generation, offering parallel generation and improved global coherence. During inference, DLMs generate text by iteratively denoising masked sequences in parallel; however, determining which positions to unmask and which tokens to commit forms a large combinatorial search problem. Existing inference methods approximate this search using heuristics, which often yield suboptimal decoding paths; other approaches instead rely on additional training to guide token selection. To introduce a principled search mechanism for DLMs inference, we introduce MEDAL, an inference-time scaling framework that integrates Monte Carlo Tree SEarch initialization for Diffusion LAnguage Model inference. We employ Monte Carlo Tree Search at the initialization stage to explore promising unmasking trajectories, providing a robust starting point for subsequent refinement. This design enables efficient inference-time scaling, allowing generation quality to improve as the search budget increases, without additional training. Across multiple benchmarks, MEDAL achieves up to 22.0% improvement over existing inference strategies, establishing a new paradigm for search-based inference in DLMs.

[275] VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse

Ying Nie, Kai Han, Hongguang Li, Hang Zhou, Tianyu Guo, Enhua Wu, Xinghao Chen, Yunhe Wang

Main category: cs.CL

TL;DR: VersatileFFN is a novel feed-forward network architecture that enables flexible parameter reuse in width and depth dimensions within fixed parameter budgets, inspired by dual-process cognitive theory.

Details

Motivation: Addressing the prohibitive memory costs of scaling LLMs, existing parameter-efficient methods like pruning and quantization compress pretrained models without enhancing architectural capacity, hitting representational ceilings of base models.

Method: Proposes VersatileFFN with two adaptive pathways: width-versatile path generates mixture of sub-experts from single shared FFN (mimicking sparse expert routing without parameter increase), and depth-versatile path recursively applies same FFN for deeper processing. Difficulty-aware gating dynamically balances pathways based on token complexity.

Result: Experiments across diverse benchmarks and model scales demonstrate effectiveness of the method. Both pathways reuse same parameters, so all additional capacity comes from computation rather than memory.

Conclusion: VersatileFFN provides flexible parameter reuse approach that enhances model capacity within fixed parameter budgets, addressing memory scaling issues while maintaining computational efficiency.

Abstract: The rapid scaling of Large Language Models (LLMs) has achieved remarkable performance, but it also leads to prohibitive memory costs. Existing parameter-efficient approaches such as pruning and quantization mainly compress pretrained models without enhancing architectural capacity, thereby hitting the representational ceiling of the base model. In this work, we propose VersatileFFN, a novel feed-forward network (FFN) that enables flexible reuse of parameters in both width and depth dimensions within a fixed parameter budget. Inspired by the dual-process theory of cognition, VersatileFFN comprises two adaptive pathways: a width-versatile path that generates a mixture of sub-experts from a single shared FFN, mimicking sparse expert routing without increasing parameters, and a depth-versatile path that recursively applies the same FFN to emulate deeper processing for complex tokens. A difficulty-aware gating dynamically balances the two pathways, steering “easy” tokens through the efficient width-wise route and allocating deeper iterative refinement to “hard” tokens. Crucially, both pathways reuse the same parameters, so all additional capacity comes from computation rather than memory. Experiments across diverse benchmarks and model scales demonstrate the effectiveness of the method. The code is available at https://github.com/huawei-noah/noah-research/tree/master/VersatileFFN.

[276] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao

Main category: cs.CL

TL;DR: MemBuilder is a reinforcement learning framework that trains LLMs to build multi-dimensional memory for long-term dialogue consistency, using dense rewards and contribution-aware gradient weighting to outperform closed-source models.

Details

Motivation: Standard retrieval mechanisms fail to capture temporal evolution in long-term dialogues, and current memory-augmented systems either rely on static prompting of closed-source models or suffer from ineffective training with sparse rewards.

Method: MemBuilder uses reinforcement learning with: 1) synthetic session-level question generation for dense intermediate rewards across extended trajectories, and 2) contribution-aware gradient weighting that scales policy updates based on each memory component’s downstream impact.

Result: A 4B-parameter model trained with MemBuilder outperforms state-of-the-art closed-source baselines and shows strong generalization across long-term dialogue benchmarks.

Conclusion: MemBuilder effectively addresses sparse reward and memory attribution challenges in long-term dialogue systems, enabling smaller open-source models to achieve superior performance through structured memory construction.

Abstract: Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component’s downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.

[277] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang

Main category: cs.CL

TL;DR: AutoMonitor-Bench is the first benchmark for evaluating LLM-based misbehavior monitors across diverse tasks and failure modes, with 3,010 annotated samples and metrics for detection reliability.

Details

Motivation: There's a need for systematic evaluation of LLM-based misbehavior monitors to assess their reliability across different tasks and failure modes, as current approaches lack comprehensive benchmarks.

Method: Created AutoMonitor-Bench with 3,010 annotated samples spanning QA, code generation, and reasoning tasks; evaluated 22 LLMs using Miss Rate and False Alarm Rate metrics; fine-tuned Qwen3-4B-Instruction on 153,581 training samples.

Result: Found substantial variability in monitoring performance across models, consistent trade-off between MR and FAR, and limited improvement from training on known misbehavior datasets for detecting unseen implicit misbehaviors.

Conclusion: Reliable, scalable misbehavior monitoring remains challenging, highlighting inherent safety-utility tension and motivating future work on task-aware monitor design and training strategies.

Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.

[278] Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Main category: cs.CL

TL;DR: Prompt caching for LLM agents reduces API costs by 41-80% and improves latency by 13-31% across major providers, with strategic cache block control outperforming naive full-context caching.

Details

Motivation: While LLM providers offer prompt caching to reduce costs and latency, its benefits for agentic workloads with extensive tool calling remain underexplored, with no prior work quantifying savings or comparing caching strategies for multi-turn agentic tasks.

Method: Comprehensive evaluation across three major LLM providers (OpenAI, Anthropic, Google) comparing three caching strategies: full context caching, system prompt only caching, and caching excluding dynamic tool results. Evaluation on DeepResearch Bench with 500+ agent sessions and 10,000-token system prompts, measuring API cost and time to first token.

Result: Prompt caching reduces API costs by 41-80% and improves TTFT by 13-31% across providers. Strategic cache block control (placing dynamic content at end, avoiding dynamic function calling, excluding tool results) provides more consistent benefits than naive full-context caching, which can increase latency. Linear cost and TTFT benefits observed across prompt sizes (500-50k tokens) and tool call counts (3-50).

Conclusion: Prompt caching offers substantial cost and latency benefits for agentic systems, but requires strategic implementation with cache block control rather than naive approaches. Provider-specific strategy discrepancies exist, requiring nuanced guidance for production deployment.

Abstract: Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearch Bench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 41-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. An ablation study across prompt sizes (500-50,000 tokens) and tool call counts (3-50) demonstrates universal linear cost and TTFT benefits, after the provider caching token minimum, and reveal provider-specific strategy discrepancies across variants. We provide nuanced discussion and guidance for implementing prompt caching in production agentic systems.

[279] X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests

Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang

Main category: cs.CL

TL;DR: X-Coder model series trained on high-quality synthetic data achieves state-of-the-art performance on competitive programming benchmarks, outperforming larger models trained on real data.

Details

Motivation: Current Code LLMs for competitive programming rely heavily on finite real-world data, raising concerns about scalability and data contamination. The paper investigates whether expert-level reasoning can be achieved using fully synthetic data.

Method: Systematically investigates factors governing synthetic data quality, advances feature-based synthesis via domain-specific evolution and dual-verification strategy. Trains X-Coder model series under SFT-then-RL paradigm using high-quality synthetic data.

Result: X-Coder-7B shows significant performance gains on LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Provides insights into synthetic data scaling, domain-adapted feature evolution, and code-centric reinforcement.

Conclusion: Expert-level reasoning performance in competitive programming can be achieved using high-quality synthetic data, addressing scalability and contamination concerns of real-world data dependency.

Abstract: Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they heavily rely on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: Can we elevate models to expert-level reasoning performance using fully synthetic data? In response, we first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows significant performance gains on the challenging LiveCodeBench v5 (62.9% avg@8) and v6 (55.8% avg@8), outperforming larger models trained on real-world data. Extensive analysis distills valuable insights into synthetic data scaling, the necessity of domain-adapted feature evolution, and code-centric reinforcement.

[280] MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education

Dongsuk Jang, Ziyao Shangguan, Kyle Tegtmeyer, Anurag Gupta, Jan Czerminski, Sophie Chheang, Arman Cohan

Main category: cs.CL

TL;DR: MedTutor is an AI system that generates evidence-based educational content and multiple-choice questions from clinical case reports using a hybrid RAG pipeline with medical textbooks and research literature.

Details

Motivation: Medical residents face challenges in interpreting complex case reports and quickly acquiring accurate medical knowledge from reliable sources. Current methods of studying case reports and discussing with peers/mentors are time-consuming for finding relevant educational materials.

Method: Uses Retrieval-Augmented Generation (RAG) pipeline with hybrid retrieval mechanism that queries local medical knowledge base and academic literature (PubMed, Semantic Scholar APIs). Retrieved evidence is filtered/ordered with reranking model, then LLM generates final educational content and questions.

Result: Three radiologists assessed outputs as high clinical/educational value. Large-scale LLM-as-a-Judge evaluation showed moderate alignment between LLM judgments and human expert assessments, highlighting need for expert oversight.

Conclusion: MedTutor effectively generates evidence-based educational content from case reports, but expert oversight remains necessary despite LLM evaluation capabilities.

Abstract: The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system’s architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model and then an LLM generates the final long-form output describing the main educational content regarding the case-report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large scale evaluation using an LLM-as-a Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLMs outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight.

[281] VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding

Haorui Yu, Ramon Ruiz-Dolz, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi

Main category: cs.CL

TL;DR: VULCA-Bench is a multicultural art-critique benchmark for evaluating VLMs’ cultural understanding beyond basic visual perception, featuring 7,410 image-critique pairs across 8 cultural traditions with Chinese-English bilingual coverage.

Details

Motivation: Existing VLM benchmarks focus too much on basic visual recognition (L1-L2 capabilities) and lack evaluation of higher-order cultural interpretation and understanding, which is crucial for models to truly comprehend art and cultural artifacts.

Method: Created a benchmark with 7,410 matched image-critique pairs spanning eight cultural traditions, using a five-layer framework (L1-L5) from Visual Perception to Philosophical Aesthetics, instantiated as 225 culture-specific dimensions with expert-written bilingual critiques.

Result: Pilot results show that higher-layer reasoning (L3-L5) is consistently more challenging for VLMs than visual and technical analysis (L1-L2), highlighting the gap in cultural understanding capabilities.

Conclusion: VULCA-Bench provides a comprehensive tool to evaluate VLMs’ cultural understanding beyond surface-level perception, revealing significant challenges in higher-order cultural interpretation that current models struggle with.

Abstract: We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models’ (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluate higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.

[282] DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

Main category: cs.CL

TL;DR: DyCP is a lightweight context management method that dynamically identifies and retrieves relevant dialogue segments for LLMs without offline memory construction, improving inference efficiency while maintaining answer quality.

Details

Motivation: LLMs increasingly handle long-form dialogues with topic shifts, but extended context windows create inference cost and latency issues. Efficient dialogue history management is needed without compromising performance.

Method: DyCP is implemented outside the LLM as a lightweight context management method that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without requiring offline memory construction or predefined topic boundaries.

Result: Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality with more selective context usage and improved inference efficiency.

Conclusion: DyCP provides an effective solution for managing dialogue context in LLMs, enabling adaptive and efficient context selection while preserving dialogue sequentiality, addressing practical constraints of inference cost and latency.

Abstract: Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, efficient management of dialogue history in practice is needed due to inference cost and latency constraints. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context while preserving the sequential nature of dialogue without predefined topic boundaries, enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks-LoCoMo, MT-Bench+, and SCM4LLMs-and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.

[283] A.X K1 Technical Report

Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon

Main category: cs.CL

TL;DR: A.X K1 is a 519B-parameter Mixture-of-Experts language model trained from scratch on 10T tokens with controllable reasoning capabilities and strong Korean performance.

Details

Motivation: To bridge the gap between reasoning capability and inference efficiency in large language models, enabling scalable deployment across diverse real-world scenarios with controllable reasoning.

Method: Leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. Uses a multi-stage data processing pipeline for corpus curation. Introduces Think-Fusion training recipe for user-controlled switching between thinking and non-thinking modes within a single unified model.

Result: A.X K1 achieves performance competitive with leading open-source models while establishing distinctive advantage in Korean-language benchmarks.

Conclusion: The paper presents a large-scale MoE language model with controllable reasoning capabilities that demonstrates strong performance, particularly in Korean language tasks.

Abstract: We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.

[284] Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: TaxoBench is a benchmark for evaluating deep research agents’ ability to retrieve essential papers and organize them into expert-like hierarchical taxonomies, using 72 LLM surveys as ground truth.

Details

Motivation: Existing benchmarks focus on writing quality or citation correctness but fail to evaluate hierarchical taxonomy organization, which is crucial for assessing whether automated research agents can match human experts in structuring literature reviews.

Method: Built from 72 highly-cited LLM surveys with expert-authored taxonomy trees (3,815 papers mapped to categories). Introduces novel metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path) to evaluate taxonomy structure at leaf and hierarchy levels.

Result: Evaluation of 7 Deep Research Agents and 12 frontier LLMs reveals dual bottlenecks: best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, best model achieves only 31.24% Adjusted Rand Index with substantial structural gaps.

Conclusion: TaxoBench provides a comprehensive benchmark for evaluating deep research agents’ retrieval and organization capabilities, revealing significant gaps between current automated systems and human expert performance in literature survey generation.

Abstract: Deep Research Agents increasingly automate survey generation, yet whether they match human experts in two core abilities remains unclear: retrieving essential papers and organizing them into expert-like taxonomies. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics fail to capture hierarchical taxonomy structure. We introduce TaxoBench, a benchmark built from 72 highly-cited LLM surveys containing expert-authored taxonomy trees with 3,815 papers mapped to paper categories as ground truth. TaxoBench evaluates both abilities: (1) retrieval, measuring whether agents retrieve expert-cited papers; and (2) organization, assessed at two levels: the leaf-level measures paper-to-category assignment, while the hierarchy-level measures taxonomy structure via novel metrics – Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). TaxoBench supports two evaluation modes: Deep Research tests end-to-end capability given only a topic, while Bottom-Up provides the expert paper set to isolate organization ability. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: the best agent retrieves only 20.92% of expert-cited papers, and even with perfect input, the best model achieves only 31.24% ARI with substantial structural gaps. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench

[285] Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang

Main category: cs.CL

TL;DR: SPTS is a training-free framework for efficient long-context LLM inference using selective token skipping strategies and multi-stage delayed pruning to reduce computational overhead while maintaining accuracy.

Details

Motivation: Long-context inference in LLMs incurs significant computational overhead. Existing token-oriented methods (pruning/skipping) suffer from insufficient structure optimization, outdated selection criteria, and redundancy interference, leading to suboptimal speed-accuracy trade-offs.

Method: Proposes Self-Predictive Token Skipping (SPTS) with two selective token skipping strategies: Partial Attention Probing (PAP) for multi-head attention and Low-rank Transformation Probing (LTP) for feed forward networks. Also introduces Multi-Stage Delayed Pruning (MSDP) to reallocate skipping budgets and progressively remove redundant tokens across layers.

Result: Achieves up to 2.46× speedup for prefilling and 2.29× speedup for end-to-end generation while maintaining state-of-the-art accuracy in extensive experiments.

Conclusion: SPTS effectively addresses limitations of existing token skipping methods, providing an efficient training-free framework for long-context LLM inference with optimal speed-accuracy trade-offs.

Abstract: Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference latency, yet still suffer from inherently insufficient structure optimization, outdated selection criteria, and redundancy interference, resulting in suboptimal speed-accuracy trade-off. To address these issues, we propose a novel training-free framework dubbed Self-Predictive Token Skipping (SPTS), for efficient long-context LLM inference. Specifically, motivated by probing the influence of target layers prior to skipping, we design two selective token skipping strategies for typical structures, including Partial Attention Probing (PAP) for multi-head attention and Low-rank Transformation Probing (LTP) for feed forward network. The former selects informative tokens via partial forward attention computation, while the latter constructs a low-rank proxy network to predict token transformations. In addition, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates skipping budgets and progressively removes redundant tokens across layers. Extensive experiments display the effectiveness of our method, achieving up to 2.46$\times$ and 2.29$\times$ speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art accuracy. We will release the source code upon acceptance.

[286] Trust Me, I’m an Expert: Decoding and Steering Authority Bias in Large Language Models

Priyanka Mary Mammen, Emil Joswin, Shankar Venkitachalam

Main category: cs.CL

TL;DR: Language models show systematic bias favoring endorsements from perceived experts, leading to increased susceptibility to misleading information and confidence in wrong answers as source expertise increases.

Details

Motivation: While prior research shows language models are influenced by suggestions and endorsements, the impact of endorsement source credibility (particularly expertise levels) remains underexplored. The study aims to investigate whether models exhibit systematic bias based on perceived expertise of endorsement providers.

Method: Evaluated 11 models across 4 datasets spanning mathematical, legal, and medical reasoning domains. Used personas representing four expertise levels per domain to test model responses to endorsements from sources with varying perceived authority.

Result: Models show increasing susceptibility to incorrect/misleading endorsements as source expertise increases. Higher-authority sources induce both accuracy degradation and increased confidence in wrong answers. The authority bias is mechanistically encoded within models, but models can be steered away from this bias to improve performance even when experts give misleading endorsements.

Conclusion: Language models exhibit systematic authority bias that makes them vulnerable to expert misinformation. This bias is embedded in model mechanisms but can be mitigated through steering techniques, highlighting important safety considerations for reasoning tasks.

Abstract: Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.

[287] Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: RSR metric balances alignment and informativeness for selecting reasoning trajectories in LLM distillation, outperforming likelihood-based metrics

Details

Motivation: Stronger teacher trajectories don't always yield better student performance in reasoning distillation, highlighting need for better data-student suitability assessment beyond just likelihood alignment

Method: Proposes Rank-Surprisal Ratio (RSR) metric combining token-wise rank and negative log-likelihood to capture both alignment and informativeness of reasoning trajectories

Result: RSR strongly correlates with post-training reasoning performance (average Spearman 0.86) across 5 student models and 11 teachers, outperforming existing metrics for trajectory and teacher selection

Conclusion: RSR provides effective assessment of reasoning trajectory suitability for LLM distillation, balancing learning signal strength with behavioral alignment

Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model’s current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory’s average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

[288] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs

Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik

Main category: cs.CL

TL;DR: Evaluating LLM representativeness through multivariate correlation patterns beyond marginal distributions, showing persona prompting and demographic fine-tuning fail to capture human correlation structures from World Values Survey.

Details

Motivation: Current LLM alignment research focuses on marginal response distributions but overlooks deeper latent structures and correlation patterns that characterize real populations and cultural values theories.

Method: Proposed framework evaluates model representativeness using multivariate correlation patterns in addition to marginal distributions. Compared two steering techniques (persona prompting and demographic fine-tuning) against human responses from World Values Survey.

Result: Demographically fine-tuned model better approximates marginal distributions than persona prompting, but both techniques fail to capture gold standard correlation patterns from human survey data.

Conclusion: Representativeness is a distinct aspect of value alignment; evaluation focused only on marginals can mask structural failures and lead to overly optimistic conclusions about model capabilities.

Abstract: Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We show the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model better approximates marginal response distributions than persona prompting, both techniques fail to fully capture the gold standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.

[289] Identity, Cooperation and Framing Effects within Groups of Real and Simulated Humans

Suhong Moon, Minwoo Kang, Joseph Suh, Mustafa Safdari, John Canny

Main category: cs.CL

TL;DR: LLMs can simulate human action in social dilemmas by binding base models with narrative identities, improving fidelity over prior steering approaches and capturing contextual factors like time, framing, and participant pools.

Details

Motivation: To develop more faithful simulations of human behavior in social dilemma games by moving beyond simple persona steering to deep binding of LLMs with narrative identities, capturing both rational deliberation and identity/contextual factors that influence human action.

Method: Deep binding of base language models with extended backstories/narrative identities, using instruction-tuned models to check consistency, and conditioning models on rich contextual factors including time (study year), question framing, and participant pool characteristics.

Result: Improved simulation fidelity compared to human studies, with LLMs able to model contextual factors often omitted from experiment descriptions that hamper accurate replication.

Conclusion: LLMs with deep identity binding and contextual conditioning provide a powerful tool for exploring nuanced human behavior in social dilemmas, capturing details that affect human studies but are typically omitted from experimental descriptions.

Abstract: Humans act via a nuanced process that depends both on rational deliberation and also on identity and contextual factors. In this work, we study how large language models (LLMs) can simulate human action in the context of social dilemma games. While prior work has focused on “steering” (weak binding) of chat models to simulate personas, we analyze here how deep binding of base models with extended backstories leads to more faithful replication of identity-based behaviors. Our study has these findings: simulation fidelity vs human studies is improved by conditioning base LMs with rich context of narrative identities and checking consistency using instruction-tuned models. We show that LLMs can also model contextual factors such as time (year that a study was performed), question framing, and participant pool effects. LLMs, therefore, allow us to explore the details that affect human studies but which are often omitted from experiment descriptions, and which hamper accurate replication.

[290] When Domain Pretraining Interferes with Instruction Alignment: An Empirical Study of Adapter Merging in Medical LLMs

Junyi Zou

Main category: cs.CL

TL;DR: Adapter interference in LLMs when combining domain adaptation and instruction alignment, showing that weighted merging of medical domain PT and SFT LoRA adapters can reactivate latent thinking behavior and shift output distributions despite attempts to disable chain-of-thought.

Details

Motivation: To investigate unexpected adapter interference in safety-critical settings when combining domain adaptation (medical knowledge injection) and instruction alignment in LLMs, particularly examining how weighted merging of LoRA adapters affects model behavior and output distributions.

Method: Two-stage LoRA pipeline: (1) domain-oriented pre-training (PT/DOPT) for medical knowledge injection, (2) supervised fine-tuning (SFT) for instruction following on medical QA. Form weighted adapter merge by linearly combining PT and SFT LoRA deltas before exporting single merged checkpoint. Use fixed generation evaluation with template disabling chain-of-thought.

Result: Adding PT signal reactivates latent “thinking” behavior and systematically shifts output distribution. Pure SFT achieves BLEU-4=17.84, while merged model (PT=0.3, SFT=0.7) drops to BLEU-4=6.50. Multiple-choice accuracy remains comparable (0.777 vs 0.778), MedQA improves from 0.664 to 0.681. Small pipeline mistakes can spuriously attribute SFT-only behavior to merged models.

Conclusion: Adapter interference is a significant issue in safety-critical LLM applications, requiring careful verification of merged weights and pipeline implementation. The paper provides lightweight merge-verification routine and full logs for reproducibility to address these challenges.

Abstract: Large language models (LLMs) can exhibit surprising \emph{adapter interference} when combining domain adaptation and instruction alignment in safety-critical settings. We study a 14B base model trained with a two-stage LoRA pipeline: (i) domain-oriented pre-training (PT/DOPT) for medical knowledge injection and (ii) supervised fine-tuning (SFT) for instruction following on medical QA. We then form a \emph{weighted adapter merge} by linearly combining PT and SFT LoRA deltas before exporting a single merged checkpoint for inference. We find that adding PT signal can reactivate latent ``thinking’’ behavior and systematically shift the output distribution even when training/evaluation templates attempt to disable chain-of-thought. Under a fixed generation evaluation (template \texttt{qwen3_nothink}, Temp=0.6, Top-$p$=0.8), pure SFT achieves BLEU-4=17.84 on our validation set, while the merged model (PT=0.3, SFT=0.7) drops to BLEU-4=6.50. Meanwhile multiple-choice accuracy remains comparable (avg 0.777 vs 0.778) and MedQA improves from 0.664 to 0.681. We further show that small pipeline mistakes (e.g., loading the wrong adapter, export-directory overwrite, or template mismatch) can spuriously attribute SFT-only behavior to merged models. We provide a lightweight merge-verification routine that numerically checks merged weights against the intended linear combination, along with full logs for reproducibility.

[291] Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning

Manjie Xu, Isabella Yin, Xinyi Tu, Chi Zhang, Yixin Zhu

Main category: cs.CL

TL;DR: Larger LLMs struggle with “Semantic Inertia” - inability to inhibit pre-trained priors when rules change, performing worse than smaller models. Representing dynamics as executable code instead of descriptive text reverses this trend and enables effective prior inhibition.

Details

Motivation: LLMs exhibit "Semantic Inertia" - they can't suppress pre-trained associations when dynamic, in-context rules contradict them. This is problematic for domains requiring flexible reasoning where rules change, as larger models paradoxically perform worse than smaller ones in these scenarios.

Method: Use Baba Is You game where physical laws are mutable text rules to evaluate models’ ability to override learned priors. Introduce Code-Grounded Vistas (LCV) which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, forcing attention to logical constraints rather than visual semantics. Represent dynamics as executable code instead of descriptive text.

Result: Larger models show inverse scaling - perform worse than smaller models when suppressing pre-trained associations. Code representation reverses this trend, enabling effective prior inhibition. LCV training approach outperforms expensive inference-time search methods in both efficiency and accuracy.

Conclusion: Representation fundamentally determines whether scaling improves or impairs contextual reasoning. Larger models aren’t universally better, especially for domains requiring dynamic overriding of learned priors. Code-based representations enable better inhibition of semantic inertia.

Abstract: LLMs struggle with Semantic Inertia: the inability to inhibit pre-trained priors (e.g., “Lava is Dangerous”) when dynamic, in-context rules contradict them. We probe this phenomenon using Baba Is You, where physical laws are mutable text rules, enabling precise evaluation of models’ ability to override learned priors when rules change. We quantatively observe that larger models can exhibit inverse scaling: they perform worse than smaller models when natural language reasoning requires suppressing pre-trained associations (e.g., accepting “Lava is Safe”). Our analysis attributes this to natural language encoding, which entangles descriptive semantics and logical rules, leading to persistent hallucinations of familiar physics despite explicit contradictory rules. Here we show that representing dynamics as executable code, rather than descriptive text, reverses this trend and enables effective prior inhibition. We introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, thereby forcing attention to logical constraints rather than visual semantics. This training-time approach outperforms expensive inference-time search methods in both efficiency and accuracy. Our results demonstrate that representation fundamentally determines whether scaling improves or impairs contextual reasoning. This challenges the assumption that larger models are universally better, with implications for domains that require dynamic overriding of learned priors.

[292] Zero-Shot Stance Detection in the Wild: Dynamic Target Generation and Multi-Target Adaptation

Aohua Li, Yuanshuo Zhang, Ge Gao, Bo Chen, Xiaobing Zhao

Main category: cs.CL

TL;DR: Zero-shot stance detection with dynamic target generation and multi-target adaptation for Chinese social media, using LLM fine-tuning approaches.

Details

Motivation: Real-world social media stance detection requires handling dynamic, undefined targets rather than predefined static ones, necessitating zero-shot approaches that can automatically identify multiple target-stance pairs from text.

Method: Proposes DGTA framework with dynamic target generation and multi-target adaptation. Explores integrated and two-stage fine-tuning strategies for LLMs on a constructed Chinese social media dataset with multi-dimensional evaluation metrics.

Result: Fine-tuned LLMs achieve strong performance: Qwen2.5-7B reaches 66.99% target recognition score, DeepSeek-R1-Distill-Qwen-7B achieves 79.26% stance detection F1 score.

Conclusion: LLM fine-tuning approaches effectively address zero-shot stance detection in the wild, demonstrating practical applicability for real-world social media scenarios with dynamic targets.

Abstract: Current stance detection research typically relies on predicting stance based on given targets and text. However, in real-world social media scenarios, targets are neither predefined nor static but rather complex and dynamic. To address this challenge, we propose a novel task: zero-shot stance detection in the wild with Dynamic Target Generation and Multi-Target Adaptation (DGTA), which aims to automatically identify multiple target-stance pairs from text without prior target knowledge. We construct a Chinese social media stance detection dataset and design multi-dimensional evaluation metrics. We explore both integrated and two-stage fine-tuning strategies for large language models (LLMs) and evaluate various baseline models. Experimental results demonstrate that fine-tuned LLMs achieve superior performance on this task: the two-stage fine-tuned Qwen2.5-7B attains the highest comprehensive target recognition score of 66.99%, while the integrated fine-tuned DeepSeek-R1-Distill-Qwen-7B achieves a stance detection F1 score of 79.26%.

[293] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

Kaiyuan Chen, Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang

Main category: cs.CL

TL;DR: SAPO introduces self-adaptive process optimization for small language models by minimizing the reasoner-verifier gap using error localization inspired by neuroscience, outperforming existing methods on math and code tasks.

Details

Motivation: Existing self-evolution methods overlook fine-grained reasoning steps, creating a reasoner-verifier gap. Monte Carlo process supervision is computationally inefficient, making it difficult to bridge this gap. Inspired by Error-Related Negativity in neuroscience where reasoners can localize errors after incorrect decisions, the authors aim to develop an efficient self-improvement method for small language models.

Method: Proposes Self-Adaptive Process Optimization (SAPO) that adaptively introduces process supervision signals by actively minimizing the reasoner-verifier gap. Instead of relying on inefficient Monte Carlo estimations, SAPO enables the reasoner to localize errors and make rapid adjustments, similar to Error-Related Negativity in human cognition.

Result: Extensive experiments show SAPO outperforms most existing self-evolution methods on challenging mathematics and code tasks. The work also introduces two new benchmarks for process reward models in both mathematical and coding domains to further investigate SAPO’s impact on verifier performance.

Conclusion: SAPO provides an efficient self-improvement method for small language models by addressing the reasoner-verifier gap through adaptive process optimization, demonstrating strong performance on complex reasoning tasks and establishing new benchmarks for process reward evaluation.

Abstract: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by the Error-Related Negativity (ERN), which the reasoner can localize error following incorrect decisions, guiding rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO’s impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.

[294] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin, Jiashun Liu, Wenbo Su, Bo Zheng, Xiaojun Wan

Main category: cs.CL

TL;DR: CE-RM-4B: A pointwise generative reward model trained with two-stage rollout and unified query-based criteria for better RL alignment and evaluation

Details

Motivation: LLM-as-a-Judge paradigms show promise for evaluation and as generative reward models for RL, but there's a gap between benchmark performance and actual RL effectiveness due to limitations like dominance of pairwise evaluation and inadequate optimization of evaluation criteria.

Method: Proposes CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method and adopting unified query-based criteria, using only about 5.7K high-quality data curated from open-source preference datasets.

Result: CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.

Conclusion: The proposed approach addresses limitations of existing LLM-as-a-Judge methods and demonstrates better alignment between benchmark performance and practical RL effectiveness.

Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.

[295] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao

Main category: cs.CL

TL;DR: VTC-R1: Vision-text compression for efficient long-context reasoning by rendering intermediate reasoning segments into compact images as “optical memory” for vision-language models.

Details

Motivation: Long-context reasoning in LLMs creates severe efficiency bottlenecks due to computational complexity. Existing approaches require complex training or external compression models, limiting scalability and losing fine-grained information.

Method: Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as “optical memory.” Fine-tunes VLMs (Glyph and Qwen3-VL) on OpenR1-Math-220K dataset achieving 3.4x token compression.

Result: Outperforms standard long-context reasoning on benchmarks (MATH500, AIME25, AMC23, GPQA-D). Achieves 2.7x speedup in end-to-end latency and 3.4x token compression.

Conclusion: VTC-R1 presents a scalable solution for reasoning-intensive applications by integrating vision-text compression into the reasoning process, improving both performance and efficiency.

Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to the computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as “optical memory.” We construct a training dataset based on OpenR1-Math-220K achieving 3.4x token compression and fine-tune representative VLMs-Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at https://github.com/w-yibo/VTC-R1.

[296] Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs

Afrozah Nadeem, Agrima, Mehwish Nasim, Usman Naseem

Main category: cs.CL

TL;DR: Multilingual evaluation of political bias in LLMs across 50 countries/33 languages, with Cross-Lingual Alignment Steering (CLAS) framework for post-hoc mitigation that aligns ideological representations across languages while preserving response quality.

Details

Motivation: LLMs shape global discourse but political bias evaluation has focused on Western languages, leaving cross-lingual consistency and effective mitigation underexplored. Need for fairness-aware multilingual LLM governance that balances ideological neutrality with linguistic diversity.

Method: Large-scale multilingual evaluation across 50 countries/33 languages. Introduces Cross-Lingual Alignment Steering (CLAS) framework that: 1) aligns latent ideological representations from political prompts into shared ideological subspace, 2) uses adaptive mechanism to prevent over-correction, 3) dynamically regulates intervention strength for cross-lingual consistency.

Result: Substantial bias reduction along both economic and social axes with minimal degradation in response quality. Framework establishes scalable, interpretable paradigm for fairness-aware multilingual LLM governance.

Conclusion: CLAS provides effective post-hoc mitigation for political bias in multilingual LLMs, balancing ideological neutrality with linguistic/cultural diversity while maintaining response quality. Addresses critical gap in cross-lingual fairness evaluation and mitigation.

Abstract: Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross lingual consistency, with the adaptive mechanism prevents over correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.

cs.CV

[297] EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Weiyu Sun, Liangliang Chen, Yongnuo Cai, Huiru Xie, Yi Zeng, Ying Zhang

Main category: cs.CV

TL;DR: EDU-CIRCUIT-HW dataset reveals MLLMs’ poor reliability in recognizing authentic student handwritten STEM solutions, with latent failures in both recognition fidelity and auto-grading performance.

Details

Motivation: MLLMs show promise for education but struggle with authentic student handwritten solutions containing mixed mathematical formulas, diagrams, and text. Current benchmarks lack domain-specific authentic data, and evaluation focuses only on downstream tasks, missing holistic understanding of handwritten logic.

Method: Released EDU-CIRCUIT-HW dataset with 1,300+ authentic student handwritten solutions from university STEM courses. Used expert-verified transcriptions and grading reports to evaluate MLLMs’ upstream recognition fidelity and downstream auto-grading performance simultaneously.

Result: Uncovered astonishing scale of latent failures in MLLM-recognized content, showing insufficient reliability for high-stakes educational applications. Demonstrated that preemptive error detection and rectification with minimal human intervention (4% of solutions) can significantly enhance grading system robustness.

Conclusion: Current MLLMs are unreliable for auto-grading authentic student handwritten STEM solutions. Proactive error detection and correction with minimal human oversight can improve system robustness for educational applications.

Abstract: Multimodal Large Language Models (MLLMs) hold significant promise for revolutionizing traditional education and reducing teachers’ workload. However, accurately interpreting unconstrained STEM student handwritten solutions with intertwined mathematical formulas, diagrams, and textual reasoning poses a significant challenge due to the lack of authentic and domain-specific benchmarks. Additionally, current evaluation paradigms predominantly rely on the outcomes of downstream tasks (e.g., auto-grading), which often probe only a subset of the recognized content, thereby failing to capture the MLLMs’ understanding of complex handwritten logic as a whole. To bridge this gap, we release EDU-CIRCUIT-HW, a dataset consisting of 1,300+ authentic student handwritten solutions from a university-level STEM course. Utilizing the expert-verified verbatim transcriptions and grading reports of student solutions, we simultaneously evaluate various MLLMs’ upstream recognition fidelity and downstream auto-grading performance. Our evaluation uncovers an astonishing scale of latent failures within MLLM-recognized student handwritten content, highlighting the models’ insufficient reliability for auto-grading and other understanding-oriented applications in high-stakes educational settings. In solution, we present a case study demonstrating that leveraging identified error patterns to preemptively detect and rectify recognition errors, with only minimal human intervention (approximately 4% of the total solutions), can significantly enhance the robustness of the deployed AI-enabled grading system on unseen student solutions.

[298] Mirage2Matter: A Physically Grounded Gaussian World Model from Video

Zhengqing Gao, Ziwen Li, Xin Wang, Jiaxin Huang, Zhenyang Ren, Mingkai Shao, Hanlue Zhang, Tianyu Huang, Yongkang Cheng, Yandong Guo, Runqi Lin, Yuanyuan Wang, Tongliang Liu, Kun Zhang, Mingming Gong

Main category: cs.CV

TL;DR: Simulate Anything: A framework for generating high-fidelity embodied training data from multi-view videos using 3D Gaussian Splatting reconstruction and generative models for physics simulation.

Details

Motivation: To address the scarcity of real-world interaction data for embodied intelligence training by creating scalable simulation methods that bridge the visual and physical gap between simulation and reality without requiring expensive sensors or precise calibration.

Method: Uses 3D Gaussian Splatting (3DGS) to reconstruct photorealistic scene representations from multi-view videos, then applies generative models to recover physically realistic representations, integrates them into simulation via precision calibration targets for accurate scale alignment.

Result: Vision Language Action (VLA) models trained on simulated data achieve strong zero-shot performance on downstream tasks, matching or surpassing results obtained with real-world data.

Conclusion: Reconstruction-driven world modeling enables scalable and practical embodied intelligence training, demonstrating that simulated data can effectively replace or complement real-world data for training multimodal models.

Abstract: The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.

[299] R3G: A Reasoning–Retrieval–Reranking Framework for Vision-Centric Answer Generation

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang

Main category: cs.CV

TL;DR: R3G is a modular framework for vision-centric VQA that uses reasoning plans to guide retrieval and reranking of evidence images, improving accuracy across multiple MLLM backbones.

Details

Motivation: Vision-centric VQA requires retrieving relevant images to supply missing visual cues, but selecting the right images and integrating them effectively into reasoning remains challenging.

Method: Proposes R3G: Reasoning-Retrieval-Reranking framework that first generates reasoning plans specifying required visual cues, then uses two-stage strategy with coarse retrieval followed by fine-grained reranking to select evidence images.

Result: Improves accuracy across six MLLM backbones and nine sub-scenarios on MRAG-Bench, achieving state-of-the-art overall performance. Ablations show sufficiency-aware reranking and reasoning steps are complementary.

Conclusion: R3G effectively addresses the challenge of selecting and integrating evidence images for vision-centric VQA through a modular reasoning-guided retrieval approach.

Abstract: Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process. However, selecting the right images and integrating them effectively into the model’s reasoning remains challenging.To address this challenge, we propose R3G, a modular Reasoning-Retrieval-Reranking framework.It first produces a brief reasoning plan that specifies the required visual cues, then adopts a two-stage strategy, with coarse retrieval followed by fine-grained reranking, to select evidence images.On MRAG-Bench, R3G improves accuracy across six MLLM backbones and nine sub-scenarios, achieving state-of-the-art overall performance. Ablations show that sufficiency-aware reranking and reasoning steps are complementary, helping the model both choose the right images and use them well. We release code and data at https://github.com/czh24/R3G.

[300] VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Zhenyu Guan, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Yixuan Yuan, Tianyu Zong, Xinming Wang, Tao Yu, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani Xinyu Zuo, Jungang Xu

Main category: cs.CV

TL;DR: VDE Bench is a benchmark for evaluating multimodal image editing models on multilingual, densely textual visual documents, addressing limitations of existing approaches that focus mainly on English and sparse layouts.

Details

Motivation: Current multimodal image editing models lack proper evaluation for visual document editing, especially for multilingual content and dense, complex document layouts. Existing approaches focus on English and sparse text, failing to address real-world documents with dense text or non-Latin scripts like Chinese.

Method: Proposes VDE Bench: 1) A human-annotated benchmark dataset with dense textual documents in English and Chinese (academic papers, posters, slides, exams, newspapers), 2) A decoupled evaluation framework using OCR parsing for fine-grained assessment of text modification accuracy.

Result: Comprehensive evaluation of state-of-the-art image editing models shows strong consistency between human judgments and automated metrics. The benchmark enables systematic assessment of multilingual document editing capabilities.

Conclusion: VDE Bench is the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents, addressing a critical gap in multimodal editing research.

Abstract: In recent years, multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content through natural language in a flexible and interactive manner. Nevertheless, an important yet insufficiently explored research direction remains visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts, thereby failing to adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose \textbf{V}isual \textbf{D}oc \textbf{E}dit Bench(VDE Bench), a rigorously human-annotated and evaluated benchmark specifically designed to assess image editing models on multilingual and complex visual document editing tasks. The benchmark comprises a high-quality dataset encompassing densely textual documents in both English and Chinese, including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a decoupled evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative state-of-the-art image editing models. Manual verification demonstrates a strong consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating image editing models on multilingual and densely textual visual documents.

[301] HYPE-EDIT-1: Benchmark for Measuring Reliability in Frontier Image Editing Models

Wing Chan, Richard Allen

Main category: cs.CV

TL;DR: HYPE-EDIT-1 is a 100-task benchmark for evaluating image editing models on real-world marketing/design tasks with binary pass/fail criteria, measuring per-attempt success rates and effective costs including retries and human review time.

Details

Motivation: Current image editing model demos show best-case samples, but real workflows involve retries and review time. There's a need for realistic evaluation that accounts for these practical costs and failure rates in reference-based editing tasks.

Method: Created a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task, generated 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under retry caps, and effective cost per successful edit combining model price with human review time. Released 50 public tasks and kept 50-task private split for server-side evaluation, with standardized JSON schema and tooling for VLM and human-based judging.

Result: Across evaluated models, per-attempt pass rates ranged 34-83% and effective cost per success spanned USD 0.66-1.42. Models with low per-image pricing became more expensive when considering total effective costs of retries and human reviews.

Conclusion: Real-world evaluation of image editing models must account for retry rates and human review costs, not just per-image pricing. The HYPE-EDIT-1 benchmark provides a practical framework for assessing model performance in production workflows.

Abstract: Public demos of image editing models are typically best-case samples; real workflows pay for retries and review time. We introduce HYPE-EDIT-1, a 100-task benchmark of reference-based marketing/design edits with binary pass/fail judging. For each task we generate 10 independent outputs to estimate per-attempt pass rate, pass@10, expected attempts under a retry cap, and an effective cost per successful edit that combines model price with human review time. We release 50 public tasks and maintain a 50-task held-out private split for server-side evaluation, plus a standardized JSON schema and tooling for VLM and human-based judging. Across the evaluated models, per-attempt pass rates span 34-83 percent and effective cost per success spans USD 0.66-1.42. Models that have low per-image pricing are more expensive when you consider the total effective cost of retries and human reviews.

Yuan Gao, Xinyu Guo, Wenjing Xie, Zifan Wang, Hongwen Yu, Gongyang Li, Shugong Xu

Main category: cs.CV

TL;DR: A multi-modal UAV trajectory prediction method fusing LiDAR and millimeter-wave radar data using a deep fusion network with bidirectional cross-attention to improve prediction accuracy by 40% over baseline.

Details

Motivation: To address the need for managing unauthorized UAVs in the low-altitude economy by developing accurate trajectory prediction through fusion of complementary LiDAR and radar information.

Method: Multi-Modal Deep Fusion Framework with two modality-specific feature extraction networks (LiDAR and radar) and a bidirectional cross-attention fusion module to exploit complementary spatial geometric and dynamic reflection characteristics.

Result: The model achieves 40% improvement in trajectory prediction accuracy compared to baseline, validated on the MMAUD dataset from CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge.

Conclusion: The proposed multi-modal fusion model effectively utilizes complementary sensor data and provides an efficient solution for unauthorized UAV trajectory prediction in low-altitude applications.

Abstract: To meet the requirements for managing unauthorized UAVs in the low-altitude economy, a multi-modal UAV trajectory prediction method based on the fusion of LiDAR and millimeter-wave radar information is proposed. A deep fusion network for multi-modal UAV trajectory prediction, termed the Multi-Modal Deep Fusion Framework, is designed. The overall architecture consists of two modality-specific feature extraction networks and a bidirectional cross-attention fusion module, aiming to fully exploit the complementary information of LiDAR and radar point clouds in spatial geometric structure and dynamic reflection characteristics. In the feature extraction stage, the model employs independent but structurally identical feature encoders for LiDAR and radar. After feature extraction, the model enters the Bidirectional Cross-Attention Mechanism stage to achieve information complementarity and semantic alignment between the two modalities. To verify the effectiveness of the proposed model, the MMAUD dataset used in the CVPR 2024 UG2+ UAV Tracking and Pose-Estimation Challenge is adopted as the training and testing dataset. Experimental results show that the proposed multi-modal fusion model significantly improves trajectory prediction accuracy, achieving a 40% improvement compared to the baseline model. In addition, ablation experiments are conducted to demonstrate the effectiveness of different loss functions and post-processing strategies in improving model performance. The proposed model can effectively utilize multi-modal data and provides an efficient solution for unauthorized UAV trajectory prediction in the low-altitude economy.

[303] SITUATE – Synthetic Object Counting Dataset for VLM training

René Peinl, Vincent Tischler, Patrick Schröder, Christian Groth

Main category: cs.CV

TL;DR: SITUATE is a novel dataset for training/evaluating Vision Language Models on counting tasks with spatial constraints, bridging the gap between simple 2D datasets and ambiguous real-life datasets.

Details

Motivation: Existing counting datasets have limitations: simple 2D datasets lack real-world complexity, while real-life datasets lack control over occlusions and spatial composition. There's a need for a dataset that provides spatial constraints while maintaining realistic complexity.

Method: Created SITUATE dataset with spatial constraints for counting tasks. Conducted experiments by fine-tuning Qwen VL 2.5 7B model on SITUATE and comparing performance with Pixmo count test data and other established counting benchmarks.

Result: Fine-tuning on SITUATE improves accuracy on Pixmo count test data, but not vice versa. The dataset helps improve generalization for out-of-distribution images compared to equally sized fine-tuning sets from Pixmo count.

Conclusion: SITUATE dataset effectively bridges the gap between simple and ambiguous counting datasets, enabling better training of Vision Language Models for counting tasks with spatial constraints and improving out-of-distribution generalization.

Abstract: We present SITUATE, a novel dataset designed for training and evaluating Vision Language Models on counting tasks with spatial constraints. The dataset bridges the gap between simple 2D datasets like VLMCountBench and often ambiguous real-life datasets like TallyQA, which lack control over occlusions and spatial composition. Experiments show that our dataset helps to improve generalization for out-of-distribution images, since a finetune of Qwen VL 2.5 7B on SITUATE improves accuracy on the Pixmo count test data, but not vice versa. We cross validate this by comparing the model performance across established other counting benchmarks and against an equally sized fine-tuning set derived from Pixmo count.

[304] GTATrack: Winner Solution to SoccerTrack 2025 with Deep-EIoU and Global Tracklet Association

Rong-Lin Jian, Ming-Chi Luo, Chen-Wei Huang, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu

Main category: cs.CV

TL;DR: GTATrack: A hierarchical tracking framework for multi-object tracking in sports using fisheye cameras, winning SoccerTrack Challenge 2025 with Deep Expansion IoU for motion-agnostic association and Global Tracklet Association for trajectory refinement.

Details

Motivation: Multi-object tracking in sports is challenging due to irregular motion, uniform appearances, frequent occlusions, and additional difficulties from fisheye camera distortion and extreme scale variation.

Method: Two-stage hierarchical framework: 1) Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association, 2) Global Tracklet Association (GTA) for trajectory-level refinement, plus pseudo-labeling to boost detector recall on small/distorted targets.

Result: Achieved winning HOTA score of 0.60 in SoccerTrack Challenge 2025, significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking.

Conclusion: The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation in challenging sports tracking scenarios with fisheye cameras.

Abstract: Multi-object tracking (MOT) in sports is highly challenging due to irregular player motion, uniform appearances, and frequent occlusions. These difficulties are further exacerbated by the geometric distortion and extreme scale variation introduced by static fisheye cameras. In this work, we present GTATrack, a hierarchical tracking framework that win first place in the SoccerTrack Challenge 2025. GTATrack integrates two core components: Deep Expansion IoU (Deep-EIoU) for motion-agnostic online association and Global Tracklet Association (GTA) for trajectory-level refinement. This two-stage design enables both robust short-term matching and long-term identity consistency. Additionally, a pseudo-labeling strategy is used to boost detector recall on small and distorted targets. The synergy between local association and global reasoning effectively addresses identity switches, occlusions, and tracking fragmentation. Our method achieved a winning HOTA score of 0.60 and significantly reduced false positives to 982, demonstrating state-of-the-art accuracy in fisheye-based soccer tracking. Our code is available at https://github.com/ron941/GTATrack-STC2025.

[305] Robustness of Presentation Attack Detection in Remote Identity Validation Scenarios

John J. Howard, Richard O. Plesh, Yevgeniy B. Sirotin, Jerry L. Tipton, Arun R. Vemury

Main category: cs.CV

TL;DR: Commercial presentation attack detection systems show significant performance degradation in low-light and auto-capture scenarios, with error rates increasing up to 4x, highlighting the need for more robust testing across diverse environmental conditions.

Details

Motivation: Presentation attack detection (PAD) is crucial for remote identity validation systems, but ensuring robust performance across diverse environmental and procedural conditions remains a challenge, particularly in real-world scenarios like low-light conditions and automated image acquisition workflows.

Method: The paper investigates the impact of low-light conditions and automated image acquisition on commercial PAD systems using scenario testing of remote identity validation systems, comparing performance across different environmental and procedural conditions.

Result: PAD systems experience significant performance decline in low-light or auto-capture scenarios, with model-predicted error rates increasing by about 4x under low-light conditions and doubling under auto-capture workflows. Only one tested system maintained robust performance with maximum bona fide presentation classification error rate below 3% across all scenarios.

Conclusion: Testing across diverse environments is essential to ensure robust and reliable PAD performance in real-world applications, as most commercial systems show vulnerability to common environmental and procedural variations.

Abstract: Presentation attack detection (PAD) subsystems are an important part of effective and user-friendly remote identity validation (RIV) systems. However, ensuring robust performance across diverse environmental and procedural conditions remains a critical challenge. This paper investigates the impact of low-light conditions and automated image acquisition on the robustness of commercial PAD systems using a scenario test of RIV. Our results show that PAD systems experience a significant decline in performance when utilized in low-light or auto-capture scenarios, with a model-predicted increase in error rates by a factor of about four under low-light conditions and a doubling of those odds under auto-capture workflows. Specifically, only one of the tested systems was robust to these perturbations, maintaining a maximum bona fide presentation classification error rate below 3% across all scenarios. Our findings emphasize the importance of testing across diverse environments to ensure robust and reliable PAD performance in real-world applications.

[306] DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification

Ying Shu, Pujian Zhan, Huiqi Yang, Hehe Fan, Youfang Lin, Kai Lv

Main category: cs.CV

TL;DR: DRFormer integrates DINO’s local texture mining with CLIP’s global semantic understanding for person re-identification, using dual-regularized bidirectional transformer to balance both feature types.

Details

Motivation: Person re-identification faces challenges like occlusion and pose variations that require both fine-grained discriminative details and global semantic features. While DINO excels at local textures and CLIP captures global semantics, existing methods use only one paradigm, missing benefits of integration.

Method: Proposes Dual-Regularized Bidirectional Transformer (DRFormer) that synergizes DINO and CLIP architectures through dual-regularization mechanism ensuring diverse feature extraction and balanced contributions from both models.

Result: Extensive experiments on five benchmarks show the method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.

Conclusion: The integration of vision foundation models (DINO) and vision-language models (CLIP) through DRFormer framework successfully addresses person re-identification challenges by combining complementary local and global feature representations.

Abstract: Both fine-grained discriminative details and global semantic features can contribute to solving person re-identification challenges, such as occlusion and pose variations. Vision foundation models (\textit{e.g.}, DINO) excel at mining local textures, and vision-language models (\textit{e.g.}, CLIP) capture strong global semantic difference. Existing methods predominantly rely on a single paradigm, neglecting the potential benefits of their integration. In this paper, we analyze the complementary roles of these two architectures and propose a framework to synergize their strengths by a \textbf{D}ual-\textbf{R}egularized Bidirectional \textbf{Transformer} (\textbf{DRFormer}). The dual-regularization mechanism ensures diverse feature extraction and achieves a better balance in the contributions of the two models. Extensive experiments on five benchmarks show that our method effectively harmonizes local and global representations, achieving competitive performance against state-of-the-art methods.

[307] Observing Health Outcomes Using Remote Sensing Imagery and Geo-Context Guided Visual Transformer

Yu Li, Guilherme N. DeSouza, Praveen Rao, Chi-Ren Shyu

Main category: cs.CV

TL;DR: A novel multimodal transformer model that enhances remote sensing image analysis by incorporating geospatial data through aligned embeddings and guided attention mechanisms.

Details

Motivation: Current vision-language models for remote sensing focus on semantic alignment between visual and textual content but lack structured geospatial understanding, limiting their ability to reason with geospatial layers and auxiliary spatial data.

Method: Proposes a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches spatially aligned with image patches, and a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data.

Result: The proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, demonstrating effectiveness in multimodal geospatial understanding.

Conclusion: The model successfully bridges the gap between visual transformers and geospatial reasoning by incorporating structured geospatial information through novel embedding and attention mechanisms, enabling more effective multimodal analysis of remote sensing data.

Abstract: Visual transformers have driven major progress in remote sensing image analysis, particularly in object detection and segmentation. Recent vision-language and multimodal models further extend these capabilities by incorporating auxiliary information, including captions, question and answer pairs, and metadata, which broadens applications beyond conventional computer vision tasks. However, these models are typically optimized for semantic alignment between visual and textual content rather than geospatial understanding, and therefore are not suited for representing or reasoning with structured geospatial layers. In this study, we propose a novel model that enhances remote sensing imagery processing with guidance from auxiliary geospatial information. Our approach introduces a geospatial embedding mechanism that transforms diverse geospatial data into embedding patches that are spatially aligned with image patches. To facilitate cross-modal interaction, we design a guided attention module that dynamically integrates multimodal information by computing attention weights based on correlations with auxiliary data, thereby directing the model toward the most relevant regions. In addition, the module assigns distinct roles to individual attention heads, allowing the model to capture complementary aspects of the guidance information and improving the interpretability of its predictions. Experimental results demonstrate that the proposed framework outperforms existing pretrained geospatial foundation models in predicting disease prevalence, highlighting its effectiveness in multimodal geospatial understanding.

[308] From Manual Observation to Automated Monitoring: Space Allowance Effects on Play Behaviour in Group-Housed Dairy Calves

Haiyu Yang, Heidi Lesscher, Enhong Liu, Miel Hostens

Main category: cs.CV

TL;DR: Computer vision pipeline developed to monitor dairy calf play behavior, finding optimal space allowance of 8-10 m² per calf for welfare benefits.

Details

Motivation: To understand the relationship between space allowance and play behavior in dairy calves under commercial conditions, and to develop automated monitoring systems for scalable welfare assessment.

Method: Studied 60 group-housed dairy calves across 14 farms with space range 2.66-17.98 m² per calf; used detailed ethogram for manual video analysis; developed computer vision pipeline trained on 108 hours of manual annotations from 6 farms; employed linear mixed models with farm as random effect.

Result: Computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection; calves spent average 1.0% of observation period playing (~10 minutes per 17-hour period); space-play relationship was non-linear with highest play at 8-10 m² per calf (1.6% OP) and lowest at 6-8 m² and 12-14 m² (<0.6% OP); space remained significant after controlling for age, health, and group size.

Conclusion: 8-10 m² per calf represents optimal space allowance balancing welfare benefits with economic feasibility; automated computer vision monitoring can scale small annotation projects to continuous welfare assessment systems.

Abstract: Play behaviour serves as a positive welfare indicator in dairy calves, yet the influence of space allowance under commercial conditions remains poorly characterized, particularly at intermediate-to-high allowances (6-20 m2 per calf). This study investigated the relationship between space allowance and play behaviour in 60 group-housed dairy calves across 14 commercial farms in the Netherlands (space range: 2.66-17.98 m2 per calf), and developed an automated computer vision pipeline for scalable monitoring. Video observations were analyzed using a detailed ethogram, with play expressed as percentage of observation period (%OP). Statistical analysis employed linear mixed models with farm as a random effect. A computer vision pipeline was trained on manual annotations from 108 hours on 6 farms and validated on held-out test data. The computer vision classifier achieved 97.6% accuracy with 99.4% recall for active play detection. Calves spent on average 1.0% of OP playing reflecting around 10 minutes per 17-hour period. The space-play relationship was non-linear, with highest play levels at 8-10 m2 per calf (1.6% OP) and lowest at 6-8 m2 and 12-14 m2 (<0.6% OP). Space remained significant after controlling for age, health, and group size. In summary, these findings suggest that 8-10 m2 per calf represents a practical target balancing welfare benefits with economic feasibility, and demonstrate that automated monitoring can scale small annotation projects to continuous welfare assessment systems.

[309] AI-Driven Three-Dimensional Reconstruction and Quantitative Analysis for Burn Injury Assessment

S. Kalaycioglu, C. Hong, K. Zhai, H. Xie, J. N. Wong

Main category: cs.CV

TL;DR: AI platform for burn assessment using multi-view photogrammetry and 3D reconstruction to objectively measure burn metrics and track healing progression over time.

Details

Motivation: Current burn assessment methods using visual inspection and 2D photography are subjective and inadequate for longitudinal comparison, lacking objective metrics for treatment planning and monitoring.

Method: Integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation to reconstruct patient-specific 3D burn surfaces from standard camera images, enabling computation of objective metrics like surface area, TBSA, and volumetric changes.

Result: System demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting objective tracking of wound contraction and depth reduction over time.

Conclusion: The platform provides a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in clinical settings.

Abstract: Accurate, reproducible burn assessment is critical for treatment planning, healing monitoring, and medico-legal documentation, yet conventional visual inspection and 2D photography are subjective and limited for longitudinal comparison. This paper presents an AI-enabled burn assessment and management platform that integrates multi-view photogrammetry, 3D surface reconstruction, and deep learning-based segmentation within a structured clinical workflow. Using standard multi-angle images from consumer-grade cameras, the system reconstructs patient-specific 3D burn surfaces and maps burn regions onto anatomy to compute objective metrics in real-world units, including surface area, TBSA, depth-related geometric proxies, and volumetric change. Successive reconstructions are spatially aligned to quantify healing progression over time, enabling objective tracking of wound contraction and depth reduction. The platform also supports structured patient intake, guided image capture, 3D analysis and visualization, treatment recommendations, and automated report generation. Simulation-based evaluation demonstrates stable reconstructions, consistent metric computation, and clinically plausible longitudinal trends, supporting a scalable, non-invasive approach to objective, geometry-aware burn assessment and decision support in acute and outpatient care.

[310] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization

Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen

Main category: cs.CV

TL;DR: 1S-DAug is a one-shot generative augmentation method that synthesizes diverse image variants from a single example at test time using geometric perturbations, noise injection, and diffusion processes to improve few-shot learning performance.

Details

Motivation: Traditional test-time augmentations fail in few-shot learning scenarios where only a few labeled examples are available. There's a need for augmentation methods that can generate diverse yet faithful image variants from minimal data to improve model generalization in FSL settings.

Method: 1S-DAug combines geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. It generates augmented images, encodes them, and aggregates their representations with the original image’s representation for more robust predictions.

Result: The method consistently improves few-shot learning across 4 standard benchmarks without model parameter updates, achieving over 10% proportional accuracy improvement on miniImagenet 5-way-1-shot benchmark.

Conclusion: 1S-DAug is an effective training-free, model-agnostic plugin for few-shot learning that enhances generalization by generating diverse yet faithful image variants from minimal data at test time.

Abstract: Few-shot learning (FSL) challenges model generalization to novel classes based on just a few shots of labeled examples, a testbed where traditional test-time augmentations fail to be effective. We introduce 1S-DAug, a one-shot generative augmentation operator that synthesizes diverse yet faithful variants from just one example image at test time. 1S-DAug couples traditional geometric perturbations with controlled noise injection and a denoising diffusion process conditioned on the original image. The generated images are then encoded and aggregated, alongside the original image, into a combined representation for more robust FSL predictions. Integrated as a training-free model-agnostic plugin, 1S-DAug consistently improves FSL across standard benchmarks of 4 different datasets without any model parameter update, including achieving over 10% proportional accuracy improvement on the miniImagenet 5-way-1-shot benchmark. Codes will be released.

[311] Event Driven Clustering Algorithm

David El-Chai Ben-Ezra, Adar Tal, Daniel Brisk

Main category: cs.CV

TL;DR: Novel asynchronous event-driven algorithm for real-time detection of small event clusters in event camera data with linear O(n) complexity.

Details

Motivation: Event cameras produce asynchronous, sparse data streams that require efficient real-time processing algorithms. Existing clustering methods may not fully leverage the unique data structure of event cameras or may have computational complexity that scales poorly with data dimensions.

Method: Hierarchical agglomerative clustering algorithm that operates asynchronously and event-driven, using tempo-spatial distance for cluster detection. The algorithm employs sophisticated yet simple decision-making to achieve linear O(n) complexity where n is the number of events, with runtime independent of pixel array dimensions.

Result: The algorithm achieves efficient real-time detection of small event clusters in event camera data with linear computational complexity, making it suitable for real-time applications where computational efficiency is critical.

Conclusion: The proposed algorithm provides an efficient solution for real-time event cluster detection in event camera data, leveraging the unique asynchronous nature of the data structure while maintaining computational efficiency through linear complexity.

Abstract: This paper introduces a novel asynchronous, event-driven algorithm for real-time detection of small event clusters in event camera data. Like other hierarchical agglomerative clustering algorithms, the algorithm detects the event clusters based on their tempo-spatial distance. However, the algorithm leverages the special asynchronous data structure of event camera, and by a sophisticated, efficient and simple decision-making, enjoys a linear complexity of $O(n)$ where $n$ is the events amount. In addition, the run-time of the algorithm is independent with the dimensions of the pixels array.

[312] IC-EO: Interpretable Code-based assistant for Earth Observation

Lamia Lahouel, Laurynas Lopata, Simon Gruening, Gabriele Meoni, Gaetan Petit, Sylvain Lobry

Main category: cs.CV

TL;DR: A conversational code-generating agent for Earth Observation analysis that transforms natural language queries into executable Python workflows, enabling transparent and reproducible geospatial analysis.

Details

Motivation: Earth Observation analysis is difficult for non-experts, requiring specialized knowledge and technical skills. Current systems often provide black-box predictions that are hard to audit or reproduce. There's a need for accessible, transparent tools that allow laymen to perform complex EO analysis through natural language.

Method: The study proposes a conversational agent powered by tool LLMs that converts natural language queries into executable Python code. The agent operates over a unified, extensible API supporting classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. The framework allows control at three levels: tool-level performance on EO benchmarks, agent-level code generation quality, and task-level use case evaluation.

Result: The agent outperformed general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition mapping and 50% vs. 0% on post-wildfire damage assessment. The approach produces transparent, interpretable results through verifiable code output.

Conclusion: The proposed conversational code-generating agent makes Earth Observation analysis accessible to non-experts while maintaining transparency and reproducibility through code-based workflows, addressing key limitations of black-box EO systems.

Abstract: Despite recent advances in computer vision, Earth Observation (EO) analysis remains difficult to perform for the laymen, requiring expert knowledge and technical capabilities. Furthermore, many systems return black-box predictions that are difficult to audit or reproduce. Leveraging recent advances in tool LLMs, this study proposes a conversational, code-generating agent that transforms natural-language queries into executable, auditable Python workflows. The agent operates over a unified easily extendable API for classification, segmentation, detection (oriented bounding boxes), spectral indices, and geospatial operators. With our proposed framework, it is possible to control the results at three levels: (i) tool-level performance on public EO benchmarks; (ii) at the agent-level to understand the capacity to generate valid, hallucination-free code; and (iii) at the task-level on specific use cases. In this work, we select two use-cases of interest: land-composition mapping and post-wildfire damage assessment. The proposed agent outperforms general-purpose LLM/VLM baselines (GPT-4o, LLaVA), achieving 64.2% vs. 51.7% accuracy on land-composition and 50% vs. 0% on post-wildfire analysis, while producing results that are transparent and easy to interpret. By outputting verifiable code, the approach turns EO analysis into a transparent, reproducible process.

[313] TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models

Xin Jin, Yichuan Zhong, Yapeng Tian

Main category: cs.CV

TL;DR: TP-Blend is a training-free diffusion framework that enables simultaneous object replacement and style transfer using two separate textual prompts through complementary attention mechanisms.

Details

Motivation: Current text-conditioned diffusion editors handle single object replacement well but struggle when both a new object and new style must be introduced simultaneously, requiring a solution that can precisely control both content and appearance in a single editing operation.

Method: TP-Blend uses two attention processors: Cross-Attention Object Fusion (CAOF) averages head-wise attention to locate spatial tokens, solves an entropy-regularized optimal transport problem to reassign multi-head feature vectors, and updates at full combined dimensionality. Self-Attention Style Fusion (SASF) injects style through Detail-Sensitive Instance Normalization with Gaussian filtering to separate low/high frequencies, and swaps Key/Value matrices from style prompt for context-aware texture modulation.

Result: TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.

Conclusion: TP-Blend provides an effective lightweight training-free framework for simultaneous object replacement and style transfer in diffusion models, enabling precise control over both content and appearance through complementary attention mechanisms.

Abstract: Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.

[314] Context-Aware Autoencoders for Anomaly Detection in Maritime Surveillance

Divya Acharya, Pierre Bernab’e, Antoine Chevrot, Helge Spieker, Arnaud Gotlieb, Bruno Legeard

Main category: cs.CV

TL;DR: A context-aware autoencoder approach for maritime anomaly detection that incorporates vessel-specific context to improve detection of collective and contextual anomalies in AIS data.

Details

Motivation: Current autoencoder methods for maritime anomaly detection are limited in identifying collective and contextual anomalies, which depend on vessel-specific contexts from AIS messages. There's a need for more accurate anomaly detection in maritime vessel traffic surveillance systems.

Method: Proposes a context-aware autoencoder that integrates context-specific thresholds. Compares four variants of context-aware autoencoders against a conventional autoencoder, focusing on fishing status anomalies in maritime surveillance using AIS data.

Result: The context-aware autoencoder outperforms other methods in detecting anomalies in time series data. Results show significant impact of context on reconstruction loss and anomaly detection accuracy, with improved detection performance and reduced computational cost.

Conclusion: Incorporating context-specific thresholds and recognizing the importance of context in anomaly detection offers a promising solution to improve accuracy in maritime vessel traffic surveillance systems.

Abstract: The detection of anomalies is crucial to ensuring the safety and security of maritime vessel traffic surveillance. Although autoencoders are popular for anomaly detection, their effectiveness in identifying collective and contextual anomalies is limited, especially in the maritime domain, where anomalies depend on vessel-specific contexts derived from self-reported AIS messages. To address these limitations, we propose a novel solution: the context-aware autoencoder. By integrating context-specific thresholds, our method improves detection accuracy and reduces computational cost. We compare four context-aware autoencoder variants and a conventional autoencoder using a case study focused on fishing status anomalies in maritime surveillance. Results demonstrate the significant impact of context on reconstruction loss and anomaly detection. The context-aware autoencoder outperforms others in detecting anomalies in time series data. By incorporating context-specific thresholds and recognizing the importance of context in anomaly detection, our approach offers a promising solution to improve accuracy in maritime vessel traffic surveillance systems.

[315] TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu, Wenhua Wu, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua

Main category: cs.CV

TL;DR: TraceRouter is a path-level defense framework that traces and disconnects causal propagation circuits of harmful semantics in large foundation models, achieving better adversarial robustness-utility trade-off than localized suppression methods.

Details

Motivation: Current defenses for large foundation models rely on the "locality hypothesis" and suppress isolated neurons/features, but harmful semantics act as distributed, cross-layer circuits, making localized interventions brittle and detrimental to utility.

Method: Three-stage framework: (1) pinpoint sensitive onset layer via attention divergence analysis, (2) use sparse autoencoders and differential activation analysis to disentangle/isolate malicious features, (3) map features to downstream causal pathways via feature influence scores from zero-out interventions, then selectively suppress these causal chains.

Result: TraceRouter significantly outperforms state-of-the-art baselines, achieving superior trade-off between adversarial robustness and general utility in extensive experiments.

Conclusion: TraceRouter provides an effective path-level defense framework that physically severs harmful information flow while preserving orthogonal computation routes, addressing limitations of localized suppression approaches.

Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the “locality hypothesis”, suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose \textbf{TraceRouter}, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.

[316] D3R-Net: Dual-Domain Denoising Reconstruction Network for Robust Industrial Anomaly Detection

Dmytro Filatov, Valentyn Fedorov, Vira Filatova, Andrii Zelenchuk

Main category: cs.CV

TL;DR: D3R-Net: A dual-domain denoising reconstruction framework for unsupervised anomaly detection that combines spatial and frequency domain losses to improve defect segmentation accuracy in manufacturing visual inspection.

Details

Motivation: Reconstruction-based anomaly detection methods often produce oversmoothed results that fail to highlight subtle defects, limiting segmentation accuracy. The paper aims to address this by incorporating frequency-aware regularization to better preserve high-frequency details.

Method: Proposes D3R-Net, a dual-domain framework that couples self-supervised ‘healing’ of synthetically corrupted normal images with frequency-aware regularization. Uses spatial MSE loss plus FFT magnitude loss for frequency domain consistency, with optional SSIM term.

Result: On MVTec AD Hazelnut benchmark, FFT loss improves PRO AUC from 0.603 to 0.687 while maintaining robust image-level ROC AUC. Across 15 MVTec categories, average pixel ROC AUC increases from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468, running at ~20 FPS on single GPU.

Conclusion: D3R-Net provides a practical lightweight alternative to heavy pre-trained feature embedding methods, demonstrating that frequency-aware regularization significantly improves anomaly localization in unsupervised visual inspection tasks.

Abstract: Unsupervised anomaly detection (UAD) is a key ingredient of automated visual inspection in modern manufacturing. The reconstruction-based methods appeal because they have basic architectural design and they process data quickly but they produce oversmoothed results for high-frequency details. As a result, subtle defects are partially reconstructed rather than highlighted, which limits segmentation accuracy. We build on this line of work and introduce D3R-Net, a Dual-Domain Denoising Reconstruction framework that couples a self-supervised ‘healing’ task with frequency-aware regularization. During training, the network receives synthetically corrupted normal images and is asked to reconstruct the clean targets, which prevents trivial identity mapping and pushes the model to learn the manifold of defect-free textures. In addition to the spatial mean squared error, we employ a Fast Fourier Transform (FFT) magnitude loss that encourages consistency in the frequency domain. The implementation also allows an optional structural similarity (SSIM) term, which we study in an ablation. On the MVTec AD Hazelnut benchmark, D3R-Net with the FFT loss improves localization consistency over a spatial-only baseline: PRO AUC increases from 0.603 to 0.687, while image-level ROC AUC remains robust. Evaluated across fifteen MVTec categories, the FFT variant raises the average pixel ROC AUC from 0.733 to 0.751 and PRO AUC from 0.417 to 0.468 compared to the MSE-only baseline, at roughly 20 FPS on a single GPU. The network is trained from scratch and uses a lightweight convolutional autoencoder backbone, providing a practical alternative to heavy pre-trained feature embedding methods.

[317] PovNet+: A Deep Learning Architecture for Socially Assistive Robots to Learn and Assist with Multiple Activities of Daily Living

Fraser Robinson, Souren Pashangpour, Matthew Lisondra, Goldie Nejat

Main category: cs.CV

TL;DR: POVNet+ is a multimodal deep learning architecture for multi-activity recognition in socially assistive robots, using ADL and motion embeddings to distinguish between known activities, unseen activities, and atypical performance, enabling proactive assistance.

Details

Motivation: Current socially assistive robots lack the ability to perceive and assist with multiple activities of daily living (ADLs), limiting their long-term deployment. There's a need for systems that can recognize both known and unseen ADLs, as well as detect when known ADLs are being performed atypically, to provide proactive assistance in real-world scenarios.

Method: Presents POVNet+, a multimodal deep learning architecture that introduces ADL and motion embedding spaces. Uses these embeddings to distinguish between: 1) known ADLs being performed normally, 2) new unseen ADLs, and 3) known ADLs being performed atypically. Applies a novel user state estimation method to the motion embedding space for recognizing new ADLs while monitoring user performance.

Result: POVNet+ achieves higher ADL classification accuracy compared to state-of-the-art human activity recognition methods. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia demonstrate successful identification of different seen/unseen ADLs and atypical ADL performance, with appropriate assistive interactions initiated.

Conclusion: POVNet+ represents a significant advancement for socially assistive robots, enabling them to perceive multiple ADLs (including unseen and atypical ones) and proactively initiate appropriate assistive behaviors, addressing a key barrier to long-term deployment of autonomous assistive robots.

Abstract: A significant barrier to the long-term deployment of autonomous socially assistive robots is their inability to both perceive and assist with multiple activities of daily living (ADLs). In this paper, we present the first multimodal deep learning architecture, POVNet+, for multi-activity recognition for socially assistive robots to proactively initiate assistive behaviors. Our novel architecture introduces the use of both ADL and motion embedding spaces to uniquely distinguish between a known ADL being performed, a new unseen ADL, or a known ADL being performed atypically in order to assist people in real scenarios. Furthermore, we apply a novel user state estimation method to the motion embedding space to recognize new ADLs while monitoring user performance. This ADL perception information is used to proactively initiate robot assistive interactions. Comparison experiments with state-of-the-art human activity recognition methods show our POVNet+ method has higher ADL classification accuracy. Human-robot interaction experiments in a cluttered living environment with multiple users and the socially assistive robot Leia using POVNet+ demonstrate the ability of our multi-modal ADL architecture in successfully identifying different seen and unseen ADLs, and ADLs being performed atypically, while initiating appropriate assistive human-robot interactions.

[318] SDCM: Simulated Densifying and Compensatory Modeling Fusion for Radar-Vision 3-D Object Detection in Internet of Vehicles

Shucong Li, Xiaoluo Zhou, Yuqian He, Zhenyu Liu

Main category: cs.CV

TL;DR: SDCM framework improves 3D object detection in IoV by densifying sparse 4D radar point clouds, compensating for vision degradation, and using Mamba-based interactive fusion.

Details

Motivation: Address challenges in 4D radar-vision 3D object detection: sparse radar point clouds leading to poor 3D representation, and vision data degradation under low-light, long-distance, and occlusion conditions.

Method: Three-module framework: 1) SimDen module generates dense radar point clouds using Gaussian simulation and curvature-based outline generation; 2) RCM module uses radar data to compensate for vision degradation; 3) MMIF module uses Mamba modeling for heterogeneous reduction and interactive fusion.

Result: Achieves best performance on VoD, TJ4DRadSet, and Astyx HiRes 2019 datasets with lower parameter quantity and faster inference speed compared to other methods.

Conclusion: SDCM effectively addresses sparse radar and degraded vision challenges in multimodal 3D object detection for IoV applications through simulated densifying, compensatory modeling, and Mamba-based interactive fusion.

Abstract: 3-D object detection based on 4-D radar-vision is an important part in Internet of Vehicles (IoV). However, there are two challenges which need to be faced. First, the 4-D radar point clouds are sparse, leading to poor 3-D representation. Second, vision datas exhibit representation degradation under low-light, long distance detection and dense occlusion scenes, which provides unreliable texture information during fusion stage. To address these issues, a framework named SDCM is proposed, which contains Simulated Densifying and Compensatory Modeling Fusion for radar-vision 3-D object detection in IoV. Firstly, considering point generation based on Gaussian simulation of key points obtained from 3-D Kernel Density Estimation (3-D KDE), and outline generation based on curvature simulation, Simulated Densifying (SimDen) module is designed to generate dense radar point clouds. Secondly, considering that radar data could provide more real time information than vision data, due to the all-weather property of 4-D radar. Radar Compensatory Mapping (RCM) module is designed to reduce the affects of vision datas’ representation degradation. Thirdly, considering that feature tensor difference values contain the effective information of every modality, which could be extracted and modeled for heterogeneity reduction and modalities interaction, Mamba Modeling Interactive Fusion (MMIF) module is designed for reducing heterogeneous and achieving interactive Fusion. Experiment results on the VoD, TJ4DRadSet and Astyx HiRes 2019 dataset show that SDCM achieves best performance with lower parameter quantity and faster inference speed. Our code will be available.

[319] Shedding the Facades, Connecting the Domains: Detecting Shifting Multimodal Hate Video with Test-Time Adaptation

Jiao Li, Jian Lang, Xikai Tang, Wenzheng Shu, Ting Zhong, Qiang Gao, Yong Wang, Leiting Chen, Fan Zhou

Main category: cs.CV

TL;DR: SCANNER is a Test-Time Adaptation framework for Hate Video Detection that addresses semantic drift in evolving hateful content by leveraging stable core concepts and using centroid-guided alignment with diversity regularization.

Details

Motivation: Hateful content evolves into irregular forms to evade censorship, causing semantic drift that makes trained models ineffective. Existing TTA methods target mild distribution shifts and struggle with severe semantic drift in hate video detection.

Method: SCANNER uses centroid-guided alignment to reveal stable cores from ambiguous hateful content, incorporates sample-level adaptive centroid alignment to handle outliers, and adds intra-cluster diversity regularization to prevent semantic collapse.

Result: SCANNER outperforms all baselines with an average gain of 4.69% in Macro-F1 over the best existing method.

Conclusion: The proposed framework effectively addresses severe semantic drift in hate video detection by leveraging invariant core concepts and adaptive alignment strategies.

Abstract: Hate Video Detection (HVD) is crucial for online ecosystems. Existing methods assume identical distributions between training (source) and inference (target) data. However, hateful content often evolves into irregular and ambiguous forms to evade censorship, resulting in substantial semantic drift and rendering previously trained models ineffective. Test-Time Adaptation (TTA) offers a solution by adapting models during inference to narrow the cross-domain gap, while conventional TTA methods target mild distribution shifts and struggle with the severe semantic drift in HVD. To tackle these challenges, we propose SCANNER, the first TTA framework tailored for HVD. Motivated by the insight that, despite the evolving nature of hateful manifestations, their underlying cores remain largely invariant (i.e., targeting is still based on characteristics like gender, race, etc), we leverage these stable cores as a bridge to connect the source and target domains. Specifically, SCANNER initially reveals the stable cores from the ambiguous layout in evolving hateful content via a principled centroid-guided alignment mechanism. To alleviate the impact of outlier-like samples that are weakly correlated with centroids during the alignment process, SCANNER enhances the prior by incorporating a sample-level adaptive centroid alignment strategy, promoting more stable adaptation. Furthermore, to mitigate semantic collapse from overly uniform outputs within clusters, SCANNER introduces an intra-cluster diversity regularization that encourages the cluster-wise semantic richness. Experiments show that SCANNER outperforms all baselines, with an average gain of 4.69% in Macro-F1 over the best.

[320] See Without Decoding: Motion-Vector-Based Tracking in Compressed Video

Axel Duché, Clément Chatelain, Gilles Gasso

Main category: cs.CV

TL;DR: Lightweight compressed-domain tracking model using motion vectors and transform coefficients from video streams without full RGB decoding, achieving 3.7x speed-up with minimal accuracy loss.

Details

Motivation: To enable real-time object tracking in large monitoring systems by reducing computational overhead through operating directly on compressed video data rather than requiring full RGB decoding.

Method: Uses motion vectors and transform coefficients from compressed video streams to propagate object bounding boxes across frames with a deep model, eliminating need for full RGB video decoding.

Result: Achieves computational speed-up of up to 3.7x with only 4% mAP@0.5 drop compared to RGB baseline on MOTS15/17/20 datasets.

Conclusion: Compressed-domain motion modeling is efficient for real-time analytics in large monitoring systems, offering significant speed improvements with minimal accuracy trade-offs.

Abstract: We propose a lightweight compressed-domain tracking model that operates directly on video streams, without requiring full RGB video decoding. Using motion vectors and transform coefficients from compressed data, our deep model propagates object bounding boxes across frames, achieving a computational speed-up of order up to 3.7 with only a slight 4% mAP@0.5 drop vs RGB baseline on MOTS15/17/20 datasets. These results highlight codec-domain motion modeling efficiency for real-time analytics in large monitoring systems.

[321] LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models

Pengcheng Zheng, Chaoning Zhang, Jiarong Mo, GuoHui Li, Jiaquan Zhang, Jiahao Zhang, Sihan Cao, Sheng Zheng, Caiyan Qin, Guoqing Wang, Yang Yang

Main category: cs.CV

TL;DR: LLaVA-FA: Joint low-rank plus quantization approximation in frequency domain for efficient multimodal models, using Fourier transform properties and polar-coordinate quantization for complex matrices.

Details

Motivation: Large multimodal models have high computational and memory costs hindering deployment. Existing compression methods decouple low-rank decomposition and quantization, causing compounded errors, especially with cross-modal redundancy in multimodal architectures.

Method: Proposes LLaVA-FA that performs joint low-rank plus quantization approximation in frequency domain using Fourier transform properties. Introduces PolarQuant for complex matrices and optional diagonal calibration scheme without large-scale calibration data.

Result: Outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs.

Conclusion: LLaVA-FA is an effective solution for compressing large multimodal models through frequency-domain joint approximation and specialized quantization techniques.

Abstract: Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale calibration data. Extensive experimental results demonstrate that our proposed LLaVA-FA outperforms existing efficient multimodal models across multiple benchmarks while maintaining minimal activated parameters and low computational costs, validating its effectiveness as a powerful solution for compressing LMMs.

[322] Development of a Cacao Disease Identification and Management App Using Deep Learning

Zaldy Pagaduan, Jason Occidental, Nathaniel Duro, Dexielito Badilles, Eleonor Palconit

Main category: cs.CV

TL;DR: Mobile app with deep learning model for offline cacao disease identification, achieving 96.93% accuracy for disease detection and 79.49% for infection level assessment.

Details

Motivation: Smallholder cacao farmers in the Philippines lack access to agricultural data and expertise, facing challenges from pests and diseases without proper diagnostic tools, especially in remote areas with limited connectivity.

Method: Developed a mobile application with integrated deep learning model for offline cacao disease identification. The system includes disease detection and infection level assessment models trained on cacao disease data.

Result: Disease identification model achieved 96.93% validation accuracy, infection level detection model achieved 79.49% accuracy. Field testing showed 84.2% agreement rate with expert assessments.

Conclusion: The offline mobile app with deep learning provides accessible technology for smallholder farmers to improve cacao crop health and productivity through accurate disease identification.

Abstract: Smallholder cacao producers often rely on outdated farming techniques and face significant challenges from pests and diseases, unlike larger plantations with more resources and expertise. In the Philippines, cacao farmers have limited access to data, information, and good agricultural practices. This study addresses these issues by developing a mobile application for cacao disease identification and management that functions offline, enabling use in remote areas where farms are mostly located. The core of the system is a deep learning model trained to identify cacao diseases accurately. The trained model is integrated into the mobile app to support farmers in field diagnosis. The disease identification model achieved a validation accuracy of 96.93% while the model for detecting cacao black pod infection levels achieved 79.49% validation accuracy. Field testing of the application showed an agreement rate of 84.2% compared with expert cacao technician assessments. This approach empowers smallholder farmers by providing accessible, technology-enabled tools to improve cacao crop health and productivity.

[323] Scalable Analytic Classifiers with Associative Drift Compensation for Class-Incremental Learning of Vision Transformers

Xuan Rao, Mingming Ha, Bo Zhao, Derong Liu, Cesare Alippi

Main category: cs.CV

TL;DR: LR-RGDA with HopDC: A scalable class-incremental learning framework for Vision Transformers that combines low-rank factorized regularized Gaussian discriminant analysis with Hopfield-based distribution compensation to reduce computational complexity while maintaining accuracy.

Details

Motivation: Class-incremental learning with Vision Transformers faces computational bottlenecks during classifier reconstruction, as existing methods rely on costly iterative SGD. While analytic RGDA offers Bayes-optimal alternatives with comparable accuracy, its quadratic inference complexity limits scalability for large-scale CIL scenarios.

Method: Proposes Low-Rank Factorized RGDA (LR-RGDA) that exploits low-rank structure of covariance via Woodbury matrix identity to decompose discriminant function into global affine term plus low-rank quadratic perturbation, reducing inference complexity from O(Cd²) to O(d² + Crd²). Also introduces Hopfield-based Distribution Compensator (HopDC) - a training-free mechanism using continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, with theoretical error bound.

Result: Extensive experiments on diverse CIL benchmarks demonstrate state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with Vision Transformers.

Conclusion: The framework achieves efficient and accurate class-incremental learning for Vision Transformers by combining the expressivity of RGDA with linear classifier efficiency through low-rank factorization, while addressing representation drift via Hopfield-based compensation.

Abstract: Class-incremental learning (CIL) with Vision Transformers (ViTs) faces a major computational bottleneck during the classifier reconstruction phase, where most existing methods rely on costly iterative stochastic gradient descent (SGD). We observe that analytic Regularized Gaussian Discriminant Analysis (RGDA) provides a Bayes-optimal alternative with accuracy comparable to SGD-based classifiers; however, its quadratic inference complexity limits its use in large-scale CIL scenarios. To overcome this, we propose Low-Rank Factorized RGDA (LR-RGDA), a scalable classifier that combines RGDA’s expressivity with the efficiency of linear classifiers. By exploiting the low-rank structure of the covariance via the Woodbury matrix identity, LR-RGDA decomposes the discriminant function into a global affine term refined by a low-rank quadratic perturbation, reducing the inference complexity from $\mathcal{O}(Cd^2)$ to $\mathcal{O}(d^2 + Crd^2)$, where $C$ is the class number, $d$ the feature dimension, and $r \ll d$ the subspace rank. To mitigate representation drift caused by backbone updates, we further introduce Hopfield-based Distribution Compensator (HopDC), a training-free mechanism that uses modern continuous Hopfield Networks to recalibrate historical class statistics through associative memory dynamics on unlabeled anchors, accompanied by a theoretical bound on the estimation error. Extensive experiments on diverse CIL benchmarks demonstrate that our framework achieves state-of-the-art performance, providing a scalable solution for large-scale class-incremental learning with ViTs. Code: https://github.com/raoxuan98-hash/lr_rgda_hopdc.

[324] DensiThAI, A Multi-View Deep Learning Framework for Breast Density Estimation using Infrared Images

Siva Teja Kakileti, Geetha Manjunath

Main category: cs.CV

TL;DR: DensiThAI: A multi-view deep learning framework that estimates breast density from thermal images as a non-ionizing alternative to mammography.

Details

Motivation: Current breast density assessment relies on X-ray mammography which uses ionizing radiation. The authors propose using thermal imaging as a non-ionizing alternative, hypothesizing that different breast tissues exhibit distinct thermophysical properties that can be detected through temperature variations.

Method: Proposed DensiThAI, a multi-view deep learning framework for breast density classification from thermal images. Uses five standard thermal views and mammography-derived density labels as reference for training and evaluation on a multi-center dataset of 3,500 women.

Result: Achieved mean AUROC of 0.73 across 10 random splits with statistically significant separation between density classes (p « 0.05). Consistent performance across age cohorts demonstrates feasibility.

Conclusion: Thermal imaging shows potential as a non-ionizing approach for breast density assessment, offering improved patient experience and workflow optimization compared to traditional mammography.

Abstract: Breast tissue density is a key biomarker of breast cancer risk and a major factor affecting mammographic sensitivity. However, density assessment currently relies almost exclusively on X-ray mammography, an ionizing imaging modality. This study investigates the feasibility of estimating breast density using artificial intelligence over infrared thermal images, offering a non-ionizing imaging approach. The underlying hypothesis is that fibroglandular and adipose tissues exhibit distinct thermophysical and physiological properties, leading to subtle but spatially coherent temperature variations on the breast surface. In this paper, we propose DensiThAI, a multi-view deep learning framework for breast density classification from thermal images. The framework was evaluated on a multi-center dataset of 3,500 women using mammography-derived density labels as reference. Using five standard thermal views, DensiThAI achieved a mean AUROC of 0.73 across 10 random splits, with statistically significant separation between density classes across all splits (p « 0.05). Consistent performance across age cohorts supports the potential of thermal imaging as a non-ionizing approach for breast density assessment with implications for improved patient experience and workflow optimization.

[325] Learning Physics-Grounded 4D Dynamics with Neural Gaussian Force Fields

Shiqian Li, Ruihong Shen, Junfeng Ni, Chang Pan, Chi Zhang, Yixin Zhu

Main category: cs.CV

TL;DR: NGFF is an end-to-end neural framework that combines 3D Gaussian perception with physics-based modeling to generate physically realistic 4D videos from multi-view RGB inputs, achieving 100x speedup over prior Gaussian simulators.

Details

Motivation: Current video generation models lack physical plausibility, while physics-based approaches using 3D Gaussian splatting are computationally expensive and not robust in complex real-world scenarios. There's a need for efficient, physically-grounded video prediction.

Method: Neural Gaussian Force Field (NGFF) integrates 3D Gaussian perception with physics-based dynamic modeling in an end-to-end neural framework. Uses GSCollision dataset with 640k rendered physical videos for training, featuring diverse materials and multi-object interactions.

Result: NGFF achieves two orders of magnitude faster generation than prior Gaussian simulators, shows strong generalization and robustness in physical reasoning on synthetic and real 3D scenarios, and advances video prediction towards physics-grounded world models.

Conclusion: NGFF successfully bridges the gap between visual quality and physical plausibility in video generation, providing an efficient framework for physically realistic 4D video prediction from multi-view inputs.

Abstract: Predicting physical dynamics from raw visual data remains a major challenge in AI. While recent video generation models have achieved impressive visual quality, they still cannot consistently generate physically plausible videos due to a lack of modeling of physical laws. Recent approaches combining 3D Gaussian splatting and physics engines can produce physically plausible videos, but are hindered by high computational costs in both reconstruction and simulation, and often lack robustness in complex real-world scenarios. To address these issues, we introduce Neural Gaussian Force Field (NGFF), an end-to-end neural framework that integrates 3D Gaussian perception with physics-based dynamic modeling to generate interactive, physically realistic 4D videos from multi-view RGB inputs, achieving two orders of magnitude faster than prior Gaussian simulators. To support training, we also present GSCollision, a 4D Gaussian dataset featuring diverse materials, multi-object interactions, and complex scenes, totaling over 640k rendered physical videos (~4 TB). Evaluations on synthetic and real 3D scenarios show NGFF’s strong generalization and robustness in physical reasoning, advancing video prediction towards physics-grounded world models.

[326] Combined Flicker-banding and Moire Removal for Screen-Captured Images

Libo Zhu, Zihan Zhou, Zhiyi Zhou, Yiyang Qu, Weihang Zhang, Keyu Shi, Yifan Fu, Yulun Zhang

Main category: cs.CV

TL;DR: CLEAR: A unified framework for joint removal of moiré patterns and flicker-banding artifacts in screen-captured images using frequency-domain decomposition and trajectory alignment.

Details

Motivation: Mobile device screen captures suffer from severe degradations due to coexisting moiré patterns and flicker-banding artifacts, which existing single-degradation methods fail to address effectively in compound scenarios.

Method: Proposes CLEAR framework with frequency-domain decomposition and re-composition module, trajectory alignment loss, and uses an ISP-based flicker simulation pipeline for training stabilization and degradation distribution expansion.

Result: Extensive experiments show CLEAR consistently outperforms existing image restoration approaches across multiple evaluation metrics in complex real-world scenarios.

Conclusion: First systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, with proposed CLEAR framework effectively addressing compound artifacts through novel frequency-domain modeling and training techniques.

Abstract: Capturing display screens with mobile devices has become increasingly common, yet the resulting images often suffer from severe degradations caused by the coexistence of moiré patterns and flicker-banding, leading to significant visual quality degradation. Due to the strong coupling of these two artifacts in real imaging processes, existing methods designed for single degradations fail to generalize to such compound scenarios. In this paper, we present the first systematic study on joint removal of moiré patterns and flicker-banding in screen-captured images, and propose a unified restoration framework, named CLEAR. To support this task, we construct a large-scale dataset containing both moiré patterns and flicker-banding, and introduce an ISP-based flicker simulation pipeline to stabilize model training and expand the degradation distribution. Furthermore, we design a frequency-domain decomposition and re-composition module together with a trajectory alignment loss to enhance the modeling of compound artifacts. Extensive experiments demonstrate that the proposed method consistently. outperforms existing image restoration approaches across multiple evaluation metrics, validating its effectiveness in complex real-world scenarios.

[327] Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency

Alexander Blezinger, Wolfgang Nejdl, Ming Tang

Main category: cs.CV

TL;DR: Systematic evaluation of histopathology foundation models for regression-based biomarker prediction (HRD scores) across multiple cancer types, showing improved performance over contrastive learning baselines and proposing distribution-based upsampling for target imbalance.

Details

Motivation: While foundation models pretrained on large-scale histopathology data have shown success in various computational pathology tasks, their impact on regressive biomarker prediction remains underexplored. The authors aim to evaluate these models for predicting continuous biomarker scores like homologous recombination deficiency (HRD), which is critical for personalized cancer treatment.

Method: Used multiple instance learning frameworks with five state-of-the-art histopathology foundation models to extract patch-level features from whole slide images. Compared these against contrastive learning-based features across breast, endometrial, and lung cancer cohorts from two public datasets. Proposed a distribution-based upsampling strategy to address target imbalance, and conducted ablation studies on different sampling strategies and instance bag sizes.

Result: Models trained on foundation model features consistently outperformed baseline contrastive learning features in predictive accuracy and generalization across cancer types. Systematic differences were observed among different foundation models. The proposed distribution-based upsampling significantly improved recall and balanced accuracy for underrepresented patient populations.

Conclusion: Large-scale histopathological pretraining provides benefits for more precise and transferable regressive biomarker prediction, demonstrating potential to advance AI-driven precision oncology through improved biomarker prediction capabilities.

Abstract: Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bagsizes by ablation studies. Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.

[328] Real-Time Human Activity Recognition on Edge Microcontrollers: Dynamic Hierarchical Inference with Multi-Spectral Sensor Fusion

Boyu Li, Kuangji Zuo, Lincong Li, Yonghui Wu

Main category: cs.CV

TL;DR: HPPI-Net: A resource-aware hierarchical network for real-time on-device Human Activity Recognition using multi-spectral fusion and interpretable modules, achieving high accuracy with minimal memory footprint on ARM Cortex-M4.

Details

Motivation: The need for accurate on-device pattern recognition in edge applications that balances accuracy with computational constraints, particularly for Human Activity Recognition in wearable, industrial, and smart home applications.

Method: Two-layer hierarchical architecture: first layer extracts preliminary features using FFT spectrograms; second layer selectively activates either stationary activity module or parallel LSTM-MobileNet network (PLMN) for dynamic states. PLMN fuses FFT, Wavelet, and Gabor spectrograms through parallel LSTM encoders and refines features using Efficient Channel Attention and Depthwise Separable Convolution.

Result: Achieves 96.70% accuracy with only 22.3 KiB RAM and 439.5 KiB ROM on ARM Cortex-M4. Compared to MobileNetV3: improves accuracy by 1.22%, reduces RAM usage by 71.2%, and reduces ROM usage by 42.1%.

Conclusion: HPPI-Net achieves favorable accuracy-efficiency trade-off with explainable predictions, establishing a practical solution for memory-constrained edge platforms in wearable, industrial, and smart home HAR applications.

Abstract: The demand for accurate on-device pattern recognition in edge applications is intensifying, yet existing approaches struggle to reconcile accuracy with computational constraints. To address this challenge, a resource-aware hierarchical network based on multi-spectral fusion and interpretable modules, namely the Hierarchical Parallel Pseudo-image Enhancement Fusion Network (HPPI-Net), is proposed for real-time, on-device Human Activity Recognition (HAR). Deployed on an ARM Cortex-M4 microcontroller for low-power real-time inference, HPPI-Net achieves 96.70% accuracy while utilizing only 22.3 KiB of RAM and 439.5 KiB of ROM after optimization. HPPI-Net employs a two-layer architecture. The first layer extracts preliminary features using Fast Fourier Transform (FFT) spectrograms, while the second layer selectively activates either a dedicated module for stationary activity recognition or a parallel LSTM-MobileNet network (PLMN) for dynamic states. PLMN fuses FFT, Wavelet, and Gabor spectrograms through three parallel LSTM encoders and refines the concatenated features using Efficient Channel Attention (ECA) and Depthwise Separable Convolution (DSC), thereby offering channel-level interpretability while substantially reducing multiply-accumulate operations. Compared with MobileNetV3, HPPI-Net improves accuracy by 1.22% and reduces RAM usage by 71.2% and ROM usage by 42.1%. These results demonstrate that HPPI-Net achieves a favorable accuracy-efficiency trade-off and provides explainable predictions, establishing a practical solution for wearable, industrial, and smart home HAR on memory-constrained edge platforms.

[329] Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

Laura Cif, Diane Demailly, Gabriella A. Horvàth, Juan Dario Ortigoza Escobar, Nathalie Dorison, Mayté Castro Jiménez, Cécile A. Hubsch, Thomas Wirth, Gun-Marie Hariz, Sophie Huby, Morgan Dornadic, Zohra Souei, Muhammad Mushhood Ur Rehman, Simone Hemm, Mehdi Boulayme, Eduardo M. Moraud, Jocelyne Bloch, Xavier Vasques

Main category: cs.CV

TL;DR: A pose-based ML framework that converts clinical videos into keypoint time series to objectively analyze hyperkinetic movement disorders using kinematic features

Details

Motivation: Hyperkinetic movement disorders are difficult to diagnose and monitor due to their fluctuating nature and overlapping symptoms, with current clinical assessments being subjective and prone to inter-rater variability

Method: Developed a pose-based machine learning framework that converts outpatient clinical videos into anatomically meaningful keypoint time series, then computes kinematic descriptors including statistical, temporal, spectral, and higher-order irregularity-complexity features

Result: The method provides objective and scalable analysis of movement disorders from routine clinical videos, addressing the limitations of subjective clinical assessments

Conclusion: This framework offers a promising approach for objective clinical recognition and longitudinal monitoring of hyperkinetic movement disorders using standard video data

Abstract: Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.

[330] YOLOE-26: Integrating YOLO26 with YOLOE for Real-Time Open-Vocabulary Instance Segmentation

Ranjan Sapkota, Manoj Karkee

Main category: cs.CV

TL;DR: YOLOE-26 is a unified framework combining YOLOv26’s efficient architecture with open-vocabulary learning for real-time instance segmentation, enabling text-prompted, visual-prompted, or autonomous segmentation.

Details

Motivation: To extend YOLO's efficient real-time object detection capabilities beyond closed-set recognition to open-vocabulary instance segmentation, enabling dynamic, real-world applications where predefined categories are insufficient.

Method: Integrates YOLOv26’s NMS-free, end-to-end design with open-vocabulary learning via object embedding head for similarity matching against text/visual prompts. Uses RepRTA for zero-overhead text prompting, SAVPE for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference.

Result: Demonstrates consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings, providing practical real-time open-vocabulary instance segmentation.

Conclusion: YOLOE-26 offers a scalable, practical solution for real-time open-vocabulary instance segmentation that maintains YOLO’s efficiency while extending capabilities to dynamic real-world environments.

Abstract: This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26(or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.

[331] Intra-Class Subdivision for Pixel Contrastive Learning: Application to Semi-supervised Cardiac Image Segmentation

Jiajun Zhao, Xuan Yang

Main category: cs.CV

TL;DR: SPCL framework uses intra-class subdivision pixel contrastive learning with “unconcerned samples” to improve cardiac image segmentation by addressing boundary representation contamination.

Details

Motivation: Address representation contamination at boundaries in cardiac image segmentation, where pixel representations at boundaries can be ambiguous and affect segmentation quality.

Method: Proposes intra-class subdivision pixel contrastive learning with “unconcerned samples” to distinguish inner vs boundary pixel representations within same class, plus boundary contrastive loss to enhance discrimination across boundaries.

Result: SPCL significantly improves segmentation performance on public cardiac datasets, outperforming existing methods in both segmentation quality and boundary precision.

Conclusion: The SPCL framework effectively addresses boundary representation issues in cardiac segmentation through novel contrastive learning techniques, leading to improved performance.

Abstract: We propose an intra-class subdivision pixel contrastive learning (SPCL) framework for cardiac image segmentation to address representation contamination at boundaries. The novel concept ``Unconcerned sample’’ is proposed to distinguish pixel representations at the inner and boundary regions within the same class, facilitating a clearer characterization of intra-class variations. A novel boundary contrastive loss for boundary representations is proposed to enhance representation discrimination across boundaries. The advantages of the unconcerned sample and boundary contrastive loss are analyzed theoretically. Experimental results in public cardiac datasets demonstrate that SPCL significantly improves segmentation performance, outperforming existing methods with respect to segmentation quality and boundary precision. Our code is available at https://github.com/Jrstud203/SPCL.

[332] Stabilizing Diffusion Posterior Sampling by Noise–Frequency Continuation

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

Main category: cs.CV

TL;DR: A noise-frequency continuation framework for diffusion-based inverse problems that enforces measurement consistency only within noise-dependent frequency bands, improving detail recovery and reducing artifacts.

Details

Motivation: Current diffusion posterior sampling methods for inverse problems often fail to recover fine details because measurement consistency guidance is weakly coupled to diffusion noise levels, causing early-step drift, spurious artifacts, and sensitivity to schedules/operators.

Method: Proposes a noise-frequency continuation framework that constructs intermediate posteriors with likelihood enforcing measurement consistency only within noise-dependent frequency bands. Uses stabilized posterior sampler combining diffusion predictor, band-limited likelihood guidance, and multi-resolution consistency strategy.

Result: Achieves state-of-the-art performance across super-resolution, inpainting, and deblurring tasks, improving motion deblurring PSNR by up to 5 dB over strong baselines.

Conclusion: The noise-frequency continuation framework effectively addresses limitations of current diffusion posterior sampling by better coupling measurement consistency with diffusion noise levels, enabling more reliable detail recovery in inverse problems.

Abstract: Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance, but it often fails to recover fine details because measurement terms are applied in a manner that is weakly coupled to the diffusion noise level. At high noise, data-consistency gradients computed from inaccurate estimates can be geometrically incongruent with the posterior geometry, inducing early-step drift, spurious high-frequency artifacts, plus sensitivity to schedules and ill-conditioned operators. To address these concerns, we propose a noise–frequency Continuation framework that constructs a continuous family of intermediate posteriors whose likelihood enforces measurement consistency only within a noise-dependent frequency band. This principle is instantiated with a stabilized posterior sampler that combines a diffusion predictor, band-limited likelihood guidance, and a multi-resolution consistency strategy that aggressively commits reliable coarse corrections while conservatively adopting high-frequency details only when they become identifiable. Across super-resolution, inpainting, and deblurring, our method achieves state-of-the-art performance and improves motion deblurring PSNR by up to 5 dB over strong baselines.

[333] CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

Hang Wu, Yujun Cai, Zehao Li, Haonan Ge, Bowen Sun, Junsong Yuan, Yiwei Wang

Main category: cs.CV

TL;DR: CamReasoner is a framework that reformulates camera movement understanding as a structured inference process using an Observation-Thinking-Answer paradigm with RL-based logical alignment to ground motion inferences in physical geometry rather than contextual guesswork.

Details

Motivation: Existing multimodal models treat camera dynamics as black-box classification, confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. There's a need to bridge the gap between perception and cinematic logic through structured reasoning.

Method: Uses Observation-Thinking-Answer (O-T-A) paradigm to decode spatio-temporal cues like trajectories and view frustums. Constructs Large-scale Inference Trajectory Suite with 18k SFT reasoning chains and 38k RL feedback samples. First to employ RL for logical alignment in this domain to ensure motion inferences are grounded in physical geometry.

Result: Achieves state-of-the-art performance across multiple benchmarks. Effectively suppresses hallucinations by applying Reinforcement Learning to the O-T-A reasoning paradigm.

Conclusion: CamReasoner successfully bridges perception and cinematic logic through structured inference, demonstrating that explicit reasoning blocks and RL-based alignment can significantly improve camera movement understanding in multimodal models.

Abstract: Understanding camera dynamics is a fundamental pillar of video spatial intelligence. However, existing multimodal models predominantly treat this task as a black-box classification, often confusing physically distinct motions by relying on superficial visual patterns rather than geometric cues. We present CamReasoner, a framework that reformulates camera movement understanding as a structured inference process to bridge the gap between perception and cinematic logic. Our approach centers on the Observation-Thinking-Answer (O-T-A) paradigm, which compels the model to decode spatio-temporal cues such as trajectories and view frustums within an explicit reasoning block. To instill this capability, we construct a Large-scale Inference Trajectory Suite comprising 18k SFT reasoning chains and 38k RL feedback samples. Notably, we are the first to employ RL for logical alignment in this domain, ensuring motion inferences are grounded in physical geometry rather than contextual guesswork. By applying Reinforcement Learning to the Observation-Think-Answer (O-T-A) reasoning paradigm, CamReasoner effectively suppresses hallucinations and achieves state-of-the-art performance across multiple benchmarks.

[334] AI-Generated Image Detectors Overrely on Global Artifacts: Evidence from Inpainting Exchange

Elif Nebioglu, Emirhan Bilgiç, Adrian Popescu

Main category: cs.CV

TL;DR: INP-X reveals that current inpainting detectors rely on global VAE artifacts rather than local synthesized content, showing detectors fail when artifacts are removed while keeping synthesized regions intact.

Details

Motivation: Current inpainting detectors focus on global VAE-induced artifacts rather than actual synthesized content, making them vulnerable to attacks that preserve synthesized regions while removing these artifacts.

Method: Introduces Inpainting Exchange (INP-X) operation that restores original pixels outside edited regions while preserving synthesized content. Creates 90K test dataset with real, inpainted, and exchanged images to evaluate detector behavior.

Result: State-of-the-art detectors show dramatic accuracy drops (e.g., 91% to 55%) when tested on INP-X images, often approaching chance level. Training on INP-X dataset yields better generalization and localization than standard inpainting detection.

Conclusion: Current inpainting detectors are fundamentally flawed as they rely on global VAE artifacts rather than local content synthesis. Content-aware detection is needed, and the INP-X dataset enables more robust detector training.

Abstract: Modern deep learning-based inpainting enables realistic local image manipulation, raising critical challenges for reliable detection. However, we observe that current detectors primarily rely on global artifacts that appear as inpainting side effects, rather than on locally synthesized content. We show that this behavior occurs because VAE-based reconstruction induces a subtle but pervasive spectral shift across the entire image, including unedited regions. To isolate this effect, we introduce Inpainting Exchange (INP-X), an operation that restores original pixels outside the edited region while preserving all synthesized content. We create a 90K test dataset including real, inpainted, and exchanged images to evaluate this phenomenon. Under this intervention, pretrained state-of-the-art detectors, including commercial ones, exhibit a dramatic drop in accuracy (e.g., from 91% to 55%), frequently approaching chance level. We provide a theoretical analysis linking this behavior to high-frequency attenuation caused by VAE information bottlenecks. Our findings highlight the need for content-aware detection. Indeed, training on our dataset yields better generalization and localization than standard inpainting. Our dataset and code are publicly available at https://github.com/emirhanbilgic/INP-X.

[335] Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images

Shanwen Wang, Xin Sun, Danfeng Hong, Fei Zhou

Main category: cs.CV

TL;DR: SemiEarth introduces vision-language models to improve pseudo-label quality in semi-supervised semantic segmentation for remote sensing images, achieving state-of-the-art performance with better interpretability.

Details

Motivation: Traditional semi-supervised semantic segmentation (S4) methods suffer from low-quality pseudo-labels, especially in teacher-student frameworks. This is particularly challenging in remote sensing (RS) domain where multi-class boundary regions are complex. The authors aim to leverage vision-language models (VLMs) to address these issues by purifying pseudo-labels.

Method: Proposes SemiEarth with a VLM pseudo-label purifying (VLM-PP) structure that uses vision-language models to purify teacher network’s pseudo-labels. The VLM-PP module corrects mispredicted categories in low-confidence pseudo-labels when discrepancies arise between VLM predictions and pseudo-labels. The approach is architecture-independent and leverages VLMs’ open-world capabilities.

Result: Extensive experiments on multiple RS datasets demonstrate that SemiEarth achieves state-of-the-art (SOTA) performance. The model shows significant improvements in multi-class boundary regions and offers good interpretability compared to previous SOTA RS S4 methods.

Conclusion: SemiEarth successfully integrates vision-language models into semi-supervised semantic segmentation for remote sensing, addressing pseudo-label quality issues and achieving superior performance with interpretability benefits.

Abstract: The semi-supervised semantic segmentation (S4) can learn rich visual knowledge from low-cost unlabeled images. However, traditional S4 architectures all face the challenge of low-quality pseudo-labels, especially for the teacher-student framework.We propose a novel SemiEarth model that introduces vision-language models (VLMs) to address the S4 issues for the remote sensing (RS) domain. Specifically, we invent a VLM pseudo-label purifying (VLM-PP) structure to purify the teacher network’s pseudo-labels, achieving substantial improvements. Especially in multi-class boundary regions of RS images, the VLM-PP module can significantly improve the quality of pseudo-labels generated by the teacher, thereby correctly guiding the student model’s learning. Moreover, since VLM-PP equips VLMs with open-world capabilities and is independent of the S4 architecture, it can correct mispredicted categories in low-confidence pseudo-labels whenever a discrepancy arises between its prediction and the pseudo-label. We conducted extensive experiments on multiple RS datasets, which demonstrate that our SemiEarth achieves SOTA performance. More importantly, unlike previous SOTA RS S4 methods, our model not only achieves excellent performance but also offers good interpretability. The code is released at https://github.com/wangshanwen001/SemiEarth.

[336] Deep Transformer Network for Monocular Pose Estimation of Shipborne Unmanned Aerial Vehicle

Maneesha Wickramasuriya, Taeyoung Lee, Murray Snyder

Main category: cs.CV

TL;DR: Transformer-based 6D pose estimation for UAV relative to ships using monocular images, with Bayesian fusion of multiple ship part keypoints.

Details

Motivation: Enable autonomous UAV landing and navigation on ships by accurately estimating relative 6D pose using monocular vision, which is challenging due to varying lighting and ship motion.

Method: Create synthetic ship dataset with 2D keypoint annotations for multiple ship parts. Train Transformer Neural Network to detect keypoints and estimate 6D pose per part, then integrate estimates using Bayesian fusion.

Result: Achieved position errors of ~0.8% of distance on synthetic data and ~1.0% on flight experiments, demonstrating robustness across lighting conditions.

Conclusion: Transformer-based approach with Bayesian fusion enables accurate 6D pose estimation for ship-UAV relative positioning, supporting autonomous landing applications.

Abstract: This paper introduces a deep transformer network for estimating the relative 6D pose of a Unmanned Aerial Vehicle (UAV) with respect to a ship using monocular images. A synthetic dataset of ship images is created and annotated with 2D keypoints of multiple ship parts. A Transformer Neural Network model is trained to detect these keypoints and estimate the 6D pose of each part. The estimates are integrated using Bayesian fusion. The model is tested on synthetic data and in-situ flight experiments, demonstrating robustness and accuracy in various lighting conditions. The position estimation error is approximately 0.8% and 1.0% of the distance to the ship for the synthetic data and the flight experiments, respectively. The method has potential applications for ship-based autonomous UAV landing and navigation.

[337] Interpretable Unsupervised Deformable Image Registration via Confidence-bound Multi-Hop Visual Reasoning

Zafar Iqbal, Anwar Ul Haq, Srimannarayana Grandhi

Main category: cs.CV

TL;DR: VCoR: Multi-hop visual reasoning framework for unsupervised medical image registration that provides interpretability through progressive refinement with theoretical bounds and uncertainty estimation.

Details

Motivation: Existing deep learning methods for unsupervised deformable image registration lack transparency and interpretability, leading to error drift and reduced clinical trust. There's a need for registration methods that are not only accurate but also provide explainable intermediate steps and confidence measures.

Method: Proposes Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as progressive reasoning. Each hop integrates Localized Spatial Refinement (LSR) module for feature enrichment and Cross-Reference Attention (CRA) mechanism for iterative refinement. Uses multi-hop strategy to handle large deformations with theoretical bounds on convergence.

Result: Achieves competitive registration accuracy on DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain) datasets. Provides rich intermediate visualizations, confidence measures, and uncertainty estimation via deformation field stability across hops.

Conclusion: VCoR presents an interpretable, reliable, and clinically viable unsupervised medical image registration framework by embedding implicit visual reasoning paradigm with built-in transparency and uncertainty estimation.

Abstract: Unsupervised deformable image registration requires aligning complex anatomical structures without reference labels, making interpretability and reliability critical. Existing deep learning methods achieve considerable accuracy but often lack transparency, leading to error drift and reduced clinical trust. We propose a novel Multi-Hop Visual Chain of Reasoning (VCoR) framework that reformulates registration as a progressive reasoning process. Inspired by the iterative nature of clinical decision-making, each visual reasoning hop integrates a Localized Spatial Refinement (LSR) module to enrich feature representations and a Cross-Reference Attention (CRA) mechanism that leads the iterative refinement process, preserving anatomical consistency. This multi-hop strategy enables robust handling of large deformations and produces a transparent sequence of intermediate predictions with a theoretical bound. Beyond accuracy, our framework offers built-in interpretability by estimating uncertainty via the stability and convergence of deformation fields across hops. Extensive evaluations on two challenging public datasets, DIR-Lab 4D CT (lung) and IXI T1-weighted MRI (brain), demonstrate that VCoR achieves competitive registration accuracy while offering rich intermediate visualizations and confidence measures. By embedding an implicit visual reasoning paradigm, we present an interpretable, reliable, and clinically viable unsupervised medical image registration.

[338] Deep Learning Based CNN Model for Automated Detection of Pneumonia from Chest XRay Images

Sathish Krishna Anumula, Vetrivelan Tamilmani, Aniruddha Arjun Singh, Dinesh Rajendran, Venkata Deepak Namburi

Main category: cs.CV

TL;DR: A custom CNN model for pneumonia detection in chest X-rays using depthwise separable convolutions optimized for grayscale medical images, with CLAHE and geometric augmentation preprocessing.

Details

Motivation: Pneumonia causes high morbidity/mortality, especially in pediatric/elderly populations in resource-limited areas. Traditional manual chest X-ray interpretation suffers from inter-observer variation, expert fatigue, and radiologist shortages, requiring automated solutions.

Method: Custom CNN architecture with depthwise separable convolutional design optimized for grayscale medical images. Uses CLAHE (Contrast Limited Adaptive Histogram Equalization) and geometric augmentation for preprocessing to handle class imbalance and improve generalization.

Result: Tested on dataset of 5,863 anterior-posterior chest X-rays, achieving high precision pneumonia detection with minimal computational expense compared to generic transfer learning models with redundant parameters.

Conclusion: The proposed unified automated diagnostic model provides fast, precise pneumonia detection in chest X-rays, addressing limitations of manual interpretation while being computationally efficient.

Abstract: Pneumonia has been one of the major causes of morbidities and mortality in the world and the prevalence of this disease is disproportionately high among the pediatric and elderly populations especially in resources trained areas Fast and precise diagnosis is a prerequisite for successful clinical intervention but due to inter observer variation fatigue among experts and a shortage of qualified radiologists traditional approaches that rely on manual interpretation of chest radiographs are frequently constrained To address these problems this paper introduces a unified automated diagnostic model using a custom Convolutional Neural Network CNN that can recognize pneumonia in chest Xray images with high precision and at minimal computational expense In contrast like other generic transfer learning based models which often possess redundant parameters the offered architecture uses a tailor made depth wise separable convolutional design which is optimized towards textural characteristics of grayscale medical images Contrast Limited Adaptive Histogram Equalization CLAHE and geometric augmentation are two significant preprocessing techniques used to ensure that the system does not experience class imbalance and is more likely to generalize The system is tested using a dataset of 5863 anterior posterior chest Xrays.

[339] Diffusion-based Layer-wise Semantic Reconstruction for Unsupervised Out-of-Distribution Detection

Ying Yang, De Cheng, Chaowei Fang, Yubiao Wang, Changzhe Jiao, Lechao Cheng, Nannan Wang

Main category: cs.CV

TL;DR: Proposes a diffusion-based layer-wise semantic reconstruction approach for unsupervised OOD detection using diffusion models’ reconstruction ability in latent feature space with multi-layer feature extraction.

Details

Motivation: Current reconstruction-based OOD detection methods face a dilemma: improving reconstruction power while keeping compact ID data representation. Need better approaches for safe real-world ML systems.

Method: Uses diffusion models’ intrinsic data reconstruction ability to distinguish ID from OOD samples in latent feature space. Employs multi-layer semantic feature extraction strategy, distorts features with Gaussian noise, and applies diffusion model for feature reconstruction.

Result: Achieves state-of-the-art performance on multiple benchmarks in terms of detection accuracy and speed.

Conclusion: The diffusion-based layer-wise semantic reconstruction approach effectively addresses the reconstruction power vs. compact representation dilemma for unsupervised OOD detection.

Abstract: Unsupervised out-of-distribution (OOD) detection aims to identify out-of-domain data by learning only from unlabeled In-Distribution (ID) training samples, which is crucial for developing a safe real-world machine learning system. Current reconstruction-based methods provide a good alternative approach by measuring the reconstruction error between the input and its corresponding generative counterpart in the pixel/feature space. However, such generative methods face a key dilemma: improving the reconstruction power of the generative model while keeping a compact representation of the ID data. To address this issue, we propose the diffusion-based layer-wise semantic reconstruction approach for unsupervised OOD detection. The innovation of our approach is that we leverage the diffusion model’s intrinsic data reconstruction ability to distinguish ID samples from OOD samples in the latent feature space. Moreover, to set up a comprehensive and discriminative feature representation, we devise a multi-layer semantic feature extraction strategy. By distorting the extracted features with Gaussian noise and applying the diffusion model for feature reconstruction, the separation of ID and OOD samples is implemented according to the reconstruction errors. Extensive experimental results on multiple benchmarks built upon various datasets demonstrate that our method achieves state-of-the-art performance in terms of detection accuracy and speed. Code is available at https://github.com/xbyym/DLSR.

[340] A Geometric Multimodal Foundation Model Integrating Bp-MRI and Clinical Reports in Prostate Cancer Classification

Juan A. Olmos, Antoine Manzanera, Fabio Martínez

Main category: cs.CV

TL;DR: A multimodal foundation model (MFM-Geom) for prostate cancer diagnosis that combines bi-parametric MRI and clinical reports using geometric representations and Riemannian deep learning, achieving strong performance with limited data.

Details

Motivation: Prostate cancer diagnosis using MRI and clinical variables is subjective and existing methods overlook clinical context while suffering from data scarcity, limiting robust representation learning.

Method: Proposes MFM-Geom, a geometric multimodal foundation model that learns representations from bp-MRI and clinical reports, using symmetric positive definite matrices and Riemannian deep learning to integrate imaging-text representations.

Result: Using only 10% of training data, MFM-Geom outperformed baseline class token embedding-based classification by +8.3%, achieving AUC-PR of 90.67%. Generalization on external dataset achieved AUC-PR of 90.6.

Conclusion: The geometric multimodal foundation model effectively integrates medical imaging and clinical context, demonstrating robust performance with limited data and good generalization capabilities for prostate cancer diagnosis.

Abstract: Prostate cancer (PCa) is one of the most common cancers in men worldwide. Bi-parametric MRI (bp-MRI) and clinical variables are crucial for PCa identification and improving treatment decisions. However, this process is subjective to expert interpretations. Furthermore, most existing computer-aided diagnosis methods focus on imaging-based models, overlooking the clinical context and suffering from data scarcity, limiting their ability to learn robust representations. We propose a geometric multimodal Foundation Model (FM), named MFM-Geom, that learns representations from bp-MRI and clinical reports, encoding visual findings and information from the context of clinical variables. In the representations classification head, the approach leverages symmetric positive definite (SPD) matrices and Riemannian deep learning to integrate imaging-text representations from a biomedical multimodal FM. Using 10% of the training data, MFM-Geom outperformed baseline class token embedding-based classification (+8.3%, AUC-PR of 90.67). Generalization on external dataset confirmed the robustness of fine-tuning biomedical FM, achieving an AUC-PR of 90.6.

[341] An Extended VIIRS-like Artificial Nighttime Light Data Reconstruction (1986-2024)

Yihe Tian, Kwan Man Cheng, Zhengbo Zhang, Tao Zhang, Junning Feng, Zhehao Ren, Suju Li, Dongmei Yan, Bing Xu

Main category: cs.CV

TL;DR: A new annual nighttime light dataset (EVAL) for China (1986-2024) generated using a two-stage deep learning model that addresses intensity underestimation and structural detail omission in existing VIIRS-like products.

Details

Motivation: Existing VIIRS-like NTL data products have limitations: they underestimate light intensity and omit structural details, and the VIIRS sensor's temporal coverage (starting 2012) restricts long-term studies. There's a need for high-quality extended NTL data for China spanning earlier periods.

Method: A novel two-stage deep learning model: first constructs an initial estimate, then refines fine-grained structural details using high-resolution impervious surface data as guidance to generate the EVAL dataset.

Result: EVAL significantly outperforms state-of-the-art products, showing superior temporal consistency and stronger correlation with socioeconomic indicators.

Conclusion: The EVAL dataset provides a high-quality, extended VIIRS-like nighttime light dataset for China from 1986 to 2024, overcoming limitations of existing products through a novel deep learning approach.

Abstract: Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Current extended VIIRS-like NTL data products suffer from two significant shortcomings: the underestimation of light intensity and the omission of structural details. To overcome these limitations, we present the Extended VIIRS-like Artificial Nighttime Light (EVAL) dataset, a new annual NTL dataset for China spanning from 1986 to 2024. This dataset was generated using a novel two-stage deep learning model designed to address the aforementioned shortcomings. The model first constructs an initial estimate and subsequently refines fine-grained structural details using high-resolution impervious surface data as guidance. Quantitative evaluations demonstrate that EVAL significantly outperforms state-of-the-art products, exhibiting superior temporal consistency and a stronger correlation with socioeconomic indicators.

[342] CAPA: Contribution-Aware Pruning and FFN Approximation for Efficient Large Vision-Language Models

Samyak Jha, Junho Kim

Main category: cs.CV

TL;DR: CAPA: A framework that improves efficiency in Large Vision-Language Models by pruning visual tokens using attention contribution scores and approximating FFN computations with linear approximations.

Details

Motivation: Current LVLMs suffer from high computational costs due to processing thousands of visual tokens, but existing methods for token selection (using attention scores) are imperfect proxies for actual token contribution.

Method: Introduces Attention Contribution metric (weighting attention probabilities by value vector magnitude) to better identify important visual tokens. Proposes CAPA framework with dual strategy: 1) pruning visual tokens at critical functional transitions using attention contribution, and 2) reducing FFN computation through efficient linear approximations, especially for intermediate layers where image tokens exhibit linear behavior.

Result: CAPA achieves competent efficiency-performance trade-offs with improved robustness across various benchmarks and baselines. Identifies that visual attention sinks consist of Probability Dumps (low contribution, can be pruned) and Structural Anchors (high contribution, essential for performance).

Conclusion: Attention Contribution provides better criterion for visual token selection than traditional attention scores. The proposed CAPA framework effectively reduces computational costs while maintaining performance in LVLMs through targeted pruning and FFN approximation.

Abstract: Efficient inference in Large Vision-Language Models is constrained by the high cost of processing thousands of visual tokens, yet it remains unclear which tokens and computations can be safely removed. While attention scores are commonly used to estimate visual token importance, they are an imperfect proxy for actual contribution. We show that Attention Contribution, which weights attention probabilities by value vector magnitude, provides a more accurate criterion for visual token selection. Our empirical analysis reveals that visual attention sinks are functionally heterogeneous, comprising Probability Dumps with low contribution that can be safely pruned, and Structural Anchors with high contribution essential for maintaining model performance. Further, we identify substantial redundancy in Feed-Forward Networks (FFNs) associated with visual tokens, particularly in intermediate layers where image tokens exhibit linear behavior. Based on our findings, we introduce CAPA (Contribution-Aware Pruning and FFN Approximation), a dual-strategy framework that prunes visual tokens using attention contribution at critical functional transitions and reduces FFN computation through efficient linear approximations. Experiments on various benchmarks across baselines show that CAPA achieves competent efficiency–performance trade-offs with improved robustness.

[343] Hospital-Specific Bias in Patch-Based Pathology Models

Mengliang Zhang

Main category: cs.CV

TL;DR: PFMs show strong performance but are sensitive to hospital domain shifts; a lightweight adversarial adaptor reduces hospital-specific bias while maintaining disease classification accuracy.

Details

Motivation: Pathology foundation models achieve strong performance on diverse histopathology tasks, but their sensitivity to hospital-specific domain shifts remains underexplored, requiring systematic evaluation and methods to enhance cross-hospital robustness.

Method: Systematically evaluate state-of-the-art PFMs on TCGA patch-level datasets and introduce a lightweight adversarial adaptor to remove hospital-related domain information from latent representations.

Result: While disease classification accuracy is largely maintained, the adaptor effectively reduces hospital-specific bias, as confirmed by t-SNE visualizations, establishing a benchmark for assessing cross-hospital robustness in PFMs.

Conclusion: The study provides a practical strategy for enhancing generalization under heterogeneous clinical settings and establishes a benchmark for assessing cross-hospital robustness in pathology foundation models.

Abstract: Pathology foundation models (PFMs) achieve strong performance on diverse histopathology tasks, but their sensitivity to hospital-specific domain shifts remains underexplored. We systematically evaluate state-of-the-art PFMs on TCGA patch-level datasets and introduce a lightweight adversarial adaptor to remove hospital-related domain information from latent representations. Experiments show that, while disease classification accuracy is largely maintained, the adaptor effectively reduces hospital-specific bias, as confirmed by t-SNE visualizations. Our study establishes a benchmark for assessing cross-hospital robustness in PFMs and provides a practical strategy for enhancing generalization under heterogeneous clinical settings. Our code is available at https://github.com/MengRes/pfm_domain_bias.

[344] SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis

Rishav Pramanik, Ian E. Nielsen, Jeff Smith, Saurav Pandit, Ravi P. Ramachandran, Zhaozheng Yin

Main category: cs.CV

TL;DR: SANEval is a comprehensive benchmark for evaluating text-to-image models’ compositional capabilities, using LLMs for prompt understanding and open-vocabulary object detection to assess spatial relations, attribute binding, and numeracy.

Details

Motivation: Current T2I models struggle with complex prompts involving multiple objects, attributes, and spatial relationships, but progress is hampered by inadequate evaluation methods that lack fine-grained diagnostic capabilities and are constrained by closed vocabularies.

Method: Developed SANEval benchmark with a scalable pipeline combining LLMs for deep prompt understanding and LLM-enhanced open-vocabulary object detectors to evaluate compositional adherence without vocabulary constraints.

Result: SANEval’s automated evaluations provide more faithful proxy for human assessment, achieving statistically significant Spearman’s rank correlation improvements over existing benchmarks across attribute binding, spatial relations, and numeracy tasks.

Conclusion: SANEval addresses critical limitations in T2I evaluation, offering comprehensive compositional assessment and interpretable feedback to facilitate future research in compositional T2I generation and evaluation.

Abstract: The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We solve these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate compositional adherence, unconstrained by a fixed vocabulary. Through extensive experiments on six state-of-the-art T2I models, we demonstrate that SANEval’s automated evaluations provide a more faithful proxy for human assessment; our metric achieves a Spearman’s rank correlation with statistically different results than those of existing benchmarks across tasks of attribute binding, spatial relations, and numeracy. To facilitate future research in compositional T2I generation and evaluation, we will release the SANEval dataset and our open-source evaluation pipeline.

[345] Subspace Clustering on Incomplete Data with Self-Supervised Contrastive Learning

Huanran Li, Daniel Pimentel-Alarcón

Main category: cs.CV

TL;DR: Contrastive self-supervised framework for subspace clustering on incomplete data using masked views and SimCLR-style contrastive learning.

Details

Motivation: Most subspace clustering methods assume fully observed data, limiting effectiveness in real-world scenarios with missing entries. Need for robust methods that can handle incomplete data.

Method: Proposes Contrastive Subspace Clustering (CSC) framework: generates masked views of partially observed inputs, trains deep neural network using SimCLR-style contrastive loss to learn invariant embeddings, then clusters using sparse subspace clustering.

Result: Experiments on six benchmark datasets show CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.

Conclusion: CSC provides an effective contrastive self-supervised approach for subspace clustering on incomplete data, addressing limitations of existing methods that require fully observed data.

Abstract: Subspace clustering aims to group data points that lie in a union of low-dimensional subspaces and finds wide application in computer vision, hyperspectral imaging, and recommendation systems. However, most existing methods assume fully observed data, limiting their effectiveness in real-world scenarios with missing entries. In this paper, we propose a contrastive self-supervised framework, Contrastive Subspace Clustering (CSC), designed for clustering incomplete data. CSC generates masked views of partially observed inputs and trains a deep neural network using a SimCLR-style contrastive loss to learn invariant embeddings. These embeddings are then clustered using sparse subspace clustering. Experiments on six benchmark datasets show that CSC consistently outperforms both classical and deep learning baselines, demonstrating strong robustness to missing data and scalability to large datasets.

[346] World-Shaper: A Unified Framework for 360° Panoramic Editing

Dong Liang, Yuhao Liu, Jinyuan Jia, Youjun Zhao, Rynson W. H. Lau

Main category: cs.CV

TL;DR: World-Shaper: A geometry-aware framework for panoramic image editing directly in equirectangular projection domain, addressing spatial structure modeling and geometric distortion in 360° images.

Details

Motivation: Existing perspective-based image editing methods fail to model panoramic spatial structure, and conventional cube-map decompositions break global consistency due to mismatch with spherical geometry.

Method: Reformulates panoramic editing directly in ERP domain using a generate-then-edit paradigm with controllable panoramic generation for paired data synthesis. Introduces geometry-aware learning with position-aware shape supervision and panoramic priors through progressive training.

Result: Achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods on new benchmark PEBench, enabling coherent and flexible 360° visual world creation.

Conclusion: World-Shaper bridges panoramic generation and editing within a unified editing-centric design, effectively addressing geometric distortion and enabling high-quality 360° visual editing.

Abstract: Being able to edit panoramic images is crucial for creating realistic 360° visual experiences. However, existing perspective-based image editing methods fail to model the spatial structure of panoramas. Conventional cube-map decompositions attempt to overcome this problem but inevitably break global consistency due to their mismatch with spherical geometry. Motivated by this insight, we reformulate panoramic editing directly in the equirectangular projection (ERP) domain and present World-Shaper, a unified geometry-aware framework that bridges panoramic generation and editing within a single editing-centric design. To overcome the scarcity of paired data, we adopt a generate-then-edit paradigm, where controllable panoramic generation serves as an auxiliary stage to synthesize diverse paired examples for supervised editing learning. To address geometric distortion, we introduce a geometry-aware learning strategy that explicitly enforces position-aware shape supervision and implicitly internalizes panoramic priors through progressive training. Extensive experiments on our new benchmark, PEBench, demonstrate that our method achieves superior geometric consistency, editing fidelity, and text controllability compared to SOTA methods, enabling coherent and flexible 360° visual world creation with unified editing control. Code, model, and data will be released at our project page: https://world-shaper-project.github.io/

[347] PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories

Gemma Canet Tarrés, Manel Baradad, Francesc Moreno-Noguer, Yumeng Li

Main category: cs.CV

TL;DR: PLACID is a framework for studio-level multi-object compositing that uses a pretrained image-to-video diffusion model with text control to preserve object identities and background details, trained on synthetic data of objects moving to target positions.

Details

Motivation: Current generative AI models fall short for studio-level multi-object compositing, often altering object details, omitting/duplicating objects, and producing incorrect layouts. There's a need for simultaneous preservation of object identity, background fidelity, layout control, and complete appealing displays.

Method: Uses pretrained image-to-video diffusion model with text control to exploit temporal priors from videos for object consistency. Proposes novel data curation strategy generating synthetic sequences where randomly placed objects smoothly move to target positions, aligning with video model’s temporal priors during training.

Result: PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation with less omitted objects and visually appealing results, as demonstrated through extensive quantitative evaluations and user studies.

Conclusion: PLACID effectively bridges the gap in studio-level multi-object compositing by leveraging video temporal priors and synthetic training data, enabling coherent layouts guided by text while preserving object identities and background details.

Abstract: Recent advances in generative AI have dramatically improved photorealistic image synthesis, yet they fall short for studio-level multi-object compositing. This task demands simultaneous (i) near-perfect preservation of each item’s identity, (ii) precise background and color fidelity, (iii) layout and design elements control, and (iv) complete, appealing displays showcasing all objects. However, current state-of-the-art models often alter object details, omit or duplicate objects, and produce layouts with incorrect relative sizing or inconsistent item presentations. To bridge this gap, we introduce PLACID, a framework that transforms a collection of object images into an appealing multi-object composite. Our approach makes two main contributions. First, we leverage a pretrained image-to-video (I2V) diffusion model with text control to preserve objects consistency, identities, and background details by exploiting temporal priors from videos. Second, we propose a novel data curation strategy that generates synthetic sequences where randomly placed objects smoothly move to their target positions. This synthetic data aligns with the video model’s temporal priors during training. At inference, objects initialized at random positions consistently converge into coherent layouts guided by text, with the final frame serving as the composite image. Extensive quantitative evaluations and user studies demonstrate that PLACID surpasses state-of-the-art methods in multi-object compositing, achieving superior identity, background, and color preservation, with less omitted objects and visually appealing results.

[348] TokenTrim: Inference-Time Token Pruning for Autoregressive Long Video Generation

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Lior Wolf

Main category: cs.CV

TL;DR: A simple inference-time method to mitigate temporal drift in auto-regressive video generation by identifying and removing unstable latent tokens before they are reused for conditioning.

Details

Motivation: Auto-regressive video generation suffers from severe temporal drift where errors accumulate and amplify over long horizons. The authors hypothesize this drift stems from inference-time error propagation rather than insufficient model capacity, specifically from uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference.

Method: Proposes an inference-time method that identifies and removes unstable latent tokens before they are reused for conditioning. Unstable tokens are defined as those whose representations deviate significantly from previously generated batches, indicating potential corruption or semantic drift. The method removes corrupted latent tokens from the auto-regressive context without modifying spatial regions, model architecture, training procedures, or leaving latent space.

Result: The method significantly improves long-horizon temporal consistency in auto-regressive video generation by preventing unreliable latent information from influencing future generation steps.

Conclusion: Temporal drift in auto-reggressive video generation can be effectively mitigated through a simple inference-time approach that identifies and removes corrupted latent tokens, improving long-horizon consistency without architectural or training modifications.

Abstract: Auto-regressive video generation enables long video synthesis by iteratively conditioning each new batch of frames on previously generated content. However, recent work has shown that such pipelines suffer from severe temporal drift, where errors accumulate and amplify over long horizons. We hypothesize that this drift does not primarily stem from insufficient model capacity, but rather from inference-time error propagation. Specifically, we contend that drift arises from the uncontrolled reuse of corrupted latent conditioning tokens during auto-regressive inference. To correct this accumulation of errors, we propose a simple, inference-time method that mitigates temporal drift by identifying and removing unstable latent tokens before they are reused for conditioning. For this purpose, we define unstable tokens as latent tokens whose representations deviate significantly from those of the previously generated batch, indicating potential corruption or semantic drift. By explicitly removing corrupted latent tokens from the auto-regressive context, rather than modifying entire spatial regions or model parameters, our method prevents unreliable latent information from influencing future generation steps. As a result, it significantly improves long-horizon temporal consistency without modifying the model architecture, training procedure, or leaving latent space.

[349] TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs

Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, Gedas Bertasius

Main category: cs.CV

TL;DR: TimeBlind is a diagnostic benchmark for evaluating compositional spatio-temporal understanding in multimodal LLMs, revealing that current models rely on static visual shortcuts rather than genuine temporal reasoning.

Details

Motivation: Multimodal LLMs excel at static semantics but have brittle temporal understanding. There's a need for benchmarks that specifically test fine-grained temporal reasoning without conflating it with static recognition.

Method: Created TimeBlind benchmark using minimal-pairs paradigm: video pairs share identical static content but differ only in temporal structure. Categorizes temporal understanding into three levels: atomic events, event properties, and event interdependencies. Uses complementary questions to neutralize language priors.

Result: Evaluated 20+ state-of-the-art MLLMs on 600 instances (2400 video-question pairs). Best MLLM achieved only 48.2% instance accuracy (correctly distinguishing both videos in a pair), far below human performance of 98.2%.

Conclusion: Current MLLMs rely heavily on static visual shortcuts rather than genuine temporal logic. TimeBlind serves as a vital diagnostic tool for advancing video understanding capabilities in multimodal models.

Abstract: Fine-grained spatio-temporal understanding is essential for video reasoning and embodied AI. Yet, while Multimodal Large Language Models (MLLMs) master static semantics, their grasp of temporal dynamics remains brittle. We present TimeBlind, a diagnostic benchmark for compositional spatio-temporal understanding. Inspired by cognitive science, TimeBlind categorizes fine-grained temporal understanding into three levels: recognizing atomic events, characterizing event properties, and reasoning about event interdependencies. Unlike benchmarks that conflate recognition with temporal reasoning, TimeBlind leverages a minimal-pairs paradigm: video pairs share identical static visual content but differ solely in temporal structure, utilizing complementary questions to neutralize language priors. Evaluating over 20 state-of-the-art MLLMs (e.g., GPT-5, Gemini 3 Pro) on 600 curated instances (2400 video-question pairs), reveals that the Instance Accuracy (correctly distinguishing both videos in a pair) of the best performing MLLM is only 48.2%, far below the human performance (98.2%). These results demonstrate that even frontier models rely heavily on static visual shortcuts rather than genuine temporal logic, positioning TimeBlind as a vital diagnostic tool for next-generation video understanding. Dataset and code are available at https://baiqi-li.github.io/timeblind_project/ .

[350] Computer Vision and Its Relationship to Cognitive Science: A perspective from Bayes Decision Theory

Alan Yuille, Daniel Kersten

Main category: cs.CV

TL;DR: The paper introduces computer vision through Bayes Decision Theory, connecting Bayesian approaches with deep neural networks and discussing their integration within a unified framework.

Details

Motivation: To provide a theoretical foundation for computer vision that bridges Bayesian decision theory with deep learning approaches, connecting computational methods with cognitive science perspectives.

Method: Uses Bayes Decision Theory as a unifying framework to analyze and compare Bayesian approaches (conceptually attractive, cognitive science-aligned) and deep neural networks (practical success, biologically inspired).

Result: Presents a theoretical lens that captures key computer vision concepts, relates strengths/weaknesses of Bayesian vs. deep learning approaches, and identifies limitations of BDT for guiding future integration.

Conclusion: Bayes Decision Theory provides a valuable framework for understanding computer vision approaches, but its limitations point toward richer frameworks that can combine Bayesian and deep learning methods.

Abstract: This document presents an introduction to computer vision, and its relationship to Cognitive Science, from the perspective of Bayes Decision Theory (Berger 1985). Computer vision is a vast and complex field, so this overview has a narrow scope and provides a theoretical lens which captures many key concepts. BDT is rich enough to include two different approaches: (i) the Bayesian viewpoint, which gives a conceptually attractive framework for vision with concepts that resonate with Cognitive Science (Griffiths et al., 2024), and (ii) the Deep Neural Network approach whose successes in the real world have made Computer Vision into a trillion-dollar industry and which is motivated by the hierarchical structure of the visual ventral stream. The BDT framework relates and captures the strengths and weakness of these two approaches and, by discussing the limitations of BDT, points the way to how they can be combined in a richer framework.

[351] LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification

Rory Driscoll, Alexandros Christoforos, Chadbourne Davis

Main category: cs.CV

TL;DR: LogicGaze is a benchmark framework that evaluates whether Vision-Language Models can validate sequential causal reasoning chains against visual evidence, exposing hallucination vulnerabilities in current VLMs.

Details

Motivation: While VLMs show improved capabilities in sequential reasoning for complex multimodal tasks, there's insufficient exploration of whether they can properly ground these reasoning chains in actual visual evidence, particularly addressing the pervasive issue of hallucination.

Method: LogicGaze is a benchmark framework curated from 40,000 video segments from ShareGPT4Video and Flickr30k imagery. It integrates causal sequences with visually contradictory but linguistically plausible perturbations, forcing models to verify each reasoning step. The evaluation uses a tripartite protocol: Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection.

Result: The benchmark exposes significant vulnerabilities in state-of-the-art VLMs like Qwen2.5-VL-72B, revealing their limitations in validating sequential causal chains against visual inputs.

Conclusion: LogicGaze advocates for more robust and trustworthy multimodal reasoning by providing a framework to evaluate and improve VLMs’ ability to ground reasoning in visual evidence, with all resources made publicly available.

Abstract: While sequential reasoning enhances the capability of Vision-Language Models (VLMs) to execute complex multimodal tasks, their reliability in grounding these reasoning chains within actual visual evidence remains insufficiently explored. We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether VLMs can validate sequential causal chains against visual inputs, specifically targeting the pervasive issue of hallucination. Curated from 40,000 video segments from ShareGPT4Video and a subset of Flickr30k imagery, LogicGaze integrates causal sequences with visually contradictory yet linguistically plausible perturbations, compelling models to verify the authenticity of each reasoning step. Our tripartite evaluation protocol - Causal Validation, Grounded Narrative Synthesis, and Perturbation Rejection - exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.

[352] Opportunistic Promptable Segmentation: Leveraging Routine Radiological Annotations to Guide 3D CT Lesion Segmentation

Samuel Church, Joshua D. Warner, Danyal Maqbool, Xin Tie, Junjie Hu, Meghan G. Lubner, Tyler J. Bradshaw

Main category: cs.CV

TL;DR: SAM2CT is a promptable segmentation model that converts radiologists’ sparse annotations (arrows/lines) in CT images into 3D segmentations, enabling large-scale dataset creation from existing clinical data.

Details

Motivation: Creating 3D segmentation datasets for CT imaging is costly and time-consuming, requiring extensive manual annotation by radiologists. However, radiologists routinely create sparse annotations (arrows, lines) during clinical reads that are stored in PACS systems but underutilized for training ML models.

Method: SAM2CT extends SAM2 with a prompt encoder supporting arrow and line inputs, and introduces Memory-Conditioned Memories (MCM) for 3D medical volumes. It converts sparse radiologist annotations into full 3D segmentations through promptable segmentation.

Result: SAM2CT outperforms existing promptable segmentation models, achieving Dice scores of 0.649 (arrow prompts) and 0.757 (line prompts). On clinical PACS data, it generates clinically acceptable segmentations in 87% of cases and shows strong zero-shot performance on Emergency Department findings.

Conclusion: Large-scale mining of historical GSPS annotations using promptable segmentation models like SAM2CT represents a scalable approach for generating 3D CT segmentation datasets, potentially revolutionizing medical imaging dataset creation.

Abstract: The development of machine learning models for CT imaging depends on the availability of large, high-quality, and diverse annotated datasets. Although large volumes of CT images and reports are readily available in clinical picture archiving and communication systems (PACS), 3D segmentations of critical findings are costly to obtain, typically requiring extensive manual annotation by radiologists. On the other hand, it is common for radiologists to provide limited annotations of findings during routine reads, such as line measurements and arrows, that are often stored in PACS as GSPS objects. We posit that these sparse annotations can be extracted along with CT volumes and converted into 3D segmentations using promptable segmentation models, a paradigm we term Opportunistic Promptable Segmentation. To enable this paradigm, we propose SAM2CT, the first promptable segmentation model designed to convert radiologist annotations into 3D segmentations in CT volumes. SAM2CT builds upon SAM2 by extending the prompt encoder to support arrow and line inputs and by introducing Memory-Conditioned Memories (MCM), a memory encoding strategy tailored to 3D medical volumes. On public lesion segmentation benchmarks, SAM2CT outperforms existing promptable segmentation models and similarly trained baselines, achieving Dice similarity coefficients of 0.649 for arrow prompts and 0.757 for line prompts. Applying the model to pre-existing GSPS annotations from a clinical PACS (N = 60), SAM2CT generates 3D segmentations that are clinically acceptable or require only minor adjustments in 87% of cases, as scored by radiologists. Additionally, SAM2CT demonstrates strong zero-shot performance on select Emergency Department findings. These results suggest that large-scale mining of historical GSPS annotations represents a promising and scalable approach for generating 3D CT segmentation datasets.

[353] On the Assessment of Sensitivity of Autonomous Vehicle Perception

Apostol Vassilev, Munawar Hasan, Edward Griffor, Honglan Jin, Pavel Piliptchak, Mahima Arora, Thoshitha Gamage

Main category: cs.CV

TL;DR: Paper evaluates robustness of automated vehicle perception systems using ensemble models and predictive sensitivity quantification under adverse conditions like weather, lighting, and occlusions.

Details

Motivation: Automated driving depends on reliable perception systems that must perform accurately under both ideal and adverse conditions (natural and adversarial factors). Current systems face challenges with perception errors and detection delays in challenging scenarios, necessitating robustness assessment and reliability improvement strategies.

Method: Uses predictive sensitivity quantification based on ensemble of models to capture model disagreement and inference variability. Evaluates perception performance under adverse driving scenarios in simulated and real-world conditions. Proposes notional architecture for perception assessment and develops criterion based on AV stopping distance at stop signs on varying road surfaces. Tests five state-of-the-art computer vision models: YOLO (v8-v9), DETR50, DETR101, and RT-DETR.

Result: Diminished lighting conditions (fog, low sun altitude) have greatest impact on perception model performance. Adversarial road conditions like occlusions increase perception sensitivity, and performance drops further with combination of adversarial conditions and inclement weather. Greater distance to roadway objects leads to greater impact on perception performance and diminished robustness.

Conclusion: Perception systems for automated vehicles show significant vulnerability to adverse environmental conditions, particularly lighting and combined adversarial factors. Ensemble-based sensitivity quantification provides effective assessment methodology, revealing critical robustness limitations that must be addressed for reliable automated driving.

Abstract: The viability of automated driving is heavily dependent on the performance of perception systems to provide real-time accurate and reliable information for robust decision-making and maneuvers. These systems must perform reliably not only under ideal conditions, but also when challenged by natural and adversarial driving factors. Both of these types of interference can lead to perception errors and delays in detection and classification. Hence, it is essential to assess the robustness of the perception systems of automated vehicles (AVs) and explore strategies for making perception more reliable. We approach this problem by evaluating perception performance using predictive sensitivity quantification based on an ensemble of models, capturing model disagreement and inference variability across multiple models, under adverse driving scenarios in both simulated environments and real-world conditions. A notional architecture for assessing perception performance is proposed. A perception assessment criterion is developed based on an AV’s stopping distance at a stop sign on varying road surfaces, such as dry and wet asphalt, and vehicle speed. Five state-of-the-art computer vision models are used, including YOLO (v8-v9), DEtection TRansformer (DETR50, DETR101), Real-Time DEtection TRansformer (RT-DETR)in our experiments. Diminished lighting conditions, e.g., resulting from the presence of fog and low sun altitude, have the greatest impact on the performance of the perception models. Additionally, adversarial road conditions such as occlusions of roadway objects increase perception sensitivity and model performance drops when faced with a combination of adversarial road conditions and inclement weather conditions. Also, it is demonstrated that the greater the distance to a roadway object, the greater the impact on perception performance, hence diminished perception robustness.

[354] Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception

Alexandros Christoforos, Sarah Jenkins, Michael Brown, Tuan Pham, David Chen

Main category: cs.CV

TL;DR: SynerNet framework uses multi-agent neural network with four specialized units to address cross-modal alignment degeneration in VLMs for OOD concepts, improving few-shot and zero-shot performance.

Details

Motivation: Addresses the problem of cross-modal alignment degeneration in Vision-Language Models when encountering Out-of-Distribution concepts, which limits their generalization capabilities to novel domains.

Method: Proposes Synergistic Neural Agents Network (SynerNet) with four specialized computational units: visual perception, linguistic context, nominal embedding, and global coordination. These units collaborate through structured message-propagation protocol to rectify modality disparities. Includes multi-agent latent space nomenclature acquisition, semantic context-interchange algorithm for few-shot adaptation, and adaptive dynamic equilibrium mechanism.

Result: Empirical evaluations on VISTA-Beyond benchmark show substantial performance improvements in both few-shot and zero-shot scenarios, with precision gains ranging from 1.2% to 5.4% across diverse domains.

Conclusion: SynerNet effectively mitigates cross-modal alignment degeneration in VLMs for OOD concepts through synergistic multi-agent collaboration, enhancing generalization capabilities to novel domains.

Abstract: This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.

[355] When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li

Main category: cs.CV

TL;DR: MAD-RAG addresses attention distraction in vision-language models during retrieval-augmented generation by decoupling visual grounding from context integration through dual-question formulation and attention mixing.

Details

Motivation: The paper identifies a new failure mode in RAG for vision-language models called Attention Distraction (AD), where retrieved text context suppresses visual attention globally, causing models to fail on questions they could originally answer correctly without retrieval.

Method: Proposes MAD-RAG, a training-free intervention that uses dual-question formulation to decouple visual grounding from context integration, combined with attention mixing to preserve image-conditioned evidence.

Result: Extensive experiments on OK-VQA, E-VQA, and InfoSeek show MAD-RAG consistently outperforms existing baselines across different model families with absolute gains up to 4.76%, 9.20%, and 6.18% over vanilla RAG, rectifying up to 74.68% of failure cases.

Conclusion: MAD-RAG effectively mitigates attention distraction in RAG for vision-language models, improving performance on knowledge-based VQA tasks with negligible computational overhead.

Abstract: While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

[356] AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning

Chongyu Qu, Zhengyi Lu, Yuxiang Lai, Thomas Z. Li, Junchao Zhu, Junlin Guo, Juming Xiong, Yanfan Zhu, Yuechen Yang, Allen J. Luna, Kim L. Sandler, Bennett A. Landman, Yuankai Huo

Main category: cs.CV

TL;DR: AdaFuse uses reinforcement learning to adaptively select and fuse medical modalities for lung cancer risk prediction, achieving better performance with fewer computations than fixed fusion methods.

Details

Motivation: Current multimodal fusion methods process all available modalities equally or with learned weights, but don't address whether certain modalities should be used at all for individual patients. There's a need for personalized modality selection.

Method: AdaFuse formulates multimodal fusion as a sequential decision process using reinforcement learning. A policy network iteratively decides whether to incorporate additional modalities or proceed to prediction based on already acquired information, enabling early termination when sufficient information is available.

Result: Achieved highest AUC (0.762) on NLST dataset compared to single-modality baselines (0.732), fixed fusion (0.759), and adaptive baselines DynMM (0.754) and MoE (0.742), while using fewer FLOPs than triple-modality methods.

Conclusion: Demonstrates RL’s potential for personalized multimodal fusion in medical imaging, shifting from uniform fusion toward adaptive diagnostic pipelines that learn when to consult additional modalities.

Abstract: Multimodal fusion has emerged as a promising paradigm for disease diagnosis and prognosis, integrating complementary information from heterogeneous data sources such as medical images, clinical records, and radiology reports. However, existing fusion methods process all available modalities through the network, either treating them equally or learning to assign different contribution weights, leaving a fundamental question unaddressed: for a given patient, should certain modalities be used at all? We present AdaFuse, an adaptive multimodal fusion framework that leverages reinforcement learning (RL) to learn patient-specific modality selection and fusion strategies for lung cancer risk prediction. AdaFuse formulates multimodal fusion as a sequential decision process, where the policy network iteratively decides whether to incorporate an additional modality or proceed to prediction based on the information already acquired. This sequential formulation enables the model to condition each selection on previously observed modalities and terminate early when sufficient information is available, rather than committing to a fixed subset upfront. We evaluate AdaFuse on the National Lung Screening Trial (NLST) dataset. Experimental results demonstrate that AdaFuse achieves the highest AUC (0.762) compared to the best single-modality baseline (0.732), the best fixed fusion strategy (0.759), and adaptive baselines including DynMM (0.754) and MoE (0.742), while using fewer FLOPs than all triple-modality methods. Our work demonstrates the potential of reinforcement learning for personalized multimodal fusion in medical imaging, representing a shift from uniform fusion strategies toward adaptive diagnostic pipelines that learn when to consult additional modalities and when existing information suffices for accurate prediction.

[357] MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI

Zhengyi Lu, Ming Lu, Chongyu Qu, Junchao Zhu, Junlin Guo, Marilyn Lionts, Yanfan Zhu, Yuechen Yang, Tianyuan Yao, Jayasai Rajagopal, Bennett Allan Landman, Xiao Wang, Xinqiang Yan, Yuankai Huo

Main category: cs.CV

TL;DR: MASC: Unified RL framework for joint optimization of metal-aware k-space sampling and artifact correction in accelerated MRI with metal implants

Details

Motivation: Metal implants in MRI cause severe artifacts that degrade image quality. Traditional approaches treat metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems, but they should be jointly optimized for better performance.

Method: Proposes MASC, a reinforcement learning framework using Proximal Policy Optimization (PPO) agent to select k-space phase-encoding lines under limited acquisition budget. Uses physics-based simulation to create paired dataset with/without metal implants. Combines U-Net-based MAR network with acquisition policy in end-to-end training.

Result: MASC’s learned policies outperform conventional sampling strategies. End-to-end training improves performance compared to using frozen pre-trained MAR network. Cross-dataset experiments on FastMRI with physics-based simulation confirm generalization to realistic clinical MRI data.

Conclusion: Joint optimization of acquisition and artifact correction is beneficial for MRI with metal implants. The unified RL framework effectively learns sampling policies that maximize reconstruction quality while reducing metal artifacts.

Abstract: Metal implants in MRI cause severe artifacts that degrade image quality and hinder clinical diagnosis. Traditional approaches address metal artifact reduction (MAR) and accelerated MRI acquisition as separate problems. We propose MASC, a unified reinforcement learning framework that jointly optimizes metal-aware k-space sampling and artifact correction for accelerated MRI. To enable supervised training, we construct a paired MRI dataset using physics-based simulation, generating k-space data and reconstructions for phantoms with and without metal implants. This paired dataset provides simulated 3D MRI scans with and without metal implants, where each metal-corrupted sample has an exactly matched clean reference, enabling direct supervision for both artifact reduction and acquisition policy learning. We formulate active MRI acquisition as a sequential decision-making problem, where an artifact-aware Proximal Policy Optimization (PPO) agent learns to select k-space phase-encoding lines under a limited acquisition budget. The agent operates on undersampled reconstructions processed through a U-Net-based MAR network, learning patterns that maximize reconstruction quality. We further propose an end-to-end training scheme where the acquisition policy learns to select k-space lines that best support artifact removal while the MAR network simultaneously adapts to the resulting undersampling patterns. Experiments demonstrate that MASC’s learned policies outperform conventional sampling strategies, and end-to-end training improves performance compared to using a frozen pre-trained MAR network, validating the benefit of joint optimization. Cross-dataset experiments on FastMRI with physics-based artifact simulation further confirm generalization to realistic clinical MRI data. The code and models of MASC have been made publicly available: https://github.com/hrlblab/masc

[358] ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Ignacy Kolton, Kacper Marzol, Paweł Batorski, Marcin Mazur, Paul Swoboda, Przemysław Spurek

Main category: cs.CV

TL;DR: ReLAPSe is a reinforcement learning framework that efficiently restores concepts from unlearned diffusion models by using model-intrinsic feedback signals, enabling near-real-time recovery of fine-grained identities and styles.

Details

Motivation: Existing adversarial approaches for exploiting leakage in unlearned diffusion models have limitations: optimization-based methods are computationally expensive, while reasoning-based techniques lack direct feedback from the model's latent visual representations.

Method: ReLAPSe reformulates concept restoration as a reinforcement learning problem using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model’s noise prediction loss as model-intrinsic feedback to train an agent for textual prompt manipulation.

Result: Achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing scalable red-teaming capabilities for unlearned diffusion models.

Conclusion: ReLAPSe pioneers the shift from per-instance optimization to global policy learning, offering a scalable tool for rigorous security evaluation of machine unlearning in text-to-image diffusion models.

Abstract: Machine unlearning is a key defense mechanism for removing unauthorized concepts from text-to-image diffusion models, yet recent evidence shows that latent visual information often persists after unlearning. Existing adversarial approaches for exploiting this leakage are constrained by fundamental limitations: optimization-based methods are computationally expensive due to per-instance iterative search. At the same time, reasoning-based and heuristic techniques lack direct feedback from the target model’s latent visual representations. To address these challenges, we introduce ReLAPSe, a policy-based adversarial framework that reformulates concept restoration as a reinforcement learning problem. ReLAPSe trains an agent using Reinforcement Learning with Verifiable Rewards (RLVR), leveraging the diffusion model’s noise prediction loss as a model-intrinsic and verifiable feedback signal. This closed-loop design directly aligns textual prompt manipulation with latent visual residuals, enabling the agent to learn transferable restoration strategies rather than optimizing isolated prompts. By pioneering the shift from per-instance optimization to global policy learning, ReLAPSe achieves efficient, near-real-time recovery of fine-grained identities and styles across multiple state-of-the-art unlearning methods, providing a scalable tool for rigorous red-teaming of unlearned diffusion models. Some experimental evaluations involve sensitive visual concepts, such as nudity. Code is available at https://github.com/gmum/ReLaPSe

[359] Modeling Image-Caption Rating from Comparative Judgments

Kezia Minni, Qiang Zhang, Monoshiz Mahbub Khan, Zhe Yu

Main category: cs.CV

TL;DR: A comparative learning framework for image caption evaluation that models human pairwise comparisons instead of direct ratings, reducing annotation costs while maintaining effectiveness.

Details

Motivation: Direct rating of image caption accuracy is time-consuming and subjective for humans, while pairwise comparisons are easier and faster. The paper aims to develop a more efficient annotation approach for caption evaluation.

Method: Proposes a comparative learning framework that models human pairwise judgments between captions. Uses VICR dataset with ResNet-50 for visual features and MiniLM for text features. Trains both regression (direct rating) and comparative learning models, comparing their performance.

Result: Regression model achieves better performance (Pearson’s ρ: 0.7609, Spearman’s r_s: 0.7089), but comparative learning model steadily improves with more data and approaches regression baseline. Human evaluation shows comparative annotation yields faster results and greater inter-annotator agreement.

Conclusion: Comparative learning can effectively model human preferences for image caption evaluation while significantly reducing annotation costs, making it a practical alternative to direct rating approaches.

Abstract: Rating the accuracy of captions in describing images is time-consuming and subjective for humans. In contrast, it is often easier for people to compare two captions and decide which one better matches a given image. In this work, we propose a machine learning framework that models such comparative judgments instead of direct ratings. The model can then be applied to rank unseen image-caption pairs in the same way as a regression model trained on direct ratings. Using the VICR dataset, we extract visual features with ResNet-50 and text features with MiniLM, then train both a regression model and a comparative learning model. While the regression model achieves better performance (Pearson’s $ρ$: 0.7609 and Spearman’s $r_s$: 0.7089), the comparative learning model steadily improves with more data and approaches the regression baseline. In addition, a small-scale human evaluation study comparing absolute rating, pairwise comparison, and same-image comparison shows that comparative annotation yields faster results and has greater agreement among human annotators. These results suggest that comparative learning can effectively model human preferences while significantly reducing the cost of human annotations.

[360] Deep Learning-Based Object Detection for Autonomous Vehicles: A Comparative Study of One-Stage and Two-Stage Detectors on Basic Traffic Objects

Bsher Karbouj, Adam Michael Altenbuchner, Joerg Krueger

Main category: cs.CV

TL;DR: Comparative analysis of YOLOv5 vs Faster R-CNN for autonomous vehicle object detection, evaluating performance on real/synthetic datasets with metrics including mAP, recall, and inference speed.

Details

Motivation: Autonomous vehicles need robust object detection, but limited guidance exists on selecting appropriate deep learning methods (YOLO, SSD, Faster R-CNN) for specific driving applications, affecting accuracy, speed, and environmental robustness.

Method: Comprehensive experimental comparison of YOLOv5 (one-stage detector) and Faster R-CNN (two-stage detector) evaluated on diverse real/synthetic datasets using metrics like mean Average Precision (mAP), recall, and inference speed.

Result: YOLOv5 shows superior mAP, recall, and training efficiency, especially with larger datasets and higher resolutions. Faster R-CNN excels at detecting small distant objects and performs better in challenging lighting conditions.

Conclusion: Both models have trade-offs: YOLOv5 offers better overall performance and efficiency, while Faster R-CNN provides advantages for specific challenging scenarios like small object detection and poor lighting conditions in autonomous driving.

Abstract: Object detection is a crucial component in autonomous vehicle systems. It enables the vehicle to perceive and understand its environment by identifying and locating various objects around it. By utilizing advanced imaging and deep learning techniques, autonomous vehicle systems can rapidly and accurately identify objects based on their features. Different deep learning methods vary in their ability to accurately detect and classify objects in autonomous vehicle systems. Selecting the appropriate method significantly impacts system performance, robustness, and efficiency in real-world driving scenarios. While several generic deep learning architectures like YOLO, SSD, and Faster R-CNN have been proposed, guidance on their suitability for specific autonomous driving applications is often limited. The choice of method affects detection accuracy, processing speed, environmental robustness, sensor integration, scalability, and edge case handling. This study provides a comprehensive experimental analysis comparing two prominent object detection models: YOLOv5 (a one-stage detector) and Faster R-CNN (a two-stage detector). Their performance is evaluated on a diverse dataset combining real and synthetic images, considering various metrics including mean Average Precision (mAP), recall, and inference speed. The findings reveal that YOLOv5 demonstrates superior performance in terms of mAP, recall, and training efficiency, particularly as dataset size and image resolution increase. However, Faster R-CNN shows advantages in detecting small, distant objects and performs well in challenging lighting conditions. The models’ behavior is also analyzed under different confidence thresholds and in various real-world scenarios, providing insights into their applicability for autonomous driving systems.

[361] Robust automatic brain vessel segmentation in 3D CTA scans using dynamic 4D-CTA data

Alberto Mario Ceballos-Arroyo, Shrikanth M. Yadav, Chu-Hsuan Lin, Jisoo Kim, Geoffrey S. Young, Huaizu Jiang, Lei Qin

Main category: cs.CV

TL;DR: Novel methodology for brain vasculature annotation using dynamic 4D-CTA scans with bone/soft tissue subtraction, enabling robust deep learning segmentation with enhanced dataset size and performance.

Details

Motivation: Manual annotation of brain vessels from CTA scans is time-consuming and labor-intensive. The researchers aim to develop an automated approach using dynamic 4D-CTA to reduce manual effort while improving segmentation accuracy.

Method: Uses dynamic 4D-CTA head scans with multiple time points to subtract bone and soft tissue, enhancing vessel visualization. Trains deep learning models (nnUNet) on ground truth annotations, using same segmentation across multiple phases to effectively enlarge dataset 4-5x and induce robustness to contrast phases.

Result: Achieved significantly better segmentations than similarly-sized datasets, with average mDC of 0.846 for arteries and 0.957 for veins. Low error margins (aDHD of 0.304 mm for arteries, 0.078 for veins) and high topology sensitivity (tSens of 0.877 for arteries, 0.974 for veins).

Conclusion: The proposed methodology successfully reduces manual annotation effort while achieving excellent accuracy in brain vessel segmentation, with publicly available code and model weights for community use.

Abstract: In this study, we develop a novel methodology for annotating the brain vasculature using dynamic 4D-CTA head scans. By using multiple time points from dynamic CTA acquisitions, we subtract bone and soft tissue to enhance the visualization of arteries and veins, reducing the effort required to obtain manual annotations of brain vessels. We then train deep learning models on our ground truth annotations by using the same segmentation for multiple phases from the dynamic 4D-CTA collection, effectively enlarging our dataset by 4 to 5 times and inducing robustness to contrast phases. In total, our dataset comprises 110 training images from 25 patients and 165 test images from 14 patients. In comparison with two similarly-sized datasets for CTA-based brain vessel segmentation, a nnUNet model trained on our dataset can achieve significantly better segmentations across all vascular regions, with an average mDC of 0.846 for arteries and 0.957 for veins in the TopBrain dataset. Furthermore, metrics such as average directed Hausdorff distance (adHD) and topology sensitivity (tSens) reflected similar trends: using our dataset resulted in low error margins (aDHD of 0.304 mm for arteries and 0.078 for veins) and high sensitivity (tSens of 0.877 for arteries and 0.974 for veins), indicating excellent accuracy in capturing vessel morphology. Our code and model weights are available online: https://github.com/alceballosa/robust-vessel-segmentation

[362] Brazilian Portuguese Image Captioning with Transformers: A Study on Cross-Native-Translated Dataset

Gabriel Bromonschenkel, Alessandro L. Koerich, Thiago M. Paixão, Hilário Tomaz Alves de Oliveira

Main category: cs.CV

TL;DR: Cross-native-translated evaluation of Transformer-based vision-language models for Brazilian Portuguese image captioning, comparing native vs. translated datasets and analyzing model performance and biases.

Details

Motivation: Address the gap in image captioning research for low-resource languages like Brazilian Portuguese, which lack specialized datasets and models compared to English-based research.

Method: Proposes cross-native-translated evaluation using Flickr30K with native Brazilian Portuguese captions vs. automatically translated captions. Uses cross-context approach (train on one, test on other), attention maps for interpretation, and CLIP-Score for image-text alignment evaluation.

Result: Swin-DistilBERTimbau consistently outperforms other models with strong generalization. ViTucano (Brazilian Portuguese VLM) surpasses larger multilingual models in text metrics, while GPT-4 achieves highest CLIP-Score. Attention analysis reveals systematic biases including gender misclassification and spatial inconsistencies.

Conclusion: The study provides valuable insights into cross-lingual vision-language model evaluation for low-resource languages, highlighting performance differences between native and translated datasets, and revealing systematic biases in model attention mechanisms.

Abstract: Image captioning (IC) refers to the automatic generation of natural language descriptions for images, with applications ranging from social media content generation to assisting individuals with visual impairments. While most research has been focused on English-based models, low-resource languages such as Brazilian Portuguese face significant challenges due to the lack of specialized datasets and models. Several studies create datasets by automatically translating existing ones to mitigate resource scarcity. This work addresses this gap by proposing a cross-native-translated evaluation of Transformer-based vision and language models for Brazilian Portuguese IC. We use a version of Flickr30K comprised of captions manually created by native Brazilian Portuguese speakers and compare it to a version with captions automatically translated from English to Portuguese. The experiments include a cross-context approach, where models trained on one dataset are tested on the other to assess the translation impact. Additionally, we incorporate attention maps for model inference interpretation and use the CLIP-Score metric to evaluate the image-description alignment. Our findings show that Swin-DistilBERTimbau consistently outperforms other models, demonstrating strong generalization across datasets. ViTucano, a Brazilian Portuguese pre-trained VLM, surpasses larger multilingual models (GPT-4o, LLaMa 3.2 Vision) in traditional text-based evaluation metrics, while GPT-4 models achieve the highest CLIP-Score, highlighting improved image-text alignment. Attention analysis reveals systematic biases, including gender misclassification, object enumeration errors, and spatial inconsistencies. The datasets and the models generated and analyzed during the current study are available in: https://github.com/laicsiifes/transformer-caption-ptbr.

[363] Modeling Art Evaluations from Comparative Judgments: A Deep Learning Approach to Predicting Aesthetic Preferences

Manoj Reddy Bethi, Sai Rupa Jhade, Pravallika Yaganti, Monoshiz Mahbub Khan, Zhe Yu

Main category: cs.CV

TL;DR: Deep learning approach for predicting human aesthetic judgments in visual art using comparative learning framework with pairwise preferences instead of direct ratings, showing improved performance and annotation efficiency.

Details

Motivation: Human aesthetic judgments in visual art are challenging due to individual preference variability and high labeling costs; comparative learning with pairwise preferences offers less cognitive burden and greater consistency than direct scoring.

Method: Extract ResNet-50 CNN features from painting images, develop deep neural network regression model and dual-branch pairwise comparison model; explore four research questions comparing regression vs. comparative learning, individual preference prediction, and annotation cost trade-offs.

Result: Deep regression model outperforms baseline by up to 328% in R²; comparative model approaches regression performance without direct rating values; individual preference prediction remains challenging; comparative judgments require 60% less annotation time per item.

Conclusion: Pairwise comparative learning is effective for aesthetic judgment modeling, offering practical utility when direct ratings are unavailable and superior annotation efficiency for large-scale preference modeling.

Abstract: Modeling human aesthetic judgments in visual art presents significant challenges due to individual preference variability and the high cost of obtaining labeled data. To reduce cost of acquiring such labels, we propose to apply a comparative learning framework based on pairwise preference assessments rather than direct ratings. This approach leverages the Law of Comparative Judgment, which posits that relative choices exhibit less cognitive burden and greater cognitive consistency than direct scoring. We extract deep convolutional features from painting images using ResNet-50 and develop both a deep neural network regression model and a dual-branch pairwise comparison model. We explored four research questions: (RQ1) How does the proposed deep neural network regression model with CNN features compare to the baseline linear regression model using hand-crafted features? (RQ2) How does pairwise comparative learning compare to regression-based prediction when lacking access to direct rating values? (RQ3) Can we predict individual rater preferences through within-rater and cross-rater analysis? (RQ4) What is the annotation cost trade-off between direct ratings and comparative judgments in terms of human time and effort? Our results show that the deep regression model substantially outperforms the baseline, achieving up to $328%$ improvement in $R^2$. The comparative model approaches regression performance despite having no access to direct rating values, validating the practical utility of pairwise comparisons. However, predicting individual preferences remains challenging, with both within-rater and cross-rater performance significantly lower than average rating prediction. Human subject experiments reveal that comparative judgments require $60%$ less annotation time per item, demonstrating superior annotation efficiency for large-scale preference modeling.

[364] 3DGS$^2$-TR: Scalable Second-Order Trust-Region Method for 3D Gaussian Splatting

Roger Hsiao, Yuchen Fang, Xiangru Huang, Ruilong Li, Hesam Rabeti, Zan Gojcic, Javad Lavaei, James Demmel, Sophia Shao

Main category: cs.CV

TL;DR: 3DGS²-TR: A second-order optimizer for 3D Gaussian Splatting that uses diagonal Hessian approximation and trust-region regularization to accelerate training with minimal memory overhead.

Details

Motivation: Existing second-order optimizers for 3DGS (like 3DGS-LM and 3DGS2) require dense curvature representations that are computationally expensive and memory-intensive. There's a need for an efficient optimizer that can accelerate 3DGS training while maintaining low memory footprint for scalability to large scenes.

Method: Proposes 3DGS²-TR, a second-order optimizer that approximates curvature using only the diagonal of the Hessian matrix via Hutchinson’s method. It’s fully matrix-free with O(n) complexity. Introduces parameter-wise trust-region technique based on squared Hellinger distance to regularize Gaussian parameter updates for stable optimization despite strong nonlinearity in 3DGS rasterization.

Result: Achieves better reconstruction quality on standard datasets using 50% fewer training iterations compared to ADAM, with less than 1GB peak GPU memory overhead (17% more than ADAM, 85% less than 3DGS-LM). Enables scalability to very large scenes and potentially distributed training.

Conclusion: 3DGS²-TR provides an efficient second-order optimization approach for 3DGS that balances training speed, reconstruction quality, and memory efficiency, making it suitable for large-scale scene reconstruction applications.

Abstract: We propose 3DGS$^2$-TR,a second-order optimizer for accelerating the scene training problem in 3D Gaussian Splatting (3DGS). Unlike existing second-order approaches that rely on explicit or dense curvature representations, such as 3DGS-LM (Höllein et al., 2025) or 3DGS2 (Lan et al., 2025), our method approximates curvature using only the diagonal of the Hessian matrix, efficiently via Hutchinson’s method. Our approach is fully matrix-free and has the same complexity as ADAM (Kingma, 2024), $O(n)$ in both computation and memory costs. To ensure stable optimization in the presence of strong nonlinearity in the 3DGS rasterization process, we introduce a parameter-wise trust-region technique based on the squared Hellinger distance, regularizing updates to Gaussian parameters. Under identical parameter initialization and without densification, 3DGS$^2$-TR is able to achieve better reconstruction quality on standard datasets, using 50% fewer training iterations compared to ADAM, while incurring less than 1GB of peak GPU memory overhead (17% more than ADAM and 85% less than 3DGS-LM), enabling scalability to very large scenes and potentially to distributed training settings.

[365] Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure

Trishna Chakraborty, Udita Ghosh, Aldair Ernesto Gongora, Ruben Glatt, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, Chengyu Song

Main category: cs.CV

TL;DR: VLMs show promise for lab safety monitoring but struggle with visual-only hazard detection; synthetic dataset creation and scene-graph-guided alignment improve performance.

Details

Motivation: Laboratory safety incidents often go unmonitored due to human limitations, and while VLMs could help, their effectiveness in real-world visual settings is unclear due to lack of appropriate evaluation data.

Method: Created synthetic dataset pipeline converting text scenarios to (image, scene graph, ground truth) triples using LLMs as scene graph architects and image generation models as renderers. Proposed scene-graph-guided alignment to bridge perceptual gaps by translating visual inputs to structured scene graphs.

Result: VLMs perform well with textual scene graphs but degrade substantially in visual-only settings. Scene-graph-guided alignment improves hazard detection performance in visual-only settings by better aligning visual inputs with VLM reasoning.

Conclusion: VLMs need structured scene understanding for effective lab safety monitoring; bridging visual perception with structured reasoning through scene graph translation improves performance in real-world visual settings.

Abstract: Laboratories are prone to severe injuries from minor unsafe actions, yet continuous safety monitoring – beyond mandatory pre-lab safety training – is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring, but their effectiveness in realistic settings is unclear due to the lack of visual evaluation data, as most safety incidents are documented primarily as unstructured text. To address this gap, we first introduce a structured data generation pipeline that converts textual laboratory scenarios into aligned triples of (image, scene graph, ground truth), using large language models as scene graph architects and image generation models as renderers. Our experiments on the synthetic dataset of 1,207 samples across 362 unique scenarios and seven open- and closed-source models show that VLMs perform effectively given textual scene graph, but degrade substantially in visual-only settings indicating difficulty in extracting structured object relationships directly from pixels. To overcome this, we propose a post-training context-engineering approach, scene-graph-guided alignment, to bridge perceptual gaps in VLMs by translating visual inputs into structured scene graphs better aligned with VLM reasoning, improving hazard detection performance in visual only settings.

[366] Text is All You Need for Vision-Language Model Jailbreaking

Yihang Chen, Zhao Xu, Youyuan Jiang, Tianle Zheng, Cho-Jui Hsieh

Main category: cs.CV

TL;DR: Text-DJ: A novel jailbreak attack on Large Vision-Language Models that bypasses safety safeguards by exploiting OCR capabilities through fragmented text queries presented as image grids with distractions.

Details

Motivation: Current LVLM safety defenses focus on analyzing explicit textual inputs or relevant visual scenes, but overlook vulnerabilities in OCR capabilities when processing fragmented multimodal inputs.

Method: Three-stage approach: 1) Decompose harmful query into benign sub-queries, 2) Select maximally irrelevant distraction queries, 3) Present all queries as image grid with sub-queries positioned in middle, exploiting OCR processing.

Result: Successfully circumvents safety alignment of state-of-the-art LVLMs by bypassing text-based filters and inducing distractions that prevent safety protocols from linking scattered sub-queries.

Conclusion: Exposes critical vulnerability in LVLMs’ OCR capabilities that are not robust to dispersed, multi-image adversarial inputs, highlighting need for defenses for fragmented multimodal inputs.

Abstract: Large Vision-Language Models (LVLMs) are increasingly equipped with robust safety safeguards to prevent responses to harmful or disallowed prompts. However, these defenses often focus on analyzing explicit textual inputs or relevant visual scenes. In this work, we introduce Text-DJ, a novel jailbreak attack that bypasses these safeguards by exploiting the model’s Optical Character Recognition (OCR) capability. Our methodology consists of three stages. First, we decompose a single harmful query into multiple and semantically related but more benign sub-queries. Second, we pick a set of distraction queries that are maximally irrelevant to the harmful query. Third, we present all decomposed sub-queries and distraction queries to the LVLM simultaneously as a grid of images, with the position of the sub-queries being middle within the grid. We demonstrate that this method successfully circumvents the safety alignment of state-of-the-art LVLMs. We argue this attack succeeds by (1) converting text-based prompts into images, bypassing standard text-based filters, and (2) inducing distractions, where the model’s safety protocols fail to link the scattered sub-queries within a high number of irrelevant queries. Overall, our findings expose a critical vulnerability in LVLMs’ OCR capabilities that are not robust to dispersed, multi-image adversarial inputs, highlighting the need for defenses for fragmented multimodal inputs.

[367] DISK: Dynamic Inference SKipping for World Models

Anugunj Naman, Gaibo Zhang, Ayushman Singh, Yaguang Zhang

Main category: cs.CV

TL;DR: DISK is a training-free adaptive inference method for autoregressive world models that coordinates diffusion transformers for video and ego-trajectory prediction via dual-branch controllers with cross-modal skip decisions, achieving 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining performance.

Details

Motivation: The paper addresses the computational cost of autoregressive world models for video and trajectory prediction in applications like autonomous driving, aiming to reduce inference time while maintaining motion-appearance consistency without requiring retraining.

Method: DISK uses dual-branch controllers with cross-modal skip decisions to coordinate two coupled diffusion transformers for video and ego-trajectory prediction. It extends higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagates controller statistics through rollout loops for long-horizon stability.

Result: When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores.

Conclusion: DISK demonstrates practical long-horizon video-and-trajectory prediction at substantially reduced computational cost, enabling more efficient autoregressive world models for applications like autonomous driving without sacrificing performance.

Abstract: We present DISK, a training-free adaptive inference method for autoregressive world models. DISK coordinates two coupled diffusion transformers for video and ego-trajectory via dual-branch controllers with cross-modal skip decisions, preserving motion-appearance consistency without retraining. We extend higher-order latent-difference skip testing to the autoregressive chain-of-forward regime and propagate controller statistics through rollout loops for long-horizon stability. When integrated into closed-loop driving rollouts on 1500 NuPlan and NuScenes samples using an NVIDIA L40S GPU, DISK achieves 2x speedup on trajectory diffusion and 1.6x speedup on video diffusion while maintaining L2 planning error, visual quality (FID/FVD), and NAVSIM PDMS scores, demonstrating practical long-horizon video-and-trajectory prediction at substantially reduced cost.

[368] Model Optimization for Multi-Camera 3D Detection and Tracking

Ethan Anderson, Justin Silva, Kyle Zheng, Sameer Pusegaonkar, Yizhou Wang, Zheng Tang, Sujit Biswas

Main category: cs.CV

TL;DR: Sparse4D is a query-based 3D detection and tracking framework for multi-camera systems that maintains stable performance under moderate frame rate reductions but suffers identity association collapse below 2 FPS, with selective quantization offering best speed-accuracy trade-offs and attention modules being precision-sensitive.

Details

Motivation: The paper addresses the need for robust multi-camera perception in indoor environments where static camera networks must handle multi-target tracking under occlusion and heterogeneous viewpoints, requiring stable identity association across varying conditions.

Method: Sparse4D uses a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. The study evaluates reduced input frame rates, post-training quantization (INT8 and FP8), transfer to WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning.

Result: Sparse4D remains stable under moderate FPS reductions but identity association collapses below 2 FPS even when detections are stable. Selective quantization of backbone and neck offers best speed-accuracy trade-off, while attention modules are consistently sensitive to low precision. Low-FPS pretraining yields large zero-shot gains on WILDTRACK, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency but can destabilize identity propagation.

Conclusion: The study demonstrates the importance of stability-aware validation for multi-camera tracking systems, showing that while quantization and mixed precision can improve efficiency, they require careful implementation to maintain identity association stability, especially for attention-based modules.

Abstract: Outside-in multi-camera perception is increasingly important in indoor environments, where networks of static cameras must support multi-target tracking under occlusion and heterogeneous viewpoints. We evaluate Sparse4D, a query-based spatiotemporal 3D detection and tracking framework that fuses multi-view features in a shared world frame and propagates sparse object queries via instance memory. We study reduced input frame rates, post-training quantization (INT8 and FP8), transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning. To better capture identity stability, we report Average Track Duration (AvgTrackDur), which measures identity persistence in seconds. Sparse4D remains stable under moderate FPS reductions, but below 2 FPS, identity association collapses even when detections are stable. Selective quantization of the backbone and neck offers the best speed-accuracy trade-off, while attention-related modules are consistently sensitive to low precision. On WILDTRACK, low-FPS pretraining yields large zero-shot gains over the base checkpoint, while small-scale fine-tuning provides limited additional benefit. Transformer Engine mixed precision reduces latency and improves camera scalability, but can destabilize identity propagation, motivating stability-aware validation.

[369] LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

Main category: cs.CV

TL;DR: LatentLens is a novel interpretability method that maps visual token representations in VLMs to natural language descriptions by comparing them to contextualized textual representations from a large text corpus, revealing that visual tokens are highly interpretable across all layers.

Details

Motivation: To understand why LLMs can readily process visual tokens when transformed into VLMs, and to develop better interpretability methods that reveal what is encoded in visual token representations at every layer of LLM processing.

Method: LatentLens encodes a large text corpus and stores contextualized token representations for each token. Visual token representations are then compared to these textual representations, with the top-k nearest neighbors providing descriptions of the visual token content.

Result: LatentLens shows that commonly used methods like LogitLens substantially underestimate visual token interpretability. With LatentLens, the majority of visual tokens are interpretable across all studied models (10 different VLMs) and all layers, providing semantically meaningful and fine-grained interpretations.

Conclusion: LatentLens provides a powerful interpretability tool for understanding visual token processing in VLMs, contributing new evidence on the alignment between vision and language representations and opening new directions for analyzing latent representations.

Abstract: Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.

[370] PSGS: Text-driven Panorama Sliding Scene Generation via Gaussian Splatting

Xin Zhang, Shen Chen, Jiale Zhou, Lei Li

Main category: cs.CV

TL;DR: PSGS: A two-stage framework for generating high-fidelity 3D scenes from text using panoramic scene generation and 3D Gaussian Splatting with iterative MLLM feedback.

Details

Motivation: Text-to-3D scene generation is crucial for immersive applications like VR/AR/gaming, but existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in simplistic scenes.

Method: Two-stage framework: 1) Two-layer optimization for panorama generation (layout reasoning layer parses text into spatial relationships, self-optimization layer refines details via iterative MLLM feedback), 2) Panorama sliding mechanism initializes 3D Gaussian Splatting point clouds with overlapping perspectives, incorporating depth and semantic coherence losses.

Result: PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes with improved quality and detail fidelity.

Conclusion: PSGS offers a robust solution for scalable immersive content creation through high-fidelity panoramic scene generation.

Abstract: Generating realistic 3D scenes from text is crucial for immersive applications like VR, AR, and gaming. While text-driven approaches promise efficiency, existing methods suffer from limited 3D-text data and inconsistent multi-view stitching, resulting in overly simplistic scenes. To address this, we propose PSGS, a two-stage framework for high-fidelity panoramic scene generation. First, a novel two-layer optimization architecture generates semantically coherent panoramas: a layout reasoning layer parses text into structured spatial relationships, while a self-optimization layer refines visual details via iterative MLLM feedback. Second, our panorama sliding mechanism initializes globally consistent 3D Gaussian Splatting point clouds by strategically sampling overlapping perspectives. By incorporating depth and semantic coherence losses during training, we greatly improve the quality and detail fidelity of rendered scenes. Our experiments demonstrate that PSGS outperforms existing methods in panorama generation and produces more appealing 3D scenes, offering a robust solution for scalable immersive content creation.

[371] ZS-TreeSeg: A Zero-Shot Framework for Tree Crown Instance Segmentation

Pengyu Chen, Fangzheng Lyu, Sicheng Wang, Cuizhen Wang

Main category: cs.CV

TL;DR: ZS-TreeSeg is a zero-shot framework for tree crown segmentation that adapts from canopy semantic segmentation and cells instance segmentation, using topological flow fields to separate overlapping crowns without training.

Details

Motivation: Current tree crown segmentation methods face challenges: supervised deep learning requires expensive annotations and lacks generalization, while foundation models like SAM lack domain knowledge for dense, overlapping canopies in remote sensing applications.

Method: Proposes ZS-TreeSeg framework that models tree crowns as star-convex objects within topological flow fields using Cellpose-SAM adaptation. Combines canopy semantic segmentation with cells instance segmentation to mathematically separate touching tree crown instances based on vector convergence.

Result: Experiments on NEON and BAMFOREST datasets show robust generalization across diverse sensor types and canopy densities, providing a training-free solution for tree crown instance segmentation and label generation.

Conclusion: ZS-TreeSeg offers an effective zero-shot approach for tree crown segmentation that bridges the gap between domain-specific knowledge and foundation model capabilities, enabling accurate delineation in dense, overlapping canopies without training.

Abstract: Individual tree crown segmentation is an important task in remote sensing for forest biomass estimation and ecological monitoring. However, accurate delineation in dense, overlapping canopies remains a bottleneck. While supervised deep learning methods suffer from high annotation costs and limited generalization, emerging foundation models (e.g., Segment Anything Model) often lack domain knowledge, leading to under-segmentation in dense clusters. To bridge this gap, we propose ZS-TreeSeg, a Zero-Shot framework that adapts from two mature tasks: 1) Canopy Semantic segmentation; and 2) Cells instance segmentation. By modeling tree crowns as star-convex objects within a topological flow field using Cellpose-SAM, the ZS-TreeSeg framework forces the mathematical separation of touching tree crown instances based on vector convergence. Experiments on the NEON and BAMFOREST datasets and visual inspection demonstrate that our framework generalizes robustly across diverse sensor types and canopy densities, which can offer a training-free solution for tree crown instance segmentation and labels generation.

[372] Refining Strokes by Learning Offset Attributes between Strokes for Flexible Sketch Edit at Stroke-Level

Sicong Zang, Tao Sun, Cairong Yan

Main category: cs.CV

TL;DR: SketchMod refines source strokes through transformation (scale, orientation, position) to align with target sketch patterns for precise stroke-level sketch editing.

Details

Motivation: Current sketch editing methods merely reposition source strokes without adjusting for size/orientation variations, leading to implausible results when source strokes differ significantly from target patterns.

Method: Learn three offset attributes (scale, orientation, position) from source to target strokes, then align through: 1) resizing by scale, 2) rotating by orientation, and 3) displacing by position, with precise control via exposed stroke attributes.

Result: Experimental results show SketchMod achieves precise and flexible performance on stroke-level sketch editing.

Conclusion: SketchMod enables precise stroke-level sketch editing by transforming source strokes to align with target sketch patterns through learned offset attributes.

Abstract: Sketch edit at stroke-level aims to transplant source strokes onto a target sketch via stroke expansion or replacement, while preserving semantic consistency and visual fidelity with the target sketch. Recent studies addressed it by relocating source strokes at appropriate canvas positions. However, as source strokes could exhibit significant variations in both size and orientation, we may fail to produce plausible sketch editing results by merely repositioning them without further adjustments. For example, anchoring an oversized source stroke onto the target without proper scaling would fail to produce a semantically coherent outcome. In this paper, we propose SketchMod to refine the source stroke through transformation so as to align it with the target sketch’s patterns, further realize flexible sketch edit at stroke-level. As the source stroke refinement is governed by the patterns of the target sketch, we learn three key offset attributes (scale, orientation and position) from the source stroke to another, and align it with the target by: 1) resizing to match spatial proportions by scale, 2) rotating to align with local geometry by orientation, and 3) displacing to meet with semantic layout by position. Besides, a stroke’s profiles can be precisely controlled during sketch edit via the exposed captured stroke attributes. Experimental results indicate that SketchMod achieves precise and flexible performances on stroke-level sketch edit.

[373] HSSDCT: Factorized Spatial-Spectral Correlation for Hyperspectral Image Fusion

Chia-Ming Lee, Yu-Hao Ho, Yu-Fan Lin, Jen-Wei Lee, Li-Wei Kang, Chih-Chung Hsu

Main category: cs.CV

TL;DR: HSSDCT: A hierarchical spatial-spectral dense correlation network for hyperspectral image fusion that achieves state-of-the-art performance with linear complexity self-attention.

Details

Motivation: Current deep learning methods for HSI fusion suffer from limited receptive fields, redundant spectral bands, and quadratic complexity of self-attention, restricting both efficiency and robustness.

Method: Proposes HSSDCT with two key modules: 1) Hierarchical Dense-Residue Transformer Block (HDRTB) that progressively enlarges windows with dense-residue connections for multi-scale feature aggregation, and 2) Spatial-Spectral Correlation Layer (SSCL) that factorizes spatial and spectral dependencies to reduce self-attention to linear complexity while mitigating spectral redundancy.

Result: Extensive experiments on benchmark datasets demonstrate superior reconstruction quality with significantly lower computational costs, achieving new state-of-the-art performance in HSI fusion.

Conclusion: HSSDCT effectively addresses the limitations of existing methods by combining hierarchical multi-scale feature aggregation with efficient spatial-spectral correlation modeling, delivering both high performance and computational efficiency.

Abstract: Hyperspectral image (HSI) fusion aims to reconstruct a high-resolution HSI (HR-HSI) by combining the rich spectral information of a low-resolution HSI (LR-HSI) with the fine spatial details of a high-resolution multispectral image (HR-MSI). Although recent deep learning methods have achieved notable progress, they still suffer from limited receptive fields, redundant spectral bands, and the quadratic complexity of self-attention, which restrict both efficiency and robustness. To overcome these challenges, we propose the Hierarchical Spatial-Spectral Dense Correlation Network (HSSDCT). The framework introduces two key modules: (i) a Hierarchical Dense-Residue Transformer Block (HDRTB) that progressively enlarges windows and employs dense-residue connections for multi-scale feature aggregation, and (ii) a Spatial-Spectral Correlation Layer (SSCL) that explicitly factorizes spatial and spectral dependencies, reducing self-attention to linear complexity while mitigating spectral redundancy. Extensive experiments on benchmark datasets demonstrate that HSSDCT delivers superior reconstruction quality with significantly lower computational costs, achieving new state-of-the-art performance in HSI fusion. Our code is available at https://github.com/jemmyleee/HSSDCT.

[374] RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

Jiahe Wu, Bing Cao, Qilong Wang, Qinghua Hu, Dongdong Li, Pengfei Zhu

Main category: cs.CV

TL;DR: RGBX-R1 framework enhances MLLMs’ perception across diverse visual modalities (infrared, depth, event data) using UAV prompting and two-stage training with VM-CoT reasoning.

Details

Motivation: Current MLLMs are primarily pre-trained on RGB data, limiting their performance on other visual modalities crucial for complex scenarios like infrared, depth, and event data.

Method: Proposes Understand-Associate-Validate (UAV) prompting to create Visual Modality Chain-of-Thought (VM-CoT), and a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT) with Modality-understanding Spatio-Temporal (MuST) reward.

Result: Outperforms baselines by 22.71% on three RGBX grounding tasks and establishes the first RGBX-Grounding benchmark, demonstrating superior multimodal understanding and spatial perception.

Conclusion: RGBX-R1 effectively extends MLLMs’ capabilities beyond RGB to diverse visual modalities, enhancing perception and reasoning for complex real-world scenarios.

Abstract: Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLM’s perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs’ RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.

[375] Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models

Jingrui Zhang, Feng Liang, Yong Zhang, Wei Wang, Runhao Zeng, Xiping Hu

Main category: cs.CV

TL;DR: SparseCut introduces sparse shortcut connections between cross-modal encoder and LLM for hierarchical visual feature integration, enhancing multimodal understanding without computational overhead.

Details

Motivation: Current MLLMs often discard rich semantic information in mid- and low-level visual features when aligning modalities using only high-level features, limiting cross-modality understanding capabilities.

Method: Proposes SparseCut architecture with sparse shortcut connections between cross-modal encoder and LLM for hierarchical visual feature integration, plus efficient multi-grained feature fusion module that preserves language context without increasing input length.

Result: Experiments show SparseCut significantly enhances MLLM performance across various multimodal benchmarks with generality and scalability for different base LLMs.

Conclusion: SparseCut provides an effective cross-modal fusion architecture that enables richer semantic integration of visual features at multiple levels while maintaining computational efficiency.

Abstract: With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model’s ability of cross-modality understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs, introducing sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable the efficient and hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which performs the fusion of visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding an increase in computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks with generality and scalability for different base LLMs.

[376] DuoGen: Towards General Purpose Interleaved Multimodal Generation

Min Shi, Xiaohui Zeng, Jiannan Huang, Yin Cui, Francesco Ferroni, Jialuo Li, Shubham Pachori, Zhaoshuo Li, Yogesh Balaji, Haoxiang Wang, Tsung-Yi Lin, Xiao Fu, Yue Zhao, Chieh-Yun Chen, Ming-Yu Liu, Humphrey Shi

Main category: cs.CV

TL;DR: DuoGen is a general-purpose interleaved multimodal generation framework that combines visual understanding from a multimodal LLM with visual generation from a diffusion transformer, achieving state-of-the-art performance on text-to-image, image editing, and interleaved generation tasks.

Details

Motivation: Existing interleaved multimodal generation models have limited quality due to insufficient training data and base model capacity, despite the potential of interleaved generation for applications like instructional guides, visual planning, and reasoning.

Method: 1) Builds large-scale instruction-tuning dataset from curated multimodal conversations and synthetic examples; 2) Leverages pretrained multimodal LLM for visual understanding and diffusion transformer for visual generation; 3) Uses two-stage decoupled training: first instruction-tunes MLLM, then aligns DiT with curated interleaved image-text sequences.

Result: Outperforms prior open-source models in text quality, image fidelity, and image-context alignment across public and new benchmarks. Achieves SOTA performance on text-to-image and image editing among unified generation models.

Conclusion: DuoGen provides an effective framework for high-quality interleaved multimodal generation by systematically addressing data, architecture, and evaluation challenges, enabling flexible base model selection without costly unimodal pretraining.

Abstract: Interleaved multimodal generation enables capabilities beyond unimodal generation models, such as step-by-step instructional guides, visual planning, and generating visual drafts for reasoning. However, the quality of existing interleaved generation models under general instructions remains limited by insufficient training data and base model capacity. We present DuoGen, a general-purpose interleaved generation framework that systematically addresses data curation, architecture design, and evaluation. On the data side, we build a large-scale, high-quality instruction-tuning dataset by combining multimodal conversations rewritten from curated raw websites, and diverse synthetic examples covering everyday scenarios. Architecturally, DuoGen leverages the strong visual understanding of a pretrained multimodal LLM and the visual generation capabilities of a diffusion transformer (DiT) pretrained on video generation, avoiding costly unimodal pretraining and enabling flexible base model selection. A two-stage decoupled strategy first instruction-tunes the MLLM, then aligns DiT with it using curated interleaved image-text sequences. Across public and newly proposed benchmarks, DuoGen outperforms prior open-source models in text quality, image fidelity, and image-context alignment, and also achieves state-of-the-art performance on text-to-image and image editing among unified generation models. Data and code will be released at https://research.nvidia.com/labs/dir/duetgen/.

[377] SPARK: Stochastic Propagation via Affinity-guided Random walK for training-free unsupervised segmentation

Kunal Mahatha, Jose Dolz, Christian Desrosiers

Main category: cs.CV

TL;DR: Training-free segmentation reformulated as stochastic flow equilibrium over diffusion affinity graphs, using Markov propagation with adaptive pruning for better boundary preservation and stability.

Details

Motivation: Existing training-free segmentation methods rely on spectral graph partitioning over diffusion affinities, which has fundamental drawbacks: requires pre-selecting cluster numbers, causes boundary oversmoothing due to spectral relaxation, sensitive to noisy/multi-modal affinity distributions, and neglects local neighborhood structure crucial for stable affinity propagation and fine-grained contours.

Method: Reformulates training-free segmentation as stochastic flow equilibrium over diffusion-induced affinity graphs. Introduces Markov propagation scheme with random-walk-based label diffusion and adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Integrates global diffusion attention with local neighborhoods from stable diffusion for sparse yet expressive affinity structure.

Result: Achieves state-of-the-art zero-shot performance across seven widely used semantic segmentation benchmarks. Produces sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.

Conclusion: The stochastic flow equilibrium formulation with Markov propagation and adaptive pruning addresses limitations of spectral graph partitioning methods, providing better segmentation quality, boundary preservation, and stability without requiring training.

Abstract: We argue that existing training-free segmentation methods rely on an implicit and limiting assumption, that segmentation is a spectral graph partitioning problem over diffusion-derived affinities. Such approaches, based on global graph partitioning and eigenvector-based formulations of affinity matrices, suffer from several fundamental drawbacks, they require pre-selecting the number of clusters, induce boundary oversmoothing due to spectral relaxation, and remain highly sensitive to noisy or multi-modal affinity distributions. Moreover, many prior works neglect the importance of local neighborhood structure, which plays a crucial role in stabilizing affinity propagation and preserving fine-grained contours. To address these limitations, we reformulate training-free segmentation as a stochastic flow equilibrium problem over diffusion-induced affinity graphs, where segmentation emerges from a stochastic propagation process that integrates global diffusion attention with local neighborhoods extracted from stable diffusion, yielding a sparse yet expressive affinity structure. Building on this formulation, we introduce a Markov propagation scheme that performs random-walk-based label diffusion with an adaptive pruning strategy that suppresses unreliable transitions while reinforcing confident affinity paths. Experiments across seven widely used semantic segmentation benchmarks demonstrate that our method achieves state-of-the-art zero-shot performance, producing sharper boundaries, more coherent regions, and significantly more stable masks compared to prior spectral-clustering-based approaches.

[378] MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval

Chaoran Xu, Chengkan Lv, Qiyu Chen, Feng Zhang, Zhengtao Zhang

Main category: cs.CV

TL;DR: MRAD is a unified anomaly detection framework that replaces parametric model fitting with direct memory retrieval from a two-level memory bank, offering train-free and lightweight training variants for improved cross-domain stability.

Details

Motivation: Existing zero-shot anomaly detection methods often use prompt learning or complex modeling that requires high training/inference costs and has limited cross-domain stability. The authors aim to address these limitations by leveraging empirical data distributions directly rather than relying solely on model fitting.

Method: Proposes MRAD framework with three variants: (1) MRAD-TF (train-free) freezes CLIP image encoder and constructs two-level memory bank (image-level and pixel-level) from auxiliary data, storing feature-label pairs as keys/values; (2) MRAD-FT adds two linear layers to fine-tune retrieval metric; (3) MRAD-CLIP injects normal/anomalous region priors as dynamic biases into CLIP’s text prompts for better generalization.

Result: Demonstrates superior performance across 16 industrial and medical datasets for both anomaly classification and segmentation, under both train-free and training-based settings.

Conclusion: Fully leveraging empirical distribution of raw data through memory retrieval can achieve stronger anomaly detection performance than relying only on model fitting, offering better cross-domain stability and lower computational costs.

Abstract: Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with a direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Based on the MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomaly; (ii) MRAD-CLIP injects the normal and anomalous region priors from the MRAD-FT as dynamic biases into CLIP’s learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code will be publicly released at https://github.com/CROVO1026/MRAD.

[379] SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding

Yujia Tong, Tian Zhang, Yunyang Wan, Kaiwei Lin, Jingling Yuan, Chuang Hu

Main category: cs.CV

TL;DR: SAGE is a dynamic speculative decoding framework for vision-language models that adapts tree structure based on real-time prediction uncertainty to optimize acceptance lengths and accelerate inference.

Details

Motivation: Existing speculative decoding methods for VLMs use static tree structures that don't adapt to varying prediction difficulty across generation steps, leading to suboptimal acceptance lengths and limited speedup.

Method: SAGE dynamically adjusts speculation tree structure based on real-time prediction uncertainty, using output entropy as a confidence indicator. It constructs deeper-narrower trees for high-confidence predictions and shallower-wider trees for uncertain predictions.

Result: SAGE achieves up to 3.36× decoding speedup for LLaVA-OneVision-72B and 3.18× for Qwen2.5-VL-72B without any loss in output quality, outperforming static tree baselines.

Conclusion: Dynamic tree adaptation based on prediction uncertainty significantly improves speculative decoding efficiency for vision-language models, enabling faster inference while maintaining output quality.

Abstract: Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves faster acceleration compared to static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.

[380] Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment

Tianyi Zhang, Antoine Simoulin, Kai Li, Sana Lakdawala, Shiqing Yu, Arpit Mittal, Hongyu Fu, Yu Lin

Main category: cs.CV

TL;DR: VLDet is a novel open-vocabulary object detection framework that improves visual-language alignment through a revamped feature pyramid and sigmoid-based contrastive learning, achieving state-of-the-art performance on novel object categories.

Details

Motivation: Traditional object detection is limited to predefined categories, while open-vocabulary detection (OVD) aims to identify novel object classes. Existing approaches struggle with adapting CLIP's single-scale backbone to detection frameworks and maintaining robust visual-language alignment.

Method: VLDet introduces two key components: 1) VL-PUB module that revamps the feature pyramid for fine-grained visual-language alignment by exploiting CLIP’s knowledge and adapting the backbone for detection, and 2) SigRPN block with a sigmoid-based anchor-text contrastive alignment loss to improve novel category detection.

Result: Achieves 58.7 AP for novel classes on COCO2017 (27.6% improvement) and 24.8 AP on LVIS (6.9% improvement), surpassing all state-of-the-art methods. Also demonstrates superior zero-shot performance on closed-set object detection.

Conclusion: VLDet effectively bridges the gap between visual-language modeling and object detection, providing a robust framework for open-vocabulary detection with significant performance improvements over existing methods.

Abstract: Traditional object detection systems are typically constrained to predefined categories, limiting their applicability in dynamic environments. In contrast, open-vocabulary object detection (OVD) enables the identification of objects from novel classes not present in the training set. Recent advances in visual-language modeling have led to significant progress of OVD. However, prior works face challenges in either adapting the single-scale image backbone from CLIP to the detection framework or ensuring robust visual-language alignment. We propose Visual-Language Detection (VLDet), a novel framework that revamps feature pyramid for fine-grained visual-language alignment, leading to improved OVD performance. With the VL-PUB module, VLDet effectively exploits the visual-language knowledge from CLIP and adapts the backbone for object detection through feature pyramid. In addition, we introduce the SigRPN block, which incorporates a sigmoid-based anchor-text contrastive alignment loss to improve detection of novel categories. Through extensive experiments, our approach achieves 58.7 AP for novel classes on COCO2017 and 24.8 AP on LVIS, surpassing all state-of-the-art methods and achieving significant improvements of 27.6% and 6.9%, respectively. Furthermore, VLDet also demonstrates superior zero-shot performance on closed-set object detection.

[381] SADER: Structure-Aware Diffusion Framework with DEterministic Resampling for Multi-Temporal Remote Sensing Cloud Removal

Yifan Zhang, Qian Chen, Yi Liu, Wengen Li, Jihong Guan

Main category: cs.CV

TL;DR: SADER is a structure-aware diffusion framework for multi-temporal remote sensing cloud removal that improves sampling efficiency and leverages temporal priors through temporal fusion, hybrid attention, and guided resampling.

Details

Motivation: Cloud contamination severely degrades remote sensing imagery usability. Existing diffusion-based approaches suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal scenarios.

Method: Proposes SADER with: 1) Multi-Temporal Conditional Diffusion Network (MTCDN) using temporal fusion and hybrid attention to capture multi-temporal correlations, 2) Cloud-aware attention loss emphasizing cloud-dominated regions based on thickness and brightness, 3) Deterministic resampling strategy for iterative refinement under fixed sampling steps.

Result: Extensive experiments on multiple multi-temporal datasets show SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics.

Conclusion: SADER effectively addresses cloud removal in remote sensing by combining temporal modeling, attention mechanisms, and efficient sampling strategies, with publicly available code.

Abstract: Cloud contamination severely degrades the usability of remote sensing imagery and poses a fundamental challenge for downstream Earth observation tasks. Recently, diffusion-based models have emerged as a dominant paradigm for remote sensing cloud removal due to their strong generative capability and stable optimization. However, existing diffusion-based approaches often suffer from limited sampling efficiency and insufficient exploitation of structural and temporal priors in multi-temporal remote sensing scenarios. In this work, we propose SADER, a structure-aware diffusion framework for multi-temporal remote sensing cloud removal. SADER first develops a scalable Multi-Temporal Conditional Diffusion Network (MTCDN) to fully capture multi-temporal and multimodal correlations via temporal fusion and hybrid attention. Then, a cloud-aware attention loss is introduced to emphasize cloud-dominated regions by accounting for cloud thickness and brightness discrepancies. In addition, a deterministic resampling strategy is designed for continuous diffusion models to iteratively refine samples under fixed sampling steps by replacing outliers through guided correction. Extensive experiments on multiple multi-temporal datasets demonstrate that SADER consistently outperforms state-of-the-art cloud removal methods across all evaluation metrics. The code of SADER is publicly available at https://github.com/zyfzs0/SADER.

[382] NPNet: A Non-Parametric Network with Adaptive Gaussian-Fourier Positional Encoding for 3D Classification and Segmentation

Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari, Mert D. Pesé

Main category: cs.CV

TL;DR: NPNet is a fully non-parametric 3D point cloud classification and segmentation method using deterministic operators and adaptive Gaussian-Fourier positional encoding without learned weights.

Details

Motivation: To develop a 3D point cloud processing method that doesn't require learned parameters, remains stable across different scales and sampling densities, and performs well in few-shot settings while being computationally efficient.

Method: Uses deterministic operators (farthest point sampling, k-nearest neighbors, pooling) with adaptive Gaussian-Fourier positional encoding whose parameters are chosen from input geometry. For segmentation, adds fixed-frequency Fourier features for global context.

Result: Achieves strong performance on ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart among non-parametric baselines, particularly effective in few-shot settings on ModelNet40, with favorable memory use and inference time.

Conclusion: NPNet demonstrates that effective 3D point cloud processing can be achieved without learned parameters through careful design of deterministic operators and adaptive positional encodings.

Abstract: We present NPNet, a fully non-parametric approach for 3D point-cloud classification and part segmentation. NPNet contains no learned weights; instead, it builds point features using deterministic operators such as farthest point sampling, k-nearest neighbors, and pooling. Our key idea is an adaptive Gaussian-Fourier positional encoding whose bandwidth and Gaussian-cosine mixing are chosen from the input geometry, helping the method remain stable across different scales and sampling densities. For segmentation, we additionally incorporate fixed-frequency Fourier features to provide global context alongside the adaptive encoding. Across ModelNet40/ModelNet-R, ScanObjectNN, and ShapeNetPart, NPNet achieves strong performance among non-parametric baselines, and it is particularly effective in few-shot settings on ModelNet40. NPNet also offers favorable memory use and inference time compared to prior non-parametric methods

[383] Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

Wenbin Xing, Quanxing Zha, Lizheng Zu, Mengran Li, Ming Li, Junchi Yan

Main category: cs.CV

TL;DR: OmniVCHall benchmark for evaluating isolated and compositional hallucinations in video multimodal LLMs, with TriCD contrastive decoding framework for mitigation.

Details

Motivation: Current research focuses on isolated error types in video hallucination, leaving compositional hallucinations (incorrect reasoning over multiple spatial-temporal factors) underexplored. Need systematic evaluation and mitigation approaches.

Method: 1) OmniVCHall benchmark with diverse video domains, novel camera-based hallucination type, fine-grained taxonomy, and adversarial answer options. 2) TriCD framework: contrastive decoding with triple-pathway calibration, adaptive perturbation controller for negative variants, saliency-guided enhancement for visual evidence, optimized via reinforcement learning.

Result: Evaluation of 39 VLLMs shows even advanced models (Qwen3-VL, GPT-5) have substantial performance degradation. TriCD improves performance across two backbones with average accuracy improvement over 10%.

Conclusion: Compositional hallucinations are critical challenge in VLLMs. OmniVCHall enables systematic evaluation, and TriCD provides effective mitigation framework with consistent improvements across models.

Abstract: Current research on video hallucination mitigation primarily focuses on isolated error types, leaving compositional hallucinations, arising from incorrect reasoning over multiple interacting spatial and temporal factors largely underexplored. We introduce OmniVCHall, a benchmark designed to systematically evaluate both isolated and compositional hallucinations in video multimodal large language models (VLLMs). OmniVCHall spans diverse video domains, introduces a novel camera-based hallucination type, and defines a fine-grained taxonomy, together with adversarial answer options (e.g., “All are correct” and “None of the above”) to prevent shortcut reasoning. The evaluations of 39 representative VLLMs reveal that even advanced models (e.g., Qwen3-VL and GPT-5) exhibit substantial performance degradation. We propose TriCD, a contrastive decoding framework with a triple-pathway calibration mechanism. An adaptive perturbation controller dynamically selects distracting operations to construct negative video variants, while a saliency-guided enhancement module adaptively reinforces grounded token-wise visual evidences. These components are optimized via reinforcement learning to encourage precise decision-making under compositional hallucination settings. Experimental results show that TriCD consistently improves performance across two representative backbones, achieving an average accuracy improvement of over 10%. The data and code can be find at https://github.com/BMRETURN/OmniVCHall.

[384] GLAD: Generative Language-Assisted Visual Tracking for Low-Semantic Templates

Xingyu Luo, Yidong Cai, Jie Liu, Jie Tang, Gangshan Wu, Limin Wang

Main category: cs.CV

TL;DR: GLAD is a generative language-assisted tracking model that uses diffusion models to fuse text descriptions and template images, enhancing cross-modal compatibility and improving tracking performance on low-semantic images.

Details

Motivation: Current vision-language trackers struggle with low-semantic images (blurry, low-resolution) which degrade cross-modal understanding. Direct concatenation of textual and visual features has limited effectiveness due to the gap between modalities.

Method: Proposes GLAD, a generative language-assisted tracking model using diffusion models for multi-modal fusion of text descriptions and template images to enhance compatibility between language and image modalities and improve template image semantics.

Result: Achieves state-of-the-art performance on multiple benchmarks with impressive inference speed. Blurry and semantically ambiguous template images can be effectively restored through the generative fusion paradigm.

Conclusion: GLAD demonstrates superior performance over existing fusion paradigms by using generative diffusion models for multi-modal fusion, effectively addressing challenges in vision-language tracking with low-semantic images.

Abstract: Vision-language tracking has gained increasing attention in many scenarios. This task simultaneously deals with visual and linguistic information to localize objects in videos. Despite its growing utility, the development of vision-language tracking methods remains in its early stage. Current vision-language trackers usually employ Transformer architectures for interactive integration of template, search, and text features. However, persistent challenges about low-semantic images including prevalent image blurriness, low resolution and so on, may compromise model performance through degraded cross-modal understanding. To solve this problem, language assistance is usually used to deal with the obstacles posed by low-semantic images. However, due to the existing gap between current textual and visual features, direct concatenation and fusion of these features may have limited effectiveness. To address these challenges, we introduce a pioneering Generative Language-AssisteD tracking model, GLAD, which utilizes diffusion models for the generative multi-modal fusion of text description and template image to bolster compatibility between language and image and enhance template image semantic information. Our approach demonstrates notable improvements over the existing fusion paradigms. Blurry and semantically ambiguous template images can be restored to improve multi-modal features in the generative fusion paradigm. Experiments show that our method establishes a new state-of-the-art on multiple benchmarks and achieves an impressive inference speed. The code and models will be released at: https://github.com/Confetti-lxy/GLAD

[385] Bridging Degradation Discrimination and Generation for Universal Image Restoration

JiaKui Hu, Zhengjian Yao, Lujia Jin, Yanye Lu

Main category: cs.CV

TL;DR: BDG is a universal image restoration method that combines degradation discrimination via MAS-GLCM with diffusion-based generation in a three-stage training process to handle multiple degradations while preserving texture details.

Details

Motivation: Universal image restoration faces challenges in sampling high-quality image distributions and adjusting outputs based on degradation types/levels. Existing methods struggle with fine-grained degradation discrimination while maintaining rich texture restoration capabilities.

Method: Proposes BDG with two key components: 1) MAS-GLCM for fine-grained degradation type/level discrimination, and 2) Three-stage diffusion training (generation, bridging, restoration) that integrates discriminative information into restoration while preserving texture generation capabilities.

Result: Achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, with substantial improvements in fidelity without compromising perceptual quality, without architectural changes.

Conclusion: BDG effectively bridges degradation discrimination and generation, demonstrating strong performance in multi-task, multi-degradation scenarios through the integration of MAS-GLCM features with diffusion-based restoration.

Abstract: Universal image restoration is a critical task in low-level vision, requiring the model to remove various degradations from low-quality images to produce clean images with rich detail. The challenges lie in sampling the distribution of high-quality images and adjusting the outputs on the basis of the degradation. This paper presents a novel approach, Bridging Degradation discrimination and Generation (BDG), which aims to address these challenges concurrently. First, we propose the Multi-Angle and multi-Scale Gray Level Co-occurrence Matrix (MAS-GLCM) and demonstrate its effectiveness in performing fine-grained discrimination of degradation types and levels. Subsequently, we divide the diffusion training process into three distinct stages: generation, bridging, and restoration. The objective is to preserve the diffusion model’s capability of restoring rich textures while simultaneously integrating the discriminative information from the MAS-GLCM into the restoration process. This enhances its proficiency in addressing multi-task and multi-degraded scenarios. Without changing the architecture, BDG achieves significant performance gains in all-in-one restoration and real-world super-resolution tasks, primarily evidenced by substantial improvements in fidelity without compromising perceptual quality. The code and pretrained models are provided in https://github.com/MILab-PKU/BDG.

[386] MAUGen: A Unified Diffusion Approach for Multi-Identity Facial Expression and AU Label Generation

Xiangdong Li, Ye Lou, Ao Gao, Wei Zhang, Siyang Song

Main category: cs.CV

TL;DR: MAUGen is a diffusion-based multimodal framework that generates photorealistic facial expressions with precise Action Unit annotations from text prompts, creating a large-scale synthetic dataset for AU recognition.

Details

Motivation: The paper addresses the fundamental bottleneck in developing generalizable Action Unit recognition systems: the lack of large-scale, demographically diverse face images with precise AU occurrence and intensity annotations.

Method: Proposes MAUGen with two key modules: (1) Multi-modal Representation Learning (MRL) that captures relationships among textual descriptions, facial identity, expression images, and AU activations in a unified latent space; (2) Diffusion-based Image label Generator (DIG) that decodes joint representations into aligned facial image-label pairs across diverse identities.

Result: Introduces MIFA, a large-scale multimodal synthetic dataset with comprehensive AU annotations and identity variations. MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images with semantically aligned AU labels.

Conclusion: MAUGen successfully addresses the data scarcity problem in AU recognition by generating high-quality synthetic facial expression data with precise anatomical annotations, enabling more robust and generalizable facial expression analysis systems.

Abstract: The lack of large-scale, demographically diverse face images with precise Action Unit (AU) occurrence and intensity annotations has long been recognized as a fundamental bottleneck in developing generalizable AU recognition systems. In this paper, we propose MAUGen, a diffusion-based multi-modal framework that jointly generates a large collection of photorealistic facial expressions and anatomically consistent AU labels, including both occurrence and intensity, conditioned on a single descriptive text prompt. Our MAUGen involves two key modules: (1) a Multi-modal Representation Learning (MRL) module that captures the relationships among the paired textual description, facial identity, expression image, and AU activations within a unified latent space; and (2) a Diffusion-based Image label Generator (DIG) that decodes the joint representation into aligned facial image-label pairs across diverse identities. Under this framework, we introduce Multi-Identity Facial Action (MIFA), a large-scale multimodal synthetic dataset featuring comprehensive AU annotations and identity variations. Extensive experiments demonstrate that MAUGen outperforms existing methods in synthesizing photorealistic, demographically diverse facial images along with semantically aligned AU labels.

[387] From Pixels to Facts (Pix2Fact): Benchmarking Multi-Hop Reasoning for Fine-Grained Visual Fact Checking

Yifan Jiang, Cong Zhang, Bofei Zhang, Yifan Yang, Bingzhang Wang, Yew-Soon Ong

Main category: cs.CV

TL;DR: Pix2Fact is a new VQA benchmark requiring detailed visual grounding and knowledge-intensive multi-hop reasoning, where current VLMs achieve only 24% accuracy vs human 56%.

Details

Motivation: Existing benchmarks evaluate visual grounding and knowledge-based reasoning separately, but real-world challenges require their synergy. There's a need for benchmarks that test expert-level perception combined with deliberate knowledge-intensive reasoning.

Method: Created Pix2Fact benchmark with 1,000 high-resolution (4K+) images across 8 daily-life scenarios. Questions and answers were meticulously crafted by PhD annotators from top universities in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and external knowledge integration.

Result: Evaluation of 9 state-of-the-art VLMs (including proprietary models like Gemini-3-Pro and GPT-5) shows substantial challenge: most advanced model achieves only 24.0% average accuracy, compared to human performance of 56%. This reveals significant limitations in current models’ visual comprehension.

Conclusion: Pix2Fact exposes critical limitations in current VLMs’ ability to combine fine-grained perception with robust knowledge-based reasoning. The benchmark will drive development of next-generation multimodal agents that can better replicate human-level visual comprehension.

Abstract: Despite progress on general tasks, VLMs struggle with challenges demanding both detailed visual grounding and deliberate knowledge-based reasoning, a synergy not captured by existing benchmarks that evaluate these skills separately. To close this gap, we introduce Pix2Fact, a new visual question-answering benchmark designed to evaluate expert-level perception and knowledge-intensive multi-hop reasoning. Pix2Fact contains 1,000 high-resolution (4K+) images spanning 8 daily-life scenarios and situations, with questions and answers meticulously crafted by annotators holding PhDs from top global universities working in partnership with a professional data annotation firm. Each question requires detailed visual grounding, multi-hop reasoning, and the integration of external knowledge to answer. Our evaluation of 9 state-of-the-art VLMs, including proprietary models like Gemini-3-Pro and GPT-5, reveals the substantial challenge posed by Pix2Fact: the most advanced model achieves only 24.0% average accuracy, in stark contrast to human performance of 56%. This significant gap underscores the limitations of current models in replicating human-level visual comprehension. We believe Pix2Fact will serve as a critical benchmark to drive the development of next-generation multimodal agents that combine fine-grained perception with robust, knowledge-based reasoning.

[388] Tune-Your-Style: Intensity-tunable 3D Style Transfer with Gaussian Splatting

Yian Zhao, Rushi Ye, Ruochong Zheng, Zesen Cheng, Chaoran Feng, Jiashu Yang, Pengchong Qiao, Chang Liu, Jie Chen

Main category: cs.CV

TL;DR: Tune-Your-Style introduces an intensity-tunable 3D style transfer method using Gaussian splatting that allows users to flexibly adjust style intensity for customized content-style balance.

Details

Motivation: Existing 3D style transfer methods using 3D Gaussian Splatting (3DGS) have fixed output paradigms that cannot adapt to diverse user preferences for content-style balance, limiting customizability.

Method: Introduces Gaussian neurons to model style intensity explicitly, parameterizes a learnable style tuner, and uses tunable stylization guidance with cross-view style alignment from diffusion models and two-stage optimization.

Result: The method delivers visually appealing results with flexible customizability for 3D style transfer, allowing users to adjust style intensity to match their desired content-style balance.

Conclusion: Tune-Your-Style provides an effective intensity-tunable paradigm for 3D style transfer that enhances customizability while maintaining visual quality.

Abstract: 3D style transfer refers to the artistic stylization of 3D assets based on reference style images. Recently, 3DGS-based stylization methods have drawn considerable attention, primarily due to their markedly enhanced training and rendering speeds. However, a vital challenge for 3D style transfer is to strike a balance between the content and the patterns and colors of the style. Although the existing methods strive to achieve relatively balanced outcomes, the fixed-output paradigm struggles to adapt to the diverse content-style balance requirements from different users. In this work, we introduce a creative intensity-tunable 3D style transfer paradigm, dubbed \textbf{Tune-Your-Style}, which allows users to flexibly adjust the style intensity injected into the scene to match their desired content-style balance, thus enhancing the customizability of 3D style transfer. To achieve this goal, we first introduce Gaussian neurons to explicitly model the style intensity and parameterize a learnable style tuner to achieve intensity-tunable style injection. To facilitate the learning of tunable stylization, we further propose the tunable stylization guidance, which obtains multi-view consistent stylized views from diffusion models through cross-view style alignment, and then employs a two-stage optimization strategy to provide stable and efficient guidance by modulating the balance between full-style guidance from the stylized views and zero-style guidance from the initial rendering. Extensive experiments demonstrate that our method not only delivers visually appealing results, but also exhibits flexible customizability for 3D style transfer. Project page is available at https://zhao-yian.github.io/TuneStyle.

[389] Towards Interpretable Hallucination Analysis and Mitigation in LVLMs via Contrastive Neuron Steering

Guangtao Lyu, Xinyi Cheng, Qi Liu, Chenghao Xu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng

Main category: cs.CV

TL;DR: The paper introduces Contrastive Neuron Steering (CNS) to mitigate hallucinations in LVLMs by analyzing and manipulating sparse interpretable neurons in visual embeddings, identifying that hallucinations arise from disruptions in image-specific neurons.

Details

Motivation: LVLMs suffer from hallucinations despite strong multimodal capabilities. Current methods focus on output-level adjustments without exploring internal mechanisms. The paper aims to understand and address hallucinations at the representation level.

Method: Uses sparse autoencoders to decompose dense visual embeddings into interpretable neurons. Identifies neuron types (always-on vs image-specific). Proposes Contrastive Neuron Steering (CNS) that uses contrastive analysis between clean/noisy inputs to identify image-specific neurons, then selectively amplifies informative neurons while suppressing perturbation-induced activations.

Result: CNS reduces hallucinations while preserving multimodal understanding. Experiments on hallucination-focused and general multimodal benchmarks show consistent improvement. The method operates at prefilling stage and is compatible with existing decoding-stage methods.

Conclusion: Hallucinations in LVLMs stem from disruptions in image-specific neurons. CNS provides an effective representation-level intervention that improves visual grounding and reduces hallucinations through neuron-level manipulation.

Abstract: LVLMs achieve remarkable multimodal understanding and generation but remain susceptible to hallucinations. Existing mitigation methods predominantly focus on output-level adjustments, leaving the internal mechanisms that give rise to these hallucinations largely unexplored. To gain a deeper understanding, we adopt a representation-level perspective by introducing sparse autoencoders (SAEs) to decompose dense visual embeddings into sparse, interpretable neurons. Through neuron-level analysis, we identify distinct neuron types, including always-on neurons and image-specific neurons. Our findings reveal that hallucinations often result from disruptions or spurious activations of image-specific neurons, while always-on neurons remain largely stable. Moreover, selectively enhancing or suppressing image-specific neurons enables controllable intervention in LVLM outputs, improving visual grounding and reducing hallucinations. Building on these insights, we propose Contrastive Neuron Steering (CNS), which identifies image-specific neurons via contrastive analysis between clean and noisy inputs. CNS selectively amplifies informative neurons while suppressing perturbation-induced activations, producing more robust and semantically grounded visual representations. This not only enhances visual understanding but also effectively mitigates hallucinations. By operating at the prefilling stage, CNS is fully compatible with existing decoding-stage methods. Extensive experiments on both hallucination-focused and general multimodal benchmarks demonstrate that CNS consistently reduces hallucinations while preserving overall multimodal understanding.

[390] FaceSnap: Enhanced ID-fidelity Network for Tuning-free Portrait Customization

Benxiang Zhai, Yifang Xu, Guofeng Zhang, Yang Li, Sidan Du

Main category: cs.CV

TL;DR: FaceSnap: A plug-and-play method for personalized portrait generation using Stable Diffusion that achieves high facial fidelity with single reference image and single inference, featuring Facial Attribute Mixer and Landmark Predictor.

Details

Motivation: Existing personalized image generation methods for portraits either require time-consuming fine-tuning and lack generalizability, or fail to achieve high fidelity in facial details. There's a need for efficient single-image methods that maintain facial consistency.

Method: Based on Stable Diffusion, FaceSnap uses a Facial Attribute Mixer to extract comprehensive fused information from both low-level specific features and high-level abstract features. It includes a Landmark Predictor to maintain reference identity across different poses, and an ID-preserving module to inject these into the UNet architecture.

Result: Experimental results show FaceSnap performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain with high fidelity facial details.

Conclusion: FaceSnap provides an efficient, plug-and-play solution for personalized portrait generation that requires only a single reference image and produces consistent results in a single inference stage, with strong generalization capabilities.

Abstract: Benefiting from the significant advancements in text-to-image diffusion models, research in personalized image generation, particularly customized portrait generation, has also made great strides recently. However, existing methods either require time-consuming fine-tuning and lack generalizability or fail to achieve high fidelity in facial details. To address these issues, we propose FaceSnap, a novel method based on Stable Diffusion (SD) that requires only a single reference image and produces extremely consistent results in a single inference stage. This method is plug-and-play and can be easily extended to different SD models. Specifically, we design a new Facial Attribute Mixer that can extract comprehensive fused information from both low-level specific features and high-level abstract features, providing better guidance for image generation. We also introduce a Landmark Predictor that maintains reference identity across landmarks with different poses, providing diverse yet detailed spatial control conditions for image generation. Then we use an ID-preserving module to inject these into the UNet. Experimental results demonstrate that our approach performs remarkably in personalized and customized portrait generation, surpassing other state-of-the-art methods in this domain.

[391] S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning

Lingsong Wang, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

Main category: cs.CV

TL;DR: S³POT is a contrast-driven framework for face occlusion segmentation that synergizes face generation with self-supervised spatial prompting, eliminating the need for occlusion ground truth masks.

Details

Motivation: Existing face parsing methods struggle with occlusions because occlusion is a high-level concept not tied to specific object categories, making comprehensive real-world dataset collection and accurate mask annotation impractical.

Method: Three-module framework: 1) Reference Generation produces occlusion-free face using structural guidance from parsed masks, 2) Feature Enhancement contrasts tokens between raw/reference images for initial prompts and modifies features via cross-attention, 3) Prompt Selection constructs positive/negative prompts and screens them with self-attention network for mask decoder, trained with three novel objective functions without occlusion ground truth.

Result: Extensive experiments on a dedicatedly collected dataset demonstrate S³POT’s superior performance and effectiveness of each module.

Conclusion: S³POT successfully addresses face occlusion segmentation by leveraging face generation capabilities and foundation segmentation models without requiring occlusion ground truth masks.

Abstract: Existing face parsing methods usually misclassify occlusions as facial components. This is because occlusion is a high-level concept, it does not refer to a concrete category of object. Thus, constructing a real-world face dataset covering all categories of occlusion object is almost impossible and accurate mask annotation is labor-intensive. To deal with the problems, we present S$^3$POT, a contrast-driven framework synergizing face generation with self-supervised spatial prompting, to achieve occlusion segmentation. The framework is inspired by the insights: 1) Modern face generators’ ability to realistically reconstruct occluded regions, creating an image that preserve facial geometry while eliminating occlusion, and 2) Foundation segmentation models’ (e.g., SAM) capacity to extract precise mask when provided with appropriate prompts. In particular, S$^3$POT consists of three modules: Reference Generation (RF), Feature enhancement (FE), and Prompt Selection (PS). First, a reference image is produced by RF using structural guidance from parsed mask. Second, FE performs contrast of tokens between raw and reference images to obtain an initial prompt, then modifies image features with the prompt by cross-attention. Third, based on the enhanced features, PS constructs a set of positive and negative prompts and screens them with a self-attention network for a mask decoder. The network is learned under the guidance of three novel and complementary objective functions without occlusion ground truth mask involved. Extensive experiments on a dedicatedly collected dataset demonstrate S$^3$POT’s superior performance and the effectiveness of each module.

[392] VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning

Vivek Madhavaram, Vartika Sengar, Arkadipta De, Charu Sharma

Main category: cs.CV

TL;DR: VIZOR is a training-free framework for generating viewpoint-invariant 3D scene graphs directly from raw 3D scenes, using object-centric spatial relationships and open-vocabulary reasoning without annotated data.

Details

Motivation: Existing 3D scene understanding methods struggle with generalization and produce inaccurate spatial relationships that become inconsistent across different viewpoints, limiting their practical utility.

Method: VIZOR constructs dense 3D scene graphs directly from raw 3D scenes without training, defining spatial relationships relative to each object’s front-facing direction for viewpoint invariance, and infers open-vocabulary relationships without annotated training data.

Result: VIZOR outperforms state-of-the-art methods in scene graph generation and achieves 22% and 4.81% gains in zero-shot grounding accuracy on Replica and Nr3D datasets respectively.

Conclusion: VIZOR provides an effective training-free solution for generating unambiguous, viewpoint-invariant 3D scene graphs with improved generalization and accuracy in downstream tasks.

Abstract: Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and spatial or comparative relationships among the objects. Existing approaches enable this by creating scene graphs using multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like “left/right”, which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are defined relative to each object’s front-facing direction, making them consistent regardless of the reference view. Furthermore, it infers open-vocabulary relationships that describe spatial and proximity relationships among scene objects without requiring annotated training data. We conduct extensive quantitative and qualitative evaluations to assess the effectiveness of VIZOR in scene graph generation and downstream tasks, such as query-based object grounding. VIZOR outperforms state-of-the-art methods, showing clear improvements in scene graph generation and achieving 22% and 4.81% gains in zero-shot grounding accuracy on the Replica and Nr3D datasets, respectively.

[393] Diff-PC: Identity-preserving and 3D-aware Controllable Diffusion for Zero-shot Portrait Customization

Yifang Xu, Benxiang Zhai, Chenyu Zhang, Ming Li, Yang Li, Sidan Du

Main category: cs.CV

TL;DR: Diff-PC: A diffusion-based framework for zero-shot portrait customization that generates realistic portraits with high identity fidelity, specified facial attributes, and diverse backgrounds using 3D facial priors and specialized ID modules.

Details

Motivation: Existing portrait customization methods lack precise identity preservation and facial control, creating a need for a solution that can generate realistic portraits while maintaining high identity fidelity and allowing for specific facial attribute control.

Method: Uses 3D face predictor to reconstruct 3D-aware facial priors (reference ID, target expressions, poses). Includes ID-Encoder for local/global facial feature fusion, ID-Ctrl for aligning ID features using 3D face guidance, and ID-Injector for enhanced ID fidelity and facial controllability. Trained on collected ID-centric dataset.

Result: Extensive experiments show Diff-PC surpasses state-of-the-art methods in ID preservation, facial control, and text-to-image consistency. Also compatible with multi-style foundation models.

Conclusion: Diff-PC effectively addresses limitations in existing portrait customization methods by providing precise identity preservation and facial control through a diffusion-based framework with 3D facial priors and specialized ID modules.

Abstract: Portrait customization (PC) has recently garnered significant attention due to its potential applications. However, existing PC methods lack precise identity (ID) preservation and face control. To address these tissues, we propose Diff-PC, a diffusion-based framework for zero-shot PC, which generates realistic portraits with high ID fidelity, specified facial attributes, and diverse backgrounds. Specifically, our approach employs the 3D face predictor to reconstruct the 3D-aware facial priors encompassing the reference ID, target expressions, and poses. To capture fine-grained face details, we design ID-Encoder that fuses local and global facial features. Subsequently, we devise ID-Ctrl using the 3D face to guide the alignment of ID features. We further introduce ID-Injector to enhance ID fidelity and facial controllability. Finally, training on our collected ID-centric dataset improves face similarity and text-to-image (T2I) alignment. Extensive experiments demonstrate that Diff-PC surpasses state-of-the-art methods in ID preservation, facial control, and T2I consistency. Furthermore, our method is compatible with multi-style foundation models.

[394] A Hybrid Mamba-SAM Architecture for Efficient 3D Medical Image Segmentation

Mohammadreza Gholipour Shahraki, Mehdi Rezaeian, Mohammad Ghasemzadeh

Main category: cs.CV

TL;DR: Mamba-SAM combines frozen SAM encoder with Mamba SSMs for efficient 3D medical image segmentation, using dual-branch and adapter-based approaches with novel frequency-domain enhancements.

Details

Motivation: Foundation models like SAM struggle with medical imaging due to domain shift, 2D design limitations, and high fine-tuning costs. Need efficient adaptation for 3D medical segmentation.

Method: Two parameter-efficient strategies: 1) Dual-branch architecture fusing frozen SAM features with trainable VMamba encoder via cross-attention; 2) Adapter-based approach injecting lightweight Tri-Plane Mamba modules into SAM ViT encoder. Introduces Multi-Frequency Gated Convolution for spatial-frequency analysis.

Result: On ACDC cardiac MRI: Dual-branch achieves 0.906 Dice (comparable to UNet++ 0.907), outperforms baselines on Myocardium (0.910) and Left Ventricle (0.971). Adapter-based variant offers 4.77 FPS with 0.880 Dice.

Conclusion: Hybridizing foundation models with efficient SSM-based architectures provides practical solution for 3D medical image segmentation, balancing accuracy and computational efficiency.

Abstract: Accurate segmentation of 3D medical images such as MRI and CT is essential for clinical diagnosis and treatment planning. Foundation models like the Segment Anything Model (SAM) provide powerful general-purpose representations but struggle in medical imaging due to domain shift, their inherently 2D design, and the high computational cost of fine-tuning. To address these challenges, we propose Mamba-SAM, a novel and efficient hybrid architecture that combines a frozen SAM encoder with the linear-time efficiency and long-range modeling capabilities of Mamba-based State Space Models (SSMs). We investigate two parameter-efficient adaptation strategies. The first is a dual-branch architecture that explicitly fuses general features from a frozen SAM encoder with domain-specific representations learned by a trainable VMamba encoder using cross-attention. The second is an adapter-based approach that injects lightweight, 3D-aware Tri-Plane Mamba (TPMamba) modules into the frozen SAM ViT encoder to implicitly model volumetric context. Within this framework, we introduce Multi-Frequency Gated Convolution (MFGC), which enhances feature representation by jointly analyzing spatial and frequency-domain information via 3D discrete cosine transforms and adaptive gating. Extensive experiments on the ACDC cardiac MRI dataset demonstrate the effectiveness of the proposed methods. The dual-branch Mamba-SAM-Base model achieves a mean Dice score of 0.906, comparable to UNet++ (0.907), while outperforming all baselines on Myocardium (0.910) and Left Ventricle (0.971) segmentation. The adapter-based TP MFGC variant offers superior inference speed (4.77 FPS) with strong accuracy (0.880 Dice). These results show that hybridizing foundation models with efficient SSM-based architectures provides a practical and effective solution for 3D medical image segmentation.

[395] Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment

Lukas Kuhn, Giuseppe Serra, Florian Buettner

Main category: cs.CV

TL;DR: NOVA is a non-contrastive vision-language alignment framework that predicts text embeddings from augmented image views with distributional regularization, eliminating negative sampling and complex training requirements.

Details

Motivation: Contrastive vision-language models like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning, which can be unstable and computationally expensive. The authors aim to develop a simpler, more stable alternative.

Method: NOVA aligns visual representations to a frozen text encoder by predicting text embeddings from augmented image views. It uses Sketched Isotropic Gaussian Regularization (SIGReg) to enforce an isotropic Gaussian structure, eliminating negative sampling, momentum encoders, and stop-gradients.

Result: NOVA outperforms multiple standard baselines on zero-shot chest X-ray classification across three benchmark datasets, while exhibiting substantially more consistent training runs with reduced hyperparameter sensitivity.

Conclusion: Non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods, particularly in specialized domains like medical imaging.

Abstract: Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zeroshot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.

[396] Schrödinger-Inspired Time-Evolution for 4D Deformation Forecasting

Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Elke Ruth Gizewski, Astrid Ellen Grams

Main category: cs.CV

TL;DR: A physics-guided neural architecture for 4D spatiotemporal forecasting that embeds a Schrödinger-type evolution operator within a deep convolutional framework to predict future volumetric sequences with temporal stability and interpretability.

Details

Motivation: Current neural forecasting models for complex 3D+time phenomena lack temporal stability, interpretability, and physical consistency, especially for applications like medical imaging where anatomical fidelity is crucial.

Method: Proposes a Schrödinger-inspired architecture that learns voxelwise amplitude, phase, and potential fields defining a complex-valued wavefunction ψ = Ae^{iφ}, which is evolved forward using a differentiable, unrolled Schrödinger time stepper within a deep convolutional framework.

Result: Demonstrates accurate and stable prediction of future 4D states (volumetric intensities and deformation fields) on synthetic benchmarks with realistic shape deformations and topological changes, showing improved temporal stability and interpretability.

Conclusion: The approach successfully integrates physical priors into deep learning, combining neural network expressivity with physics-based modeling robustness, offering interpretable, stable, and anatomically consistent spatiotemporal prediction for 4D forecasting.

Abstract: Spatiotemporal forecasting of complex three-dimensional phenomena (4D: 3D + time) is fundamental to applications in medical imaging, fluid and material dynamics, and geophysics. In contrast to unconstrained neural forecasting models, we propose a Schrödinger-inspired, physics-guided neural architecture that embeds an explicit time-evolution operator within a deep convolutional framework for 4D prediction. From observed volumetric sequences, the model learns voxelwise amplitude, phase, and potential fields that define a complex-valued wavefunction $ψ= A e^{iφ}$, which is evolved forward in time using a differentiable, unrolled Schrödinger time stepper. This physics-guided formulation yields several key advantages: (i) temporal stability arising from the structured evolution operator, which mitigates drift and error accumulation in long-horizon forecasting; (ii) an interpretable latent representation, where phase encodes transport dynamics, amplitude captures structural intensity, and the learned potential governs spatiotemporal interactions; and (iii) natural compatibility with deformation-based synthesis, which is critical for preserving anatomical fidelity in medical imaging applications. By integrating physical priors directly into the learning process, the proposed approach combines the expressivity of deep networks with the robustness and interpretability of physics-based modeling. We demonstrate accurate and stable prediction of future 4D states, including volumetric intensities and deformation fields, on synthetic benchmarks that emulate realistic shape deformations and topological changes. To our knowledge, this is the first end-to-end 4D neural forecasting framework to incorporate a Schrödinger-type evolution operator, offering a principled pathway toward interpretable, stable, and anatomically consistent spatiotemporal prediction.

[397] Improving Neuropathological Reconstruction Fidelity via AI Slice Imputation

Marina Crespo Aguirre, Jonathan Williams-Ramirez, Dina Zemlyanker, Xiaoling Hu, Lucas J. Deden-Binder, Rogeny Herisse, Mark Montine, Theresa R. Connors, Christopher Mount, Christine L. MacDonald, C. Dirk Keene, Caitlin S. Latimer, Derek H. Oakley, Bradley T. Hyman, Ana Lawry Aguila, Juan Eugenio Iglesias

Main category: cs.CV

TL;DR: A super-resolution method for generating isotropic 3D brain volumes from anisotropic reconstructions of 2D dissection photographs, improving anatomical fidelity and segmentation accuracy.

Details

Motivation: Existing 3D reconstructions from 2D brain dissection photographs often produce coarse, overly smooth results, especially with thick tissue slabs (high anisotropy), limiting anatomical precision for neuropathological analysis.

Method: Introduces a computationally efficient super-resolution step that imputes additional slices to create anatomically consistent isotropic volumes from anisotropic 3D reconstructions. Uses domain-randomized synthetic data for training to ensure generalization across dissection protocols and robustness to large slab thicknesses.

Result: The imputed volumes yield improved automated segmentations with higher Dice scores, particularly in cortical and white matter regions. Validation shows more accurate cortical surfaces and MRI registration compared to previous methods.

Conclusion: The approach enhances resolution and anatomical fidelity of photograph-based brain reconstructions, strengthening the bridge between neuropathology and neuroimaging. The method is publicly available.

Abstract: Neuropathological analyses benefit from spatially precise volumetric reconstructions that enhance anatomical delineation and improve morphometric accuracy. Our prior work has shown the feasibility of reconstructing 3D brain volumes from 2D dissection photographs. However these outputs sometimes exhibit coarse, overly smooth reconstructions of structures, especially under high anisotropy (i.e., reconstructions from thick slabs). Here, we introduce a computationally efficient super-resolution step that imputes slices to generate anatomically consistent isotropic volumes from anisotropic 3D reconstructions of dissection photographs. By training on domain-randomized synthetic data, we ensure that our method generalizes across dissection protocols and remains robust to large slab thicknesses. The imputed volumes yield improved automated segmentations, achieving higher Dice scores, particularly in cortical and white matter regions. Validation on surface reconstruction and atlas registration tasks demonstrates more accurate cortical surfaces and MRI registration. By enhancing the resolution and anatomical fidelity of photograph-based reconstructions, our approach strengthens the bridge between neuropathology and neuroimaging. Our method is publicly available at https://surfer.nmr.mgh.harvard.edu/fswiki/mri_3d_photo_recon

[398] HPC: Hierarchical Point-based Latent Representation for Streaming Dynamic Gaussian Splatting Compression

Yangzhi Ma, Bojun Liu, Wenting Liao, Dong Liu, Zhu Li, Li Li

Main category: cs.CV

TL;DR: HPC: Hierarchical point-based compression framework for streaming dynamic Gaussian Splatting that reduces storage by 67% while maintaining quality

Details

Motivation: Current streaming dynamic Gaussian Splatting compression methods have limitations: grid-based approaches waste parameters on unoccupied space, while point-based methods lack compactness due to poor local correlation exploitation. Need efficient compression for free-viewpoint video streaming.

Method: Proposes HPC with hierarchical point-based latent representation operating per-Gaussian to avoid redundancy. Uses tailored aggregation for compactness. First to compress neural networks for streaming dynamic Gaussian Splatting by exploiting inter-frame parameter correlations. End-to-end compression framework combining latent and network compression.

Result: Achieves 67% storage reduction compared to baseline while maintaining high reconstruction fidelity. Substantially outperforms state-of-the-art methods in comprehensive experimental evaluations.

Conclusion: HPC effectively addresses limitations of existing compression methods for dynamic Gaussian Splatting, providing efficient streaming transmission with small memory footprint while preserving rendering quality.

Abstract: While dynamic Gaussian Splatting has driven significant advances in free-viewpoint video, maintaining its rendering quality with a small memory footprint for efficient streaming transmission still presents an ongoing challenge. Existing streaming dynamic Gaussian Splatting compression methods typically leverage a latent representation to drive the neural network for predicting Gaussian residuals between frames. Their core latent representations can be categorized into structured grid-based and unstructured point-based paradigms. However, the former incurs significant parameter redundancy by inevitably modeling unoccupied space, while the latter suffers from limited compactness as it fails to exploit local correlations. To relieve these limitations, we propose HPC, a novel streaming dynamic Gaussian Splatting compression framework. It employs a hierarchical point-based latent representation that operates on a per-Gaussian basis to avoid parameter redundancy in unoccupied space. Guided by a tailored aggregation scheme, these latent points achieve high compactness with low spatial redundancy. To improve compression efficiency, we further undertake the first investigation to compress neural networks for streaming dynamic Gaussian Splatting through mining and exploiting the inter-frame correlation of parameters. Combined with latent compression, this forms a fully end-to-end compression framework. Comprehensive experimental evaluations demonstrate that HPC substantially outperforms state-of-the-art methods. It achieves a storage reduction of 67% against its baseline while maintaining high reconstruction fidelity.

[399] Video Understanding: Through A Temporal Lens

Thong Thanh Nguyen

Main category: cs.CV

TL;DR: Thesis proposes five contributions for video understanding: automatic annotation framework, recurrent adapters for temporal dynamics, State Space Layers for long-form modeling, contrastive learning for motion-moment relations, and empirical study on LVLMs identifying visual-language interface as bottleneck.

Details

Motivation: Existing video understanding methods have limitations in leveraging temporal relations among video elements. The work aims to advance video understanding by explicitly modeling temporal dynamics and addressing bottlenecks in current approaches.

Method: Five-fold approach: 1) Automatic annotation using large vision-language models with noise-robust contrastive learning; 2) Parameter-efficient fine-tuning with recurrent adapters; 3) Integration of State Space Layers for long-form video modeling; 4) Novel contrastive learning framework for motion-moment relations; 5) Comprehensive empirical study on LVLMs.

Result: Demonstrates that explicit temporal modeling significantly enhances model’s ability to represent and reason about video content. Introduces new long-term benchmarks for egocentric and feature-length content, and identifies visual-language interface as bottleneck for temporal reasoning.

Conclusion: Explicit temporal modeling is crucial for advanced video understanding. The proposed methods and frameworks effectively capture temporal dynamics and address key bottlenecks in current video understanding systems.

Abstract: This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an automatic annotation framework that utilizes large vision-language models and a noise-robust contrastive learning objective with a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy using “recurrent adapters” to capture temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by the introduction of two new long-term benchmarks for egocentric and feature-length content; (4) a novel contrastive learning framework designed to explicitly model fine-grained relations between motions and video moments; and (5) a comprehensive empirical study on Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning, leading to a new “temporal-oriented recipe” for upscaled video understanding. Collectively, these contributions demonstrate that explicit temporal modeling significantly enhances a model’s ability to represent and reason about the fluid nature of video content.

[400] V2X-DSC: Multi-Agent Collaborative Perception with Distributed Source Coding Guided Communication

Yuankun Zeng, Shaohui Li, Zhi Li, Shulan Ruan, Yu Liu, You He

Main category: cs.CV

TL;DR: V2X-DSC: A distributed source coding framework for bandwidth-constrained collaborative perception in V2X networks that compresses BEV features by transmitting only complementary information beyond local context.

Details

Motivation: Collaborative perception requires sharing intermediate features between agents, but dense BEV features saturate V2X communication links. Since collaborators view the same physical world, their features are strongly correlated, suggesting receivers only need innovation beyond their local context.

Method: Proposes V2X-DSC with a Conditional Codec (DCC) framework. Senders compress BEV features into compact codes, while receivers perform conditional reconstruction using local features as side information. This allocates bits to complementary cues rather than redundant content, regularizing learning and encouraging incremental representation.

Result: Experiments on DAIR-V2X, OPV2V, and V2X-Real datasets demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication. The framework generalizes as a plug-and-play communication layer across multiple fusion backbones.

Conclusion: V2X-DSC effectively addresses bandwidth constraints in collaborative perception by leveraging distributed source coding principles, enabling efficient feature fusion while maintaining high accuracy through conditional reconstruction of complementary information.

Abstract: Collaborative perception improves 3D understanding by fusing multi-agent observations, yet intermediate-feature sharing faces strict bandwidth constraints as dense BEV features saturate V2X links. We observe that collaborators view the same physical world, making their features strongly correlated; thus receivers only need innovation beyond their local context. Revisiting this from a distributed source coding perspective, we propose V2X-DSC, a framework with a Conditional Codec (DCC) for bandwidth-constrained fusion. The sender compresses BEV features into compact codes, while the receiver performs conditional reconstruction using its local features as side information, allocating bits to complementary cues rather than redundant content. This conditional structure regularizes learning, encouraging incremental representation and yielding lower-noise features. Experiments on DAIR-V2X, OPV2V, and V2X-Real demonstrate state-of-the-art accuracy-bandwidth trade-offs under KB-level communication, and generalizes as a plug-and-play communication layer across multiple fusion backbones.

[401] JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, Xiaodong He

Main category: cs.CV

TL;DR: JoyAvatar: A framework for generating long-duration avatar videos with enhanced text controllability for complex motions, camera movements, and background transitions while maintaining audio-visual synchronization.

Details

Motivation: Existing video avatar models have limited alignment with text instructions, especially for complex elements like full-body movement, dynamic camera trajectories, background transitions, and human-object interactions.

Method: 1) Twin-teacher enhanced training algorithm to transfer text-controllability from foundation models while learning audio-visual synchronization. 2) Dynamic modulation of multi-modal condition strengths (audio/text) based on denoising timesteps to mitigate conflicts between heterogeneous conditioning signals.

Result: Outperforms state-of-the-art models like Omnihuman-1.5 and KlingAvatar 2.0 in GSB evaluation. Enables complex applications including multi-person dialogues and non-human subjects role-playing.

Conclusion: JoyAvatar substantially expands avatar models’ capacity to generate natural, temporally coherent full-body motions and dynamic camera movements while preserving basic avatar capabilities like accurate lip-sync and identity consistency.

Abstract: Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, the majority of these methods exhibit limited alignment with respect to text instructions, particularly when the prompts involve complex elements including large full-body movement, dynamic camera trajectory, background transitions, or human-object interactions. To break out this limitation, we present JoyAvatar, a framework capable of generating long duration avatar videos, featuring two key technical innovations. Firstly, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Secondly, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on the distinct denoising timestep, aiming to mitigate conflicts between the heterogeneous conditioning signals. These two key designs serve to substantially expand the avatar model’s capacity to generate natural, temporally coherent full-body motions and dynamic camera movements as well as preserve the basic avatar capabilities, such as accurate lip-sync and identity consistency. GSB evaluation results demonstrate that our JoyAvatar model outperforms the state-of-the-art models such as Omnihuman-1.5 and KlingAvatar 2.0. Moreover, our approach enables complex applications including multi-person dialogues and non-human subjects role-playing. Some video samples are provided on https://joyavatar.github.io/.

[402] StomataSeg: Semi-Supervised Instance Segmentation for Sorghum Stomatal Components

Zhongtian Huang, Zhi Chen, Zi Huang, Xin Yu, Daniel Smith, Chaitanya Purushothama, Erik Van Oosterom, Alex Wu, William Salter, Yan Li, Scott Chapman

Main category: cs.CV

TL;DR: A semi-supervised instance segmentation framework for sorghum stomata analysis using patch-based preprocessing and pseudo-labeling to improve segmentation of tiny stomatal structures.

Details

Motivation: Sorghum is important for climate-resilient agriculture, and automated stomatal analysis is needed for high-throughput phenotyping, but current methods struggle with tiny structures (less than 40μm) and annotation bottlenecks.

Method: Semi-supervised instance segmentation framework with patch-based preprocessing (splitting high-res images into overlapping small patches) and pseudo-labeling strategy applied to unannotated images, creating additional pseudo-labeled patches.

Result: Substantial performance gains: semantic models improved from 65.93% to 70.35% mIoU; instance models improved from 28.30% to 46.10% AP. Created dataset with 11,060 human-annotated patches and 56,428 pseudo-labeled patches.

Conclusion: Combining patch-based preprocessing with semi-supervised learning significantly improves segmentation of fine stomatal structures, supporting scalable extraction of stomatal traits for AI-driven phenotyping in crop science.

Abstract: Sorghum is a globally important cereal grown widely in water-limited and stress-prone regions. Its strong drought tolerance makes it a priority crop for climate-resilient agriculture. Improving water-use efficiency in sorghum requires precise characterisation of stomatal traits, as stomata control of gas exchange, transpiration and photosynthesis have a major influence on crop performance. Automated analysis of sorghum stomata is difficult because the stomata are small (often less than 40 $μ$m in length in grasses such as sorghum) and vary in shape across genotypes and leaf surfaces. Automated segmentation contributes to high-throughput stomatal phenotyping, yet current methods still face challenges related to nested small structures and annotation bottlenecks. In this paper, we propose a semi-supervised instance segmentation framework tailored for analysis of sorghum stomatal components. We collect and annotate a sorghum leaf imagery dataset containing 11,060 human-annotated patches, covering the three stomatal components (pore, guard cell and complex area) across multiple genotypes and leaf surfaces. To improve the detection of tiny structures, we split high-resolution microscopy images into overlapping small patches. We then apply a pseudo-labelling strategy to unannotated images, producing an additional 56,428 pseudo-labelled patches. Benchmarking across semantic and instance segmentation models shows substantial performance gains: for semantic models the top mIoU increases from 65.93% to 70.35%, whereas for instance models the top AP rises from 28.30% to 46.10%. These results demonstrate that combining patch-based preprocessing with semi-supervised learning significantly improves the segmentation of fine stomatal structures. The proposed framework supports scalable extraction of stomatal traits and facilitates broader adoption of AI-driven phenotyping in crop science.

[403] Supervised makeup transfer with a curated dataset: Decoupling identity and makeup features for enhanced transformation

Qihe Pan, Yiming Wu, Xing Zhao, Liang Xie, Guodao Sun, Ronghua Liang

Main category: cs.CV

TL;DR: A diffusion-based makeup transfer framework with curated dataset, disentangled identity/makeup features, and text-guided control for region-specific makeup application.

Details

Motivation: Existing makeup transfer methods suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability, which diffusion models can address with better stability than GAN-based approaches.

Method: Three contributions: 1) curated high-quality dataset using train-generate-filter-retrain strategy; 2) diffusion-based framework that disentangles identity and makeup features; 3) text-guided mechanism for fine-grained, region-specific control using natural language prompts.

Result: Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility compared to existing methods.

Conclusion: The proposed diffusion-based framework with curated dataset and text-guided control effectively addresses limitations in makeup transfer, offering better stability, disentanglement, and controllability than previous approaches.

Abstract: Diffusion models have recently shown strong progress in generative tasks, offering a more stable alternative to GAN-based approaches for makeup transfer. Existing methods often suffer from limited datasets, poor disentanglement between identity and makeup features, and weak controllability. To address these issues, we make three contributions. First, we construct a curated high-quality dataset using a train-generate-filter-retrain strategy that combines synthetic, realistic, and filtered samples to improve diversity and fidelity. Second, we design a diffusion-based framework that disentangles identity and makeup features, ensuring facial structure and skin tone are preserved while applying accurate and diverse cosmetic styles. Third, we propose a text-guided mechanism that allows fine-grained and region-specific control, enabling users to modify eyes, lips, or face makeup with natural language prompts. Experiments on benchmarks and real-world scenarios demonstrate improvements in fidelity, identity preservation, and flexibility. Examples of our dataset can be found at: https://makeup-adapter.github.io.

[404] Diffusion-Driven Inter-Outer Surface Separation for Point Clouds with Open Boundaries

Zhengyan Qin, Liyuan Qiu

Main category: cs.CV

TL;DR: A diffusion-based method for separating inter and outer layer surfaces from double-layered point clouds caused by TSDF truncation artifacts in 3D reconstruction.

Details

Motivation: Addresses the "double surface artifact" in TSDF fusion where asymmetric truncation thresholds create erroneous inter and outer shells in reconstructed point clouds, particularly problematic for indoor scene modeling and medical imaging applications.

Method: Uses a diffusion-based algorithm to separate inter and outer layer surfaces from double-layered point clouds, focusing on open-boundary models (surfaces with topological holes) rather than missing surface regions. The method works as a lightweight post-hoc module after TSDF fusion.

Result: Achieves extraction of the inter layer from 20,000 inter and 20,000 outer points in approximately 10 seconds, handling both watertight and open-boundary surface geometries effectively.

Conclusion: Provides a practical solution for accurate surface representation in applications like indoor scene modeling and medical imaging where double-layered point clouds are common, without replacing full reconstruction pipelines.

Abstract: We propose a diffusion-based algorithm for separating the inter and outer layer surfaces from double-layered point clouds, particularly those exhibiting the “double surface artifact” caused by truncation in Truncated Signed Distance Function (TSDF) fusion during indoor or medical 3D reconstruction. This artifact arises from asymmetric truncation thresholds, leading to erroneous inter and outer shells in the fused volume, which our method addresses by extracting the true inter layer to mitigate challenges like overlapping surfaces and disordered normals. We focus on point clouds with \emph{open boundaries} (i.e., sampled surfaces with topological openings/holes through which particles may escape), rather than point clouds with \emph{missing surface regions} where no samples exist. Our approach enables robust processing of both watertight and open-boundary models, achieving extraction of the inter layer from 20,000 inter and 20,000 outer points in approximately 10 seconds. This solution is particularly effective for applications requiring accurate surface representations, such as indoor scene modeling and medical imaging, where double-layered point clouds are prevalent, and it accommodates both closed (watertight) and open-boundary surface geometries. Our goal is \emph{post-hoc} inter/outer shell separation as a lightweight module after TSDF fusion; we do not aim to replace full variational or learning-based reconstruction pipelines.

[405] HSI-VAR: Rethinking Hyperspectral Restoration through Spatial-Spectral Visual Autoregression

Xiangming Wang, Benteng Sun, Yungeng Liu, Haijin Zeng, Yongyong Chen, Jingyong Su, Jie Liu

Main category: cs.CV

TL;DR: HSI-VAR: An autoregressive approach for hyperspectral image restoration that addresses composite degradations (noise, blur, missing bands) with improved efficiency and quality compared to diffusion models.

Details

Motivation: Real-world hyperspectral images suffer from composite degradations, but existing methods are either computationally expensive (diffusion models) or produce oversmoothed results (regression models). There's a need for efficient, high-quality HSI restoration.

Method: Frames HSI restoration as autoregressive generation with three innovations: latent-condition alignment for semantic consistency, degradation-aware guidance for encoding mixed degradations, and spatial-spectral adaptation module for cross-domain refinement.

Result: Achieves state-of-the-art performance with 3.77 dB PSNR improvement on ICVL benchmark, 95.5× inference speed-up compared to diffusion methods, and 50% computational cost reduction at inference.

Conclusion: HSI-VAR provides a practical solution for real-world HSI restoration by balancing computational efficiency with high-quality structural preservation through autoregressive modeling.

Abstract: Hyperspectral images (HSIs) capture richer spatial-spectral information beyond RGB, yet real-world HSIs often suffer from a composite mix of degradations, such as noise, blur, and missing bands. Existing generative approaches for HSI restoration like diffusion models require hundreds of iterative steps, making them computationally impractical for high-dimensional HSIs. While regression models tend to produce oversmoothed results, failing to preserve critical structural details. We break this impasse by introducing HSI-VAR, rethinking HSI restoration as an autoregressive generation problem, where spectral and spatial dependencies can be progressively modeled rather than globally reconstructed. HSI-VAR incorporates three key innovations: (1) Latent-condition alignment, which couples semantic consistency between latent priors and conditional embeddings for precise reconstruction; (2) Degradation-aware guidance, which uniquely encodes mixed degradations as linear combinations in the embedding space for automatic control, remarkably achieving a nearly $50%$ reduction in computational cost at inference; (3) A spatial-spectral adaptation module that refines details across both domains in the decoding phase. Extensive experiments on nine all-in-one HSI restoration benchmarks confirm HSI-VAR’s state-of-the-art performance, achieving a 3.77 dB PSNR improvement on \textbf{\textit{ICVL}} and offering superior structure preservation with an inference speed-up of up to $95.5 \times$ compared with diffusion-based methods, making it a highly practical solution for real-world HSI restoration.

[406] Evaluating Deep Learning-Based Nerve Segmentation in Brachial Plexus Ultrasound Under Realistic Data Constraints

Dylan Yves, Khush Agarwal, Jonathan Hoyin Chan, Patcharapit Promoppatum, Aroonkamon Pattanasiricharoen

Main category: cs.CV

TL;DR: Deep learning-based nerve segmentation in ultrasound images using U-Net architecture, evaluating dataset composition, annotation strategies, and multi-class supervision effects on performance.

Details

Motivation: Accurate nerve localization is critical for ultrasound-guided regional anesthesia but challenging due to low image contrast, speckle noise, and anatomical variability. Need robust automated segmentation systems.

Method: U-Net architecture for nerve segmentation in ultrasound images of brachial plexus. Evaluates effects of: 1) training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5), 2) extending binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle), and 3) correlation between nerve size and segmentation accuracy.

Result: 1) Multi-machine training provides regularization benefits for lower-performing sources but doesn’t surpass single-source training when matched to target domain. 2) Multi-class supervision decreases nerve-specific Dice scores by 9-61% due to class imbalance and boundary ambiguity. 3) Moderate positive correlation between nerve size and segmentation accuracy (Pearson r=0.587, p<0.001), indicating smaller nerves remain challenging.

Conclusion: Provides methodological guidance for developing robust ultrasound nerve segmentation systems under clinical data constraints. Highlights trade-offs in dataset composition, annotation strategies, and challenges with small anatomical structures.

Abstract: Accurate nerve localization is critical for the success of ultrasound-guided regional anesthesia, yet manual identification remains challenging due to low image contrast, speckle noise, and inter-patient anatomical variability. This study evaluates deep learning-based nerve segmentation in ultrasound images of the brachial plexus using a U-Net architecture, with a focus on how dataset composition and annotation strategy influence segmentation performance. We find that training on combined data from multiple ultrasound machines (SIEMENS ACUSON NX3 Elite and Philips EPIQ5) provides regularization benefits for lower-performing acquisition sources, though it does not surpass single-source training when matched to the target domain. Extending the task from binary nerve segmentation to multi-class supervision (artery, vein, nerve, muscle) results in decreased nerve-specific Dice scores, with performance drops ranging from 9% to 61% depending on dataset, likely due to class imbalance and boundary ambiguity. Additionally, we observe a moderate positive correlation between nerve size and segmentation accuracy (Pearson r=0.587, p<0.001), indicating that smaller nerves remain a primary challenge. These findings provide methodological guidance for developing robust ultrasound nerve segmentation systems under realistic clinical data constraints.

[407] DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin

Main category: cs.CV

TL;DR: DVLA-RL is a few-shot learning method that uses dual-level vision-language alignment with reinforcement learning gating to improve cross-modal semantic alignment from low-level attributes to high-level descriptions.

Details

Motivation: Current few-shot learning approaches that incorporate LLMs overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains.

Method: Proposes Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL) consisting of: 1) Dual-level Semantic Construction (DSC) that generates discriminative attributes from class names and support samples, and 2) RL-gated Attention (RLA) that formulates cross-modal fusion as a sequential decision process using episodic REINFORCE to adaptively adjust self-attention and cross-attention contributions.

Result: Achieves new state-of-the-art performance across nine benchmarks in three diverse few-shot learning scenarios.

Conclusion: DVLA-RL enables more precise cross-modal alignment by refining local attributes in shallow layers and emphasizing global semantics in deep layers, achieving class-specific discrimination and generalized representations with few support samples.

Abstract: Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.

[408] Any3D-VLA: Enhancing VLA Robustness via Diverse Point Clouds

Xianzhe Fan, Shengliang Deng, Xiaoyang Wu, Yuxiang Lu, Zhuoling Li, Mi Yan, Yujia Zhang, Zhizheng Zhang, He Wang, Hengshuang Zhao

Main category: cs.CV

TL;DR: Any3D-VLA enhances Vision-Language-Action models by incorporating 3D point cloud representations alongside 2D images to improve spatial understanding in complex scenes, addressing domain gaps through unified training on diverse 3D data sources.

Details

Motivation: Existing VLA models rely solely on 2D images, limiting spatial understanding in complex 3D environments. The paper aims to incorporate 3D information to enhance VLA capabilities and address challenges of scarce 3D data and domain gaps across different environments.

Method: Proposes Any3D-VLA which: 1) Explicitly lifts visual input into point clouds, 2) Unifies simulator, sensor, and model-estimated point clouds in training pipeline, 3) Constructs diverse inputs, 4) Learns domain-agnostic 3D representations fused with corresponding 2D representations.

Result: Simulation and real-world experiments demonstrate Any3D-VLA’s advantages in improving performance and mitigating domain gaps. Point cloud representations better complement 2D representations for enhanced spatial understanding.

Conclusion: Incorporating 3D point cloud representations significantly enhances VLA model capabilities for complex spatial understanding tasks, with the unified training approach effectively addressing data scarcity and domain gap challenges.

Abstract: Existing Vision-Language-Action (VLA) models typically take 2D images as visual input, which limits their spatial understanding in complex scenes. How can we incorporate 3D information to enhance VLA capabilities? We conduct a pilot study across different observation spaces and visual representations. The results show that explicitly lifting visual input into point clouds yields representations that better complement their corresponding 2D representations. To address the challenges of (1) scarce 3D data and (2) the domain gap induced by cross-environment differences and depth-scale biases, we propose Any3D-VLA. It unifies the simulator, sensor, and model-estimated point clouds within a training pipeline, constructs diverse inputs, and learns domain-agnostic 3D representations that are fused with the corresponding 2D representations. Simulation and real-world experiments demonstrate Any3D-VLA’s advantages in improving performance and mitigating the domain gap. Our project homepage is available at https://xianzhefan.github.io/Any3D-VLA.github.io.

[409] VVLoc: Prior-free 3-DoF Vehicle Visual Localization

Ze Huang, Zhongyang Xiao, Mingliang Song, Longan Yang, Hongyuan Yuan, Li Sun

Main category: cs.CV

TL;DR: VVLoc is a unified neural network pipeline for vehicle localization that simultaneously handles topological (similarity-based) and metric (precise coordinate) localization using multi-camera systems with confidence estimation.

Details

Motivation: Current localization methods for autonomous driving have limitations: they handle topological and metric localization separately, rely on single-camera setups, require additional 3D semantic or pose priors, and lack confidence quantification mechanisms, making them impractical for real industrial applications.

Method: VVLoc uses a single neural network with multi-camera input to first evaluate geo-proximity between visual observations (topological localization), then estimates relative metric poses using a matching strategy, while providing confidence measures. Training only requires visual data pairs with ground-truth poses, eliminating complex supplementary data needs.

Result: VVLoc achieves state-of-the-art localization accuracy across various tasks, demonstrated on both public datasets and a more challenging self-collected dataset, showing robust performance in real-world scenarios.

Conclusion: VVLoc provides a practical, unified solution for vehicle localization that combines topological and metric approaches with confidence estimation, using efficient training with minimal data requirements, making it suitable for industrial autonomous driving applications.

Abstract: Localization is a critical technology in autonomous driving, encompassing both topological localization, which identifies the most similar map keyframe to the current observation, and metric localization, which provides precise spatial coordinates. Conventional methods typically address these tasks independently, rely on single-camera setups, and often require additional 3D semantic or pose priors, while lacking mechanisms to quantify the confidence of localization results, making them less feasible for real industrial applications. In this paper, we propose VVLoc, a unified pipeline that employs a single neural network to concurrently achieve topological and metric vehicle localization using multi-camera system. VVLoc first evaluates the geo-proximity between visual observations, then estimates their relative metric poses using a matching strategy, while also providing a confidence measure. Additionally, the training process for VVLoc is highly efficient, requiring only pairs of visual data and corresponding ground-truth poses, eliminating the need for complex supplementary data. We evaluate VVLoc not only on the publicly available datasets, but also on a more challenging self-collected dataset, demonstrating its ability to deliver state-of-the-art localization accuracy across a wide range of localization tasks.

[410] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

Tong Wang, Yunhan Zhao, Shu Kong

Main category: cs.CV

TL;DR: Paracosm is a training-free zero-shot composed image retrieval method that generates synthetic “mental images” from multimodal queries using LMMs, then matches these to synthetic counterparts of database images to overcome domain gaps.

Details

Motivation: Current zero-shot CIR methods use LMMs to generate textual descriptions from multimodal queries, then match text to images via VLMs. This indirect approach has limitations since the "mental image" is only implicitly defined. The authors propose directly generating the mental image for more accurate matching.

Method: Paracosm prompts an LMM to generate a synthetic “mental image” from the multimodal query (reference image + modification text). To address synthetic-to-real domain gaps, it also generates synthetic counterparts for each real image in the database. Matching occurs in this synthetic “paracosm” space between generated mental images and synthetic database images.

Result: Paracosm significantly outperforms existing zero-shot methods on four challenging CIR benchmarks, achieving state-of-the-art performance for zero-shot composed image retrieval.

Conclusion: Directly generating mental images for composed image retrieval is more effective than generating textual descriptions, and operating in a synthetic domain space helps overcome domain gaps between generated and real images.

Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a mental image'', based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this mental image’’ is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search the target image. In contrast, we address CIR from first principles by directly generating the mental image'' for more accurate matching. Particularly, we prompt an LMM to generate a mental image’’ for a given multimodal query and propose to use this mental image'' to search for the target image. As the mental image’’ has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses LMM to construct a ``paracosm’’, where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.

[411] Edge-Native Generative De-identification: Inversion-Free Flow for Privacy-Preserving Federated Skin Image Analysis

Konstantinos Moutselos, Ilias Maglogiannis

Main category: cs.CV

TL;DR: A framework for privacy-preserving federated learning in clinical dermatology that uses inversion-free Rectified Flow Transformers to generate identity-agnostic synthetic skin images while preserving pathological features, enabling secure edge deployment.

Details

Motivation: Federated Learning for clinical dermatology faces privacy-preservation vs. diagnostic feature retention trade-offs. Traditional de-identification degrades pathology, while standard generative methods are too computationally intensive for edge devices.

Method: Proposes identity-agnostic pathology preservation using inversion-free Rectified Flow Transformers (FlowEdit) for near real-time identity transformation. Introduces “Segment-by-Synthesis” to generate counterfactual healthy/pathological twin pairs locally, extracting differential erythema masks decoupled from biometric markers.

Result: Achieves high-fidelity identity transformation in <20s, enabling local deployment. Demonstrates IoU stability >0.67 across synthetic identities on high-resolution clinical samples. Provides privacy-compliant synthetic surrogates that mitigate gradient leakage risk.

Conclusion: The framework enables secure, privacy-preserving federated learning for skin image analysis by generating synthetic surrogates at the edge, balancing privacy protection with diagnostic feature preservation.

Abstract: The deployment of Federated Learning (FL) for clinical dermatology is hindered by the competing requirements of protecting patient privacy and preserving diagnostic features. Traditional de-identification methods often degrade pathological fidelity, while standard generative editing techniques rely on computationally intensive inversion processes unsuitable for resource-constrained edge devices. We propose a framework for identity-agnostic pathology preservation that serves as a client-side privacy-preserving utility. By leveraging inversion-free Rectified Flow Transformers (FlowEdit), the system performs high-fidelity identity transformation in near real-time (less than 20s), facilitating local deployment on clinical nodes. We introduce a “Segment-by-Synthesis” mechanism that generates counterfactual healthy and pathological twin pairs locally. This enables the extraction of differential erythema masks that are decoupled from biometric markers and semantic artifacts (e.g. jewelry). Pilot validation on high-resolution clinical samples demonstrates an Intersection over Union (IoU) stability greater than 0.67 across synthetic identities. By generating privacy-compliant synthetic surrogates at the edge, this framework mitigates the risk of gradient leakage at the source, providing a secure pathway for high-precision skin image analysis in federated environments.

[412] TransNormal: Dense Visual Semantics for Diffusion-based Transparent Object Normal Estimation

Mingwei Li, Hehe Fan, Yi Yang

Main category: cs.CV

TL;DR: TransNormal: A diffusion-based framework for monocular normal estimation of transparent objects using DINOv3 semantics and wavelet regularization, with a new synthetic dataset for transparent labware.

Details

Motivation: Transparent object normal estimation is crucial for laboratory automation but challenging due to complex light refraction/reflection that causes failures in conventional sensors, hindering embodied AI deployment in scientific environments.

Method: Adapts pre-trained diffusion priors for single-step normal regression, integrates dense visual semantics from DINOv3 via cross-attention for geometric cues, uses multi-task learning objective and wavelet-based regularization to preserve fine structural details.

Result: Significantly outperforms SOTA: on ClearGrasp benchmark reduces mean error by 24.4% and improves 11.25° accuracy by 22.8%; on ClearPose achieves 15.2% reduction in mean error.

Conclusion: TransNormal effectively addresses transparent object normal estimation challenges through diffusion priors enhanced with semantic features and regularization, enabling better robotic manipulation in laboratory settings.

Abstract: Monocular normal estimation for transparent objects is critical for laboratory automation, yet it remains challenging due to complex light refraction and reflection. These optical properties often lead to catastrophic failures in conventional depth and normal sensors, hindering the deployment of embodied AI in scientific environments. We propose TransNormal, a novel framework that adapts pre-trained diffusion priors for single-step normal regression. To handle the lack of texture in transparent surfaces, TransNormal integrates dense visual semantics from DINOv3 via a cross-attention mechanism, providing strong geometric cues. Furthermore, we employ a multi-task learning objective and wavelet-based regularization to ensure the preservation of fine-grained structural details. To support this task, we introduce TransNormal-Synthetic, a physics-based dataset with high-fidelity normal maps for transparent labware. Extensive experiments demonstrate that TransNormal significantly outperforms state-of-the-art methods: on the ClearGrasp benchmark, it reduces mean error by 24.4% and improves 11.25° accuracy by 22.8%; on ClearPose, it achieves a 15.2% reduction in mean error. The code and dataset will be made publicly available at https://longxiang-ai.github.io/TransNormal.

[413] Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

Main category: cs.CV

TL;DR: Training-free VPR using second-order geometric statistics on SPD manifold for robust place recognition without supervision.

Details

Motivation: Current VPR methods either need extensive supervision or use simplistic statistics, missing structural correlations and geometric stability.

Method: Represent scenes as covariance descriptors on SPD manifold, treat perturbations as congruence transformations, use Riemannian mappings to linearize into Euclidean space, all training-free on fixed backbones.

Result: Achieves highly competitive performance against SOTA baselines, excels in challenging zero-shot scenarios without parameter updates.

Conclusion: Second-order geometric statistics framework provides robust, training-free VPR with strong generalization via geometric stability modeling on SPD manifold.

Abstract: Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.

[414] Distill3R: A Pipeline for Democratizing 3D Foundation Models on Commodity Hardware

Brandon Leblanc, Charalambos Poullis

Main category: cs.CV

TL;DR: Distill3R: A framework for distilling 3D foundation models into compact students trainable on single workstations, enabling democratized 3D vision research without massive compute clusters.

Details

Motivation: Large-scale 3D foundation models require massive computational clusters for training, creating barriers for academic labs without access to such resources. There's a need to democratize 3D vision research by making it accessible to labs with limited compute.

Method: Two key innovations: (1) Offline caching pipeline that decouples heavy teacher inference from training through compressed supervision signals, and (2) Confidence-aware distillation loss that leverages teacher uncertainty for training on commodity hardware. Proposes a 72M-parameter student model.

Result: Student achieves 9x parameter reduction and 5x inference speedup compared to 650M-parameter teacher. Trainable in under 3 days on single workstation vs. teacher requiring massive GPU clusters for up to a week. Preserves structural consistency and geometric understanding for functional 3D awareness.

Conclusion: Distill3R provides accessible research baseline for labs without large-scale compute, enabling domain-specific training at minimal cost. Not intended to compete with SOTA foundation models, but to democratize 3D vision research and enable efficient edge deployment.

Abstract: While multi-view 3D reconstruction has shifted toward large-scale foundation models capable of inferring globally consistent geometry, their reliance on massive computational clusters for training has created a significant barrier to entry for most academic laboratories. To bridge this compute divide, we introduce Distill3R, a framework designed to distill the geometric reasoning of 3D foundation models into compact students fully trainable on a single workstation. Our methodology centers on two primary innovations: (1) an offline caching pipeline that decouples heavy teacher inference from the training loop through compressed supervision signals, and (2) a confidence-aware distillation loss that leverages teacher uncertainty to enable training on commodity hardware. We propose a 72M-parameter student model which achieves a 9x reduction in parameters and a 5x inference speedup compared to its 650M-parameter teacher. The student is fully trainable in under 3 days on a single workstation, whereas its teacher requires massive GPU clusters for up to a week. We demonstrate that the student preserves the structural consistency and qualitative geometric understanding required for functional 3D awareness. By providing a reproducible, single-workstation training recipe, Distill3R serves as an exploratory entry point for democratized 3D vision research and efficient edge deployment. This work is not intended to compete with state-of-the-art foundation models, but to provide an accessible research baseline for laboratories without access to large-scale compute to train and specialize models on their own domain-specific data at minimal cost.

[415] DIAMOND: Directed Inference for Artifact Mitigation in Flow Matching Models

Alicja Polowczyk, Agnieszka Polowczyk, Piotr Borycki, Joanna Waczyńska, Jacek Tabor, Przemysław Spurek

Main category: cs.CV

TL;DR: DIAMOND is a training-free method that applies trajectory correction during inference to mitigate visual and anatomical artifacts in text-to-image generation, working without model weight modifications or additional training.

Details

Motivation: Current text-to-image models like FLUX still produce visual and anatomical artifacts that hinder practical use. Existing artifact reduction methods are post-hoc, require invasive model weight modifications, or are computationally expensive for regional refinement.

Method: DIAMOND uses trajectory correction during inference by reconstructing an estimate of the clean sample at every step of the generative trajectory, actively steering generation away from latent states that lead to artifacts. It’s training-free and extends to standard Diffusion Models.

Result: The method provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without additional training or weight modifications in modern generative architectures.

Conclusion: DIAMOND offers an effective training-free solution for artifact reduction in text-to-image generation by intervening during the core image formation process rather than working post-hoc.

Abstract: Despite impressive results from recent text-to-image models like FLUX, visual and anatomical artifacts remain a significant hurdle for practical and professional use. Existing methods for artifact reduction, typically work in a post-hoc manner, consequently failing to intervene effectively during the core image formation process. Notably, current techniques require problematic and invasive modifications to the model weights, or depend on a computationally expensive and time-consuming process of regional refinement. To address these limitations, we propose DIAMOND, a training-free method that applies trajectory correction to mitigate artifacts during inference. By reconstructing an estimate of the clean sample at every step of the generative trajectory, DIAMOND actively steers the generation process away from latent states that lead to artifacts. Furthermore, we extend the proposed method to standard Diffusion Models, demonstrating that DIAMOND provides a robust, zero-shot path to high-fidelity, artifact-free image synthesis without the need for additional training or weight modifications in modern generative architectures. Code is available at https://gmum.github.io/DIAMOND/

[416] OCTOPUS: Enhancing the Spatial-Awareness of Vision SSMs with Multi-Dimensional Scans and Traversal Selection

Kunal Mahatha, Ali Bahri, Pierre Marza, Sahar Dastani, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz, Christian Desrosiers

Main category: cs.CV

TL;DR: OCTOPUS introduces a novel state space model architecture for vision tasks that performs discrete recurrence along eight principal orientations to capture both global context and local spatial structure while maintaining linear complexity.

Details

Motivation: Standard state space models (SSMs) have limited success in vision tasks due to their causal formulation, which breaks spatial relationships among pixels/patches and fails to capture local spatial coherence while linking non-adjacent patches.

Method: OCTOPUS performs discrete recurrence along eight principal orientations (forward/backward in horizontal, vertical, and diagonal directions) to enable effective information exchange across all spatially connected regions while maintaining independence among unrelated patches.

Result: OCTOPUS demonstrates notable improvements in boundary preservation and region consistency in segmentation tasks, while maintaining relatively better classification accuracy compared to existing V-SSM based models.

Conclusion: OCTOPUS appears as a foundation method for multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.

Abstract: State space models (SSMs) have recently emerged as an alternative to transformers due to their unique ability of modeling global relationships in text with linear complexity. However, their success in vision tasks has been limited due to their causal formulation, which is suitable for sequential text but detrimental in the spatial domain where causality breaks the inherent spatial relationships among pixels or patches. As a result, standard SSMs fail to capture local spatial coherence, often linking non-adjacent patches while ignoring neighboring ones that are visually correlated. To address these limitations, we introduce OCTOPUS , a novel architecture that preserves both global context and local spatial structure within images, while maintaining the linear complexity of SSMs. OCTOPUS performs discrete reoccurrence along eight principal orientations, going forward or backward in the horizontal, vertical, and diagonal directions, allowing effective information exchange across all spatially connected regions while maintaining independence among unrelated patches. This design enables multi-directional recurrence, capturing both global context and local spatial structure with SSM-level efficiency. In our classification and segmentation benchmarks, OCTOPUS demonstrates notable improvements in boundary preservation and region consistency, as evident from the segmentation results, while maintaining relatively better classification accuracy compared to existing V-SSM based models. These results suggest that OCTOPUS appears as a foundation method for multi-directional recurrence as a scalable and effective mechanism for building spatially aware and computationally efficient vision architectures.

Dhruv Parikh, Haoyang Fan, Rajgopal Kannan, Viktor Prasanna

Main category: cs.CV

TL;DR: ConsensusDrop: Training-free framework that fuses vision encoder saliency with LLM cross-attention to efficiently reduce visual tokens in VLMs while preserving accuracy

Details

Motivation: Vision-Language Models are expensive due to LLMs processing hundreds of redundant visual tokens. Existing token reduction methods use either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse/costly), but neither alone is sufficient for optimal performance.

Method: Proposes ConsensusDrop, a training-free framework that derives consensus ranking by reconciling vision encoder saliency with query-aware cross-attention. Retains most informative tokens while compressing remainder via encoder-guided token merging.

Result: Outperforms prior pruning methods under identical token budgets across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs. Delivers stronger accuracy-efficiency Pareto frontier, preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint.

Conclusion: ConsensusDrop provides an effective training-free approach for visual token reduction in VLMs by fusing complementary vision and language signals, achieving better efficiency-accuracy trade-offs than unimodal methods.

Abstract: Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit \textit{either} vision-encoder saliency (broad but query-agnostic) \textit{or} LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available \emph{inside} the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose \textbf{ConsensusDrop}, a training-free framework that derives a \emph{consensus} ranking by reconciling vision encoder saliency with query-aware cross-attention, retaining the most informative tokens while compressing the remainder via encoder-guided token merging. Across LLaVA-1.5/NeXT, Video-LLaVA, and other open-source VLMs, ConsensusDrop consistently outperforms prior pruning methods under identical token budgets and delivers a stronger accuracy-efficiency Pareto frontier – preserving near-baseline accuracy even at aggressive token reductions while reducing TTFT and KV cache footprint. Our code will be open-sourced.

[418] Data Augmentation for High-Fidelity Generation of CAR-T/NK Immunological Synapse Images

Xiang Zhang, Boxuan Zhang, Alireza Naghizadeh, Mohab Mohamed, Dongfang Liu, Ruixiang Tang, Dimitris Metaxas, Dongfang Liu

Main category: cs.CV

TL;DR: The paper presents two complementary data augmentation frameworks (IAAA and SAAA) to generate synthetic CAR-T/NK immunological synapse images and segmentation masks, addressing limited annotated microscopy datasets for improving AI-based IS quantification in cancer immunotherapy.

Details

Motivation: Limited size of annotated microscopy datasets restricts the ability of artificial neural networks to generalize for accurate detection and segmentation of CAR-T/NK immunological synapse structures, which are important functional biomarkers for predicting therapeutic efficacy in cancer immunotherapy.

Method: Two complementary data-augmentation frameworks: 1) Instance Aware Automatic Augmentation (IAAA) - automated, instance-preserving augmentation method that generates synthetic images and masks by applying optimized augmentation policies; 2) Semantic-Aware AI Augmentation (SAAA) - combines diffusion-based mask generator with Pix2Pix conditional image synthesizer to create diverse, anatomically realistic segmentation masks and corresponding high-fidelity images.

Result: The augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance, enhancing robustness and accuracy of IS quantification.

Conclusion: This work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy by addressing data limitations through advanced synthetic data generation techniques.

Abstract: Chimeric antigen receptor (CAR)-T and NK cell immunotherapies have transformed cancer treatment, and recent studies suggest that the quality of the CAR-T/NK cell immunological synapse (IS) may serve as a functional biomarker for predicting therapeutic efficacy. Accurate detection and segmentation of CAR-T/NK IS structures using artificial neural networks (ANNs) can greatly increase the speed and reliability of IS quantification. However, a persistent challenge is the limited size of annotated microscopy datasets, which restricts the ability of ANNs to generalize. To address this challenge, we integrate two complementary data-augmentation frameworks. First, we employ Instance Aware Automatic Augmentation (IAAA), an automated, instance-preserving augmentation method that generates synthetic CAR-T/NK IS images and corresponding segmentation masks by applying optimized augmentation policies to original IS data. IAAA supports multiple imaging modalities (e.g., fluorescence and brightfield) and can be applied directly to CAR-T/NK IS images derived from patient samples. In parallel, we introduce a Semantic-Aware AI Augmentation (SAAA) pipeline that combines a diffusion-based mask generator with a Pix2Pix conditional image synthesizer. This second method enables the creation of diverse, anatomically realistic segmentation masks and produces high-fidelity CAR-T/NK IS images aligned with those masks, further expanding the training corpus beyond what IAAA alone can provide. Together, these augmentation strategies generate synthetic images whose visual and structural properties closely match real IS data, significantly improving CAR-T/NK IS detection and segmentation performance. By enhancing the robustness and accuracy of IS quantification, this work supports the development of more reliable imaging-based biomarkers for predicting patient response to CAR-T/NK immunotherapy.

[419] Hybrid Topological and Deep Feature Fusion for Accurate MRI-Based Alzheimer’s Disease Severity Classification

Faisal Ahmed

Main category: cs.CV

TL;DR: A hybrid deep learning framework combining Topological Data Analysis (TDA) with DenseNet121 for four-class Alzheimer’s disease classification from structural MRI data, achieving state-of-the-art performance.

Details

Motivation: Early and accurate diagnosis of Alzheimer's disease remains challenging in neuroimaging-based clinical systems. The authors aim to capture complementary topological brain characteristics that conventional neural networks often overlook, enhancing class separability across different AD stages.

Method: Proposes a novel hybrid framework integrating TDA with DenseNet121 backbone. TDA captures topological characteristics of brain structures, while DenseNet121 learns hierarchical spatial features from MRI slices. The extracted deep and topological features are fused for enhanced classification.

Result: Extensive experiments on OASIS-1 Kaggle MRI dataset show the TDA+DenseNet121 model significantly outperforms existing state-of-the-art approaches, achieving 99.93% accuracy and 100% AUC, surpassing CNN-based, transfer learning, ensemble, and multi-scale architectures.

Conclusion: The results confirm the effectiveness of incorporating topological insights into deep learning pipelines and highlight the potential of the proposed framework as a robust and highly accurate tool for automated Alzheimer’s disease diagnosis.

Abstract: Early and accurate diagnosis of Alzheimer’s disease (AD) remains a critical challenge in neuroimaging-based clinical decision support systems. In this work, we propose a novel hybrid deep learning framework that integrates Topological Data Analysis (TDA) with a DenseNet121 backbone for four-class Alzheimer’s disease classification using structural MRI data from the OASIS dataset. TDA is employed to capture complementary topological characteristics of brain structures that are often overlooked by conventional neural networks, while DenseNet121 efficiently learns hierarchical spatial features from MRI slices. The extracted deep and topological features are fused to enhance class separability across the four AD stages. Extensive experiments conducted on the OASIS-1 Kaggle MRI dataset demonstrate that the proposed TDA+DenseNet121 model significantly outperforms existing state-of-the-art approaches. The model achieves an accuracy of 99.93% and an AUC of 100%, surpassing recently published CNN-based, transfer learning, ensemble, and multi-scale architectures. These results confirm the effectiveness of incorporating topological insights into deep learning pipelines and highlight the potential of the proposed framework as a robust and highly accurate tool for automated Alzheimer’s disease diagnosis.

[420] Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

Meng Luo, Bobo Li, Shanqing Xu, Shize Zhang, Qiuchan Chen, Menglu Han, Wenhao Chen, Yanxiang Huang, Hao Fei, Mong-Li Lee, Wynne Hsu

Main category: cs.CV

TL;DR: HitEmotion introduces a Theory of Mind-grounded benchmark and reasoning framework to enhance emotional understanding in multimodal LLMs, addressing current limitations in affective intelligence.

Details

Motivation: Current multimodal large language models lack deep emotional understanding capabilities, which requires explicit modeling of Theory of Mind (ToM) - the cognitive substrate from which emotions arise.

Method: 1) HitEmotion benchmark with hierarchical cognitive depth levels; 2) ToM-guided reasoning chain tracking mental states and calibrating cross-modal evidence; 3) TMPO reinforcement learning using intermediate mental states as process-level supervision.

Result: HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. The proposed methods improve end-task accuracy and yield more faithful, coherent rationales.

Conclusion: The work provides a practical toolkit for evaluating and enhancing cognition-based emotional understanding capabilities in multimodal LLMs, bridging the gap between cognitive science and AI.

Abstract: Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. In conclusion, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs. Our dataset and code are available at: https://HitEmotion.github.io/.

[421] Navigating Simply, Aligning Deeply: Winning Solutions for Mouse vs. AI 2025

Phu-Hoa Pham, Chi-Nguyen Tran, Dao Sy Duy Minh, Nguyen Lam Phu Quy, Huynh Trung Kiet

Main category: cs.CV

TL;DR: Winning approaches for NeurIPS 2025 Mouse vs. AI competition: lightweight CNN with GLUs for visual robustness (95.4% score) and deep ResNet-like model for neural alignment (17.8M params), revealing optimal training at 200K steps.

Details

Motivation: Address critical challenges in visual robustness and neural alignment to develop artificial agents that can match biological vision systems, particularly for visuomotor learning tasks.

Method: For Track 1 (Visual Robustness): lightweight two-layer CNN enhanced by Gated Linear Units (GLUs) and observation normalization. For Track 2 (Neural Alignment): deep ResNet-like architecture with 16 convolutional layers and GLU-based gating. Systematic analysis of ten model checkpoints trained between 60K to 1.14M steps.

Result: Achieved 95.4% final score for visual robustness with lightweight CNN, and top-1 neural prediction performance for neural alignment with 17.8 million parameters. Found optimal training duration around 200K steps with non-monotonic relationship between training steps and performance.

Conclusion: Simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment, challenging conventional assumptions about model complexity in visuomotor learning and providing practical guidance for biologically-inspired visual agents.

Abstract: Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained between 60K to 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper models with increased capacity achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically-inspired visual agents.

[422] VAMOS-OCTA: Vessel-Aware Multi-Axis Orthogonal Supervision for Inpainting Motion-Corrupted OCT Angiography Volumes

Nick DiSanto, Ehsan Khodapanah Aghdam, Han Liu, Jacob Watson, Yuankai K. Tao, Hao Li, Ipek Oguz

Main category: cs.CV

TL;DR: VAMOS-OCTA: A deep learning framework using vessel-aware multi-axis supervision to inpaint motion-corrupted B-scans in handheld OCTA imaging, improving both cross-sectional B-scan quality and volumetric projections.

Details

Motivation: Handheld OCTA enables noninvasive retinal imaging in uncooperative subjects but suffers from severe motion artifacts during 3D acquisition, leading to unsampled retinal regions and blank bands in en face projections that degrade image quality.

Method: Proposes a 2.5D U-Net architecture that takes neighboring B-scans as input to reconstruct corrupted center B-scans, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss combining vessel-weighted intensity reconstruction with axial and lateral projection consistency.

Result: VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections, demonstrating superior performance on both synthetic and real-world corrupted volumes.

Conclusion: Multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data, enabling joint enhancement of both cross-sectional B-scan sharpness and volumetric projection accuracy even under severe motion corruptions.

Abstract: Handheld Optical Coherence Tomography Angiography (OCTA) enables noninvasive retinal imaging in uncooperative or pediatric subjects, but is highly susceptible to motion artifacts that severely degrade volumetric image quality. Sudden motion during 3D acquisition can lead to unsampled retinal regions across entire B-scans (cross-sectional slices), resulting in blank bands in en face projections. We propose VAMOS-OCTA, a deep learning framework for inpainting motion-corrupted B-scans using vessel-aware multi-axis supervision. We employ a 2.5D U-Net architecture that takes a stack of neighboring B-scans as input to reconstruct a corrupted center B-scan, guided by a novel Vessel-Aware Multi-Axis Orthogonal Supervision (VAMOS) loss. This loss combines vessel-weighted intensity reconstruction with axial and lateral projection consistency, encouraging vascular continuity in native B-scans and across orthogonal planes. Unlike prior work that focuses primarily on restoring the en face MIP, VAMOS-OCTA jointly enhances both cross-sectional B-scan sharpness and volumetric projection accuracy, even under severe motion corruptions. We trained our model on both synthetic and real-world corrupted volumes and evaluated its performance using both perceptual quality and pixel-wise accuracy metrics. VAMOS-OCTA consistently outperforms prior methods, producing reconstructions with sharp capillaries, restored vessel continuity, and clean en face projections. These results demonstrate that multi-axis supervision offers a powerful constraint for restoring motion-degraded 3D OCTA data. Our source code is available at https://github.com/MedICL-VU/VAMOS-OCTA.

[423] CortiNet: A Physics-Perception Hybrid Cortical-Inspired Dual-Stream Network for Gallbladder Disease Diagnosis from Ultrasound

Vagish Kumar, Souvik Chakraborty

Main category: cs.CV

TL;DR: CortiNet: A lightweight, cortical-inspired dual-stream neural network for gallbladder disease diagnosis from ultrasound images, combining physics-based signal decomposition with perception-driven feature learning.

Details

Motivation: Ultrasound imaging is widely used for gallbladder disease diagnosis but suffers from low resolution and speckle noise. Existing deep learning approaches use large CNNs that are difficult to deploy clinically, creating a need for lightweight yet accurate models.

Method: Proposes CortiNet, a dual-stream architecture inspired by human visual cortex parallel processing. Separates low-frequency structural information from high-frequency perceptual details through specialized encoding streams, using frequency-selective representations rather than raw pixels. Includes late-stage fusion and a structure-aware explainability framework using gradient-weighted class activation mapping only on the structural branch.

Result: Achieves 98.74% diagnostic accuracy on 10,692 expert-annotated images spanning nine gallbladder disease categories, with significantly reduced parameters compared to conventional deep convolutional models.

Conclusion: CortiNet demonstrates that lightweight, physics-inspired architectures can achieve high diagnostic accuracy for medical imaging tasks while being computationally efficient and interpretable, addressing deployment challenges in clinical settings.

Abstract: Ultrasound imaging is the primary diagnostic modality for detecting Gallbladder diseases due to its non-invasive nature, affordability, and wide accessibility. However, the low resolution and speckle noise inherent to ultrasound images hinder diagnostic reliability, prompting the use of large convolutional neural networks that are difficult to deploy in routine clinical settings. In this work, we propose CortiNet, a lightweight, cortical-inspired dual-stream neural architecture for gallbladder disease diagnosis that integrates physically interpretable multi-scale signal decomposition with perception-driven feature learning. Inspired by parallel processing pathways in the human visual cortex, CortiNet explicitly separates low-frequency structural information from high-frequency perceptual details and processes them through specialized encoding streams. By operating directly on structured, frequency-selective representations rather than raw pixel intensities, the architecture embeds strong physics-based inductive bias, enabling efficient feature learning with a significantly reduced parameter footprint. A late-stage cortical-style fusion mechanism integrates complementary structural and textural cues while preserving computational efficiency. Additionally, we propose a structure-aware explainability framework wherein gradient-weighted class activation mapping is only applied to the structural branch of the proposed CortiNet architecture. This choice allows the model to only focus on the structural features, making it robust against speckle noise. We evaluate CortiNet on 10,692 expert-annotated images spanning nine clinically relevant gallbladder disease categories. Experimental results demonstrate that CortiNet achieves high diagnostic accuracy (98.74%) with only a fraction of the parameters required by conventional deep convolutional models.

[424] SRVAU-R1: Enhancing Video Anomaly Understanding via Reflection-Aware Learning

Zihao Zhao, Shengting Cao, Muchao Ye

Main category: cs.CV

TL;DR: SRVAU-R1 introduces a reflection-aware learning framework for video anomaly understanding that enhances MLLM reasoning through self-reflection and self-correction mechanisms.

Details

Motivation: Existing MLLM-based approaches for video anomaly understanding focus on surface-level descriptions without deep reasoning capabilities like self-reflection and self-correction, limiting their effectiveness in understanding abnormal behaviors.

Method: Proposes SRVAU-R1 with: 1) First reflection-oriented Chain-of-Thought dataset for VAU with structured supervision (initial reasoning, self-reflection, revised reasoning), 2) Reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning.

Result: Extensive experiments on multiple video anomaly benchmarks show SRVAU-R1 consistently outperforms existing methods with significant improvements in both temporal anomaly localization accuracy and reasoning quality.

Conclusion: The reflection-aware learning framework effectively enhances MLLM reasoning capabilities for video anomaly understanding through structured self-reflection mechanisms.

Abstract: Multi-modal large language models (MLLMs) have demonstrated significant progress in reasoning capabilities and shown promising effectiveness in video anomaly understanding (VAU) tasks. However, existing MLLM-based approaches remain largely focused on surface-level descriptions of anomalies, lacking deep reasoning over abnormal behaviors like explicit self-reflection and self-correction. To address that, we propose Self-Reflection-Enhanced Reasoning for Video Anomaly Understanding (SRVAU-R1), a reflection-aware learning framework that incorporates reflection in MLLM reasoning. Specifically, SRVAU-R1 introduces the first reflection-oriented Chain-of-Thought dataset tailored for VAU, providing structured supervision with initial reasoning, self-reflection, and revised reasoning. Based on that, it includes a novel reflection-aware learning paradigm with supervised fine-tuning and reinforcement fine-tuning to enhance multi-modal reasoning for VAU. Extensive experiments on multiple video anomaly benchmarks demonstrate that SRVAU-R1 consistently outperforms existing methods, achieving significant improvements in both temporal anomaly localization accuracy and reasoning quality.

[425] LocalScore: Local Density-Aware Similarity Scoring for Biometrics

Yiyang Su, Minchul Kim, Jie Zhu, Christopher Perry, Feng Liu, Anil Jain, Xiaoming Liu

Main category: cs.CV

TL;DR: LocalScore improves open-set biometric recognition by using k-nearest neighbors to incorporate local gallery density, achieving better performance on unseen subjects across multiple modalities.

Details

Motivation: Traditional biometric systems struggle with open-set scenarios where probe subjects may not be enrolled in the gallery. Existing methods collapse intra-subject variability into single representations, leading to poor open-set robustness, especially with multi-sample galleries common in real-world deployments.

Method: Proposes LocalScore, a simple scoring algorithm that explicitly incorporates local density of gallery feature distribution using k-th nearest neighbors. The method is architecture-agnostic, loss-independent, and adds negligible computational overhead, making it plug-and-play for existing systems.

Result: Extensive experiments across multiple modalities show LocalScore achieves substantial gains: open-set retrieval FNIR@FPIR reduced from 53% to 40%, and verification TAR@FAR improved from 51% to 74%. Theoretical analysis explains when the method achieves most significant gains based on dataset characteristics.

Conclusion: LocalScore effectively addresses open-set biometric challenges by incorporating local gallery density, providing consistent performance improvements across modalities with minimal computational cost, making it a practical solution for real-world deployments.

Abstract: Open-set biometrics faces challenges with probe subjects who may not be enrolled in the gallery, as traditional biometric systems struggle to detect these non-mated probes. Despite the growing prevalence of multi-sample galleries in real-world deployments, most existing methods collapse intra-subject variability into a single global representation, leading to suboptimal decision boundaries and poor open-set robustness. To address this issue, we propose LocalScore, a simple yet effective scoring algorithm that explicitly incorporates the local density of the gallery feature distribution using the k-th nearest neighbors. LocalScore is architecture-agnostic, loss-independent, and incurs negligible computational overhead, making it a plug-and-play solution for existing biometric systems. Extensive experiments across multiple modalities demonstrate that LocalScore consistently achieves substantial gains in open-set retrieval (FNIR@FPIR reduced from 53% to 40%) and verification (TAR@FAR improved from 51% to 74%). We further provide theoretical analysis and empirical validation explaining when and why the method achieves the most significant gains based on dataset characteristics.

[426] Effectiveness of Automatically Curated Dataset in Thyroid Nodules Classification Algorithms Using Deep Learning

Jichen Yang, Jikai Zhang, Benjamin Wildman-Tobriner, Maciej A. Mazurowski

Main category: cs.CV

TL;DR: Automatically-curated thyroid nodule ultrasound datasets improve deep learning model performance over manually annotated datasets for cancer classification.

Details

Motivation: Limited availability of curated medical imaging datasets for training deep learning models, especially for thyroid nodule cancer diagnosis using ultrasound images. Previous automatic curation methods showed promise but their utility for training models was unknown.

Method: Trained deep learning models on three datasets: 1) manually annotated dataset, 2) automatically-curated dataset, and 3) accurate subset of automatically-curated data. Compared performance using AUC metrics with statistical significance testing.

Result: Automatically-curated dataset trained model achieved AUC of 0.694 (CI: 0.67-0.73), significantly better than manually annotated dataset’s AUC of 0.643 (CI: 0.62-0.66). Accurate subset performed similarly (AUC 0.689) to full automatically-curated dataset.

Conclusion: Automatically-curated datasets substantially improve deep learning algorithm performance for thyroid nodule classification, and using all automatically-curated data is better than using only accurate subsets.

Abstract: The diagnosis of thyroid nodule cancers commonly utilizes ultrasound images. Several studies showed that deep learning algorithms designed to classify benign and malignant thyroid nodules could match radiologists’ performance. However, data availability for training deep learning models is often limited due to the significant effort required to curate such datasets. The previous study proposed a method to curate thyroid nodule datasets automatically. It was tested to have a 63% yield rate and 83% accuracy. However, the usefulness of the generated data for training deep learning models remains unknown. In this study, we conducted experiments to determine whether using a automatically-curated dataset improves deep learning algorithms’ performance. We trained deep learning models on the manually annotated and automatically-curated datasets. We also trained with a smaller subset of the automatically-curated dataset that has higher accuracy to explore the optimum usage of such dataset. As a result, the deep learning model trained on the manually selected dataset has an AUC of 0.643 (95% confidence interval [CI]: 0.62, 0.66). It is significantly lower than the AUC of the 6automatically-curated dataset trained deep learning model, 0.694 (95% confidence interval [CI]: 0.67, 0.73, P < .001). The AUC of the accurate subset trained deep learning model is 0.689 (95% confidence interval [CI]: 0.66, 0.72, P > .43), which is insignificantly worse than the AUC of the full automatically-curated dataset. In conclusion, we showed that using a automatically-curated dataset can substantially increase the performance of deep learning algorithms, and it is suggested to use all the data rather than only using the accurate subset.

[427] GMAC: Global Multi-View Constraint for Automatic Multi-Camera Extrinsic Calibration

Chentian Sun

Main category: cs.CV

TL;DR: GMAC: A multi-camera extrinsic calibration framework using implicit geometric representations from multi-view reconstruction networks, enabling accurate extrinsic estimation without explicit 3D reconstruction or manual calibration.

Details

Motivation: Existing multi-camera calibration methods rely on calibration targets, explicit geometric modeling, or task-specific neural networks, which lack robustness and applicability in complex dynamic environments or online scenarios, limiting practical deployment.

Method: GMAC models extrinsics as global variables constrained by latent multi-view geometric structure, prunes and reconfigures existing networks to use their latent features for extrinsic prediction via lightweight regression head, and jointly optimizes cross-view reprojection consistency and multi-view cycle consistency.

Result: Experiments on synthetic and real-world multi-camera datasets show GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration.

Conclusion: GMAC provides a new solution for efficient deployment and online calibration of multi-camera systems by leveraging implicit geometric representations from existing networks.

Abstract: Automatic calibration of multi-camera systems, namely the accurate estimation of spatial extrinsic parameters, is fundamental for 3D reconstruction, panoramic perception, and multi-view data fusion. Existing methods typically rely on calibration targets, explicit geometric modeling, or task-specific neural networks. Such approaches often exhibit limited robustness and applicability in complex dynamic environments or online scenarios, making them difficult to deploy in practical applications. To address this, this paper proposes GMAC, a multi-camera extrinsic estimation framework based on the implicit geometric representations learned by multi-view reconstruction networks. GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure and prunes and structurally reconfigures existing networks so that their latent features can directly support extrinsic prediction through a lightweight regression head, without requiring a completely new network design. Furthermore, GMAC jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability. Experiments on both synthetic and real-world multi-camera datasets demonstrate that GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration, providing a new solution for efficient deployment and online calibration of multi-camera systems.

[428] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence

Chentian Sun

Main category: cs.CV

TL;DR: FUSE-Flow: A real-time, linearly scalable framework for multi-view point cloud reconstruction using adaptive spatial hashing and weighted fusion to achieve high-quality streaming reconstruction with GPU parallelization.

Details

Motivation: Real-time multi-view point cloud reconstruction is crucial for VR/AR, robotics, and digital twins, but existing methods suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility.

Method: Frame-wise, stateless framework where each frame independently generates point cloud fragments, fused via measurement confidence and 3D distance consistency weights. Uses adaptive spatial hashing-based weighted aggregation: 3D space is partitioned by local point density, representative points are selected per cell, and weighted fusion handles sparse/dense regions with GPU parallelization.

Result: Achieves high-throughput, low-latency point cloud generation with linear complexity. Improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes while maintaining real-time frame rates on modern GPUs.

Conclusion: FUSE-Flow effectively addresses the challenges of real-time multi-view point cloud reconstruction, demonstrating robustness, scalability, and practical applicability for immersive perception applications.

Abstract: Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, fused via two weights, measurement confidence and 3D distance consistency to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.

[429] VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models

Guangshuo Qin, Zhiteng Li, Zheng Chen, Weihang Zhang, Linghe Kong, Yulun Zhang

Main category: cs.CV

TL;DR: VEQ is a dual-aware quantization framework for Mixture-of-Experts Vision-Language Models that addresses cross-modal differences and expert heterogeneity through modality-expert-aware and modality-affinity-aware quantization techniques.

Details

Motivation: Mixture-of-Experts VLMs offer strong performance but have prohibitive memory/computational costs. Existing quantization methods fail to address two critical heterogeneities: differences between vision/language tokens and non-uniform contributions of different experts.

Method: Proposes Visual Expert Quantization (VEQ) with two components: 1) Modality-expert-aware Quantization uses expert activation frequency to prioritize error minimization for pivotal experts; 2) Modality-affinity-aware Quantization constructs enhanced Hessian matrix integrating token-expert affinity with modality information to guide calibration.

Result: Extensive experiments show VEQ consistently outperforms SOTA baselines. Under W3A16 configuration, achieves average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to previous SOTA quantization methods, demonstrating superior robustness across multimodal tasks.

Conclusion: VEQ effectively addresses the dual heterogeneity in MoE VLMs through modality-aware quantization techniques, achieving significant performance improvements while reducing computational costs, making large-scale multimodal models more practical.

Abstract: Mixture-of-Experts(MoE) Vision-Language Models (VLMs) offer remarkable performance but incur prohibitive memory and computational costs, making compression essential. Post-Training Quantization (PTQ) is an effective training-free technique to address the massive memory and computation overhead. Existing quantization paradigms fall short as they are oblivious to two critical forms of heterogeneity: the inherent discrepancy between vision and language tokens, and the non-uniform contribution of different experts. To bridge this gap, we propose Visual Expert Quantization (VEQ), a dual-aware quantization framework designed to simultaneously accommodate cross-modal differences and heterogeneity between experts. Specifically, VEQ incorporates 1)Modality-expert-aware Quantization, which utilizes expert activation frequency to prioritize error minimization for pivotal experts, and 2)Modality-affinity-aware Quantization, which constructs an enhanced Hessian matrix by integrating token-expert affinity with modality information to guide the calibration process. Extensive experiments across diverse benchmarks verify that VEQ consistently outperforms state-of-the-art baselines. Specifically, under the W3A16 configuration, our method achieves significant average accuracy gains of 2.04% on Kimi-VL and 3.09% on Qwen3-VL compared to the previous SOTA quantization methods, demonstrating superior robustness across various multimodal tasks. Our code will be available at https://github.com/guangshuoqin/VEQ.

[430] From Videos to Conversations: Egocentric Instructions for Task Assistance

Lavisha Aggarwal, Vikas Bahirwani, Andrea Colaco

Main category: cs.CV

TL;DR: A framework to automatically convert single-person instructional videos into two-person multimodal task-guidance conversations, creating the HowToDIV dataset for AI-assisted AR task guidance.

Details

Motivation: There's a need for expert knowledge in everyday tasks, but AI agents for AR assistance are limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, due to the high cost and complexity of human-assisted data collection.

Method: A fully automatic pipeline using large language models to transform single-person instructional videos into two-person multimodal task-guidance conversations, creating expert-novice interactions from existing video content.

Result: Created HowToDIV dataset with 507 conversations, 6,636 question-answer pairs, and 24 hours of video across multiple domains, with baseline results using Gemma 3 and Qwen 2.5 models.

Conclusion: The framework provides a scalable, cost-efficient alternative to traditional data collection for multimodal conversational datasets, enabling progress in AI agents for AR task assistance.

Abstract: Many everyday tasks, ranging from appliance repair and cooking to car maintenance, require expert knowledge, particularly for complex, multi-step procedures. Despite growing interest in AI agents for augmented reality (AR) assistance, progress remains limited by the scarcity of large-scale multimodal conversational datasets grounded in real-world task execution, in part due to the cost and logistical complexity of human-assisted data collection. In this paper, we present a framework to automatically transform single person instructional videos into two-person multimodal task-guidance conversations. Our fully automatic pipeline, based on large language models, provides a scalable and cost efficient alternative to traditional data collection approaches. Using this framework, we introduce HowToDIV, a multimodal dataset comprising 507 conversations, 6,636 question answer pairs, and 24 hours of video spanning multiple domains. Each session consists of a multi-turn expert-novice interaction. Finally, we report baseline results using Gemma 3 and Qwen 2.5 on HowToDIV, providing an initial benchmark for multimodal procedural task assistance.

[431] ReLayout: Versatile and Structure-Preserving Design Layout Editing via Relation-Aware Design Reconstruction

Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li, Jiang Bian

Main category: cs.CV

TL;DR: ReLayout is a framework for automated design layout editing that preserves layout structure while modifying designs based on user intents, using a multimodal LLM backbone without requiring triplet training data.

Details

Motivation: The paper addresses the challenge of automated design layout editing where user needs expressed in natural language are ambiguous, and there's a scarcity of training data (original design, editing operation, edited design triplets). The goal is to enable versatile design modifications while preserving the layout structure of unedited elements.

Method: ReLayout introduces a relation graph capturing position and size relationships among unedited elements as constraints for structure preservation. It uses relation-aware design reconstruction (RADR) which learns to reconstruct designs from elements, relation graphs, and synthesized editing operations in a self-supervised manner. A multimodal large language model serves as the backbone, unifying multiple editing actions within a single model.

Result: Qualitative, quantitative results and user studies show that ReLayout significantly outperforms baseline models in editing quality, accuracy, and layout structure preservation.

Conclusion: ReLayout provides an effective framework for versatile, structure-preserving design layout editing without requiring triplet training data, advancing automated redesign workflows.

Abstract: Automated redesign without manual adjustments marks a key step forward in the design workflow. In this work, we focus on a foundational redesign task termed design layout editing, which seeks to autonomously modify the geometric composition of a design based on user intents. To overcome the ambiguity of user needs expressed in natural language, we introduce four basic and important editing actions and standardize the format of editing operations. The underexplored task presents a unique challenge: satisfying specified editing operations while simultaneously preserving the layout structure of unedited elements. Besides, the scarcity of triplet (original design, editing operation, edited design) samples poses another formidable challenge. To this end, we present ReLayout, a novel framework for versatile and structure-preserving design layout editing that operates without triplet data. Specifically, ReLayout first introduces the relation graph, which contains the position and size relationships among unedited elements, as the constraint for layout structure preservation. Then, relation-aware design reconstruction (RADR) is proposed to bypass the data challenge. By learning to reconstruct a design from its elements, a relation graph, and a synthesized editing operation, RADR effectively emulates the editing process in a self-supervised manner. A multi-modal large language model serves as the backbone for RADR, unifying multiple editing actions within a single model and thus achieving versatile editing after fine-tuning. Qualitative, quantitative results and user studies show that ReLayout significantly outperforms the baseline models in terms of editing quality, accuracy, and layout structure preservation.

[432] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong

Main category: cs.CV

TL;DR: ResDec is a training-free method that uses historical decoding information to suppress hallucinations in Large Vision-Language Models by correcting biases through token logits evolution.

Details

Motivation: Large Vision-Language Models suffer from language priors and hallucinations - generating content that appears coherent but doesn't match visual input, limiting their reliability in multimodal tasks.

Method: Residual Decoding (ResDec) leverages historical information during decoding, using the model’s internal implicit reasoning mechanism and token logits evolution to correct biases without additional training.

Result: ResDec effectively suppresses hallucinations from language priors, improves visual grounding, reduces object hallucinations, and performs well on comprehensive LVLM benchmarks.

Conclusion: ResDec is a broadly applicable, training-free approach that addresses hallucination issues in LVLMs while maintaining strong performance across multimodal benchmarks.

Abstract: Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to actual visual input. To address this problem, we propose Residual Decoding (ResDec). It is a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and token logits evolution mechanism of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.

[433] Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis

Bo Deng, Yitong Tang, Jiake Li, Yuxin Huang, Li Wang, Yu Zhang, Yufei Zhan, Hua Lu, Xiaoshen Zhang, Jieyun Bai

Main category: cs.CV

TL;DR: A unified multi-task learning framework for ultrasound image analysis across 27 diverse tasks, establishing a baseline for ultrasound foundation models.

Details

Motivation: Ultrasound imaging has substantial heterogeneity across anatomical structures and acquisition protocols, making it challenging to develop generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models.

Method: Uses a Multi-Head Multi-Task Learning (MH-MTL) framework with an ImageNet-pretrained EfficientNet-B4 backbone and Feature Pyramid Network (FPN) for multi-scale feature extraction. Implements task-specific routing where global tasks use high-level semantic features and dense prediction tasks use spatially detailed FPN representations. Training includes composite loss with task-adaptive learning rate scaling and cosine annealing schedule.

Result: Validation results demonstrate the feasibility and robustness of the unified design, establishing a strong and extensible baseline for ultrasound foundation model research.

Conclusion: The proposed framework provides a viable approach for developing generalizable ultrasound analysis models that can handle multiple tasks within a single shared network, addressing the heterogeneity challenge in ultrasound imaging.

Abstract: Ultrasound (US) imaging exhibits substantial heterogeneity across anatomical structures and acquisition protocols, posing significant challenges to the development of generalizable analysis models. Most existing methods are task-specific, limiting their suitability as clinically deployable foundation models. To address this limitation, the Foundation Model Challenge for Ultrasound Image Analysis (FM_UIA~~2026) introduces a large-scale multi-task benchmark comprising 27 subtasks across segmentation, classification, detection, and regression. In this paper, we present the official baseline for FM_UIA~~2026 based on a unified Multi-Head Multi-Task Learning (MH-MTL) framework that supports all tasks within a single shared network. The model employs an ImageNet-pretrained EfficientNet–B4 backbone for robust feature extraction, combined with a Feature Pyramid Network (FPN) to capture multi-scale contextual information. A task-specific routing strategy enables global tasks to leverage high-level semantic features, while dense prediction tasks exploit spatially detailed FPN representations. Training incorporates a composite loss with task-adaptive learning rate scaling and a cosine annealing schedule. Validation results demonstrate the feasibility and robustness of this unified design, establishing a strong and extensible baseline for ultrasound foundation model research. The code and dataset are publicly available at \href{https://github.com/lijiake2408/Foundation-Model-Challenge-for-Ultrasound-Image-Analysis}{GitHub}.

[434] Radioactive 3D Gaussian Ray Tracing for Tomographic Reconstruction

Ling Chen, Bao Yang

Main category: cs.CV

TL;DR: A tomographic reconstruction framework using 3D Gaussian ray tracing instead of splatting, enabling more physically accurate line integrals and better handling of nonlinear geometric corrections for CT and PET imaging.

Details

Motivation: While 3D Gaussian Splatting (3DGS) and its extension R2-Gaussian show promise for tomographic reconstruction, the local affine approximation used in splatting degrades quantitative accuracy and complicates nonlinear geometric corrections needed for realistic tomography systems.

Method: Proposes a tomographic reconstruction framework based on 3D Gaussian ray tracing instead of splatting. Computes line integrals through 3D Gaussian primitives analytically, avoiding local affine collapse. Provides explicit control over ray origins and directions to facilitate precise application of nonlinear geometric corrections like arc-correction in PET.

Result: The ray-tracing approach yields a more physically consistent forward projection model and extends applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.

Conclusion: Ray tracing with 3D Gaussians overcomes limitations of splatting-based approaches, enabling more accurate tomographic reconstruction with better handling of complex geometric corrections for medical imaging applications.

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged in computer vision as a promising rendering technique. By adapting the principles of Elliptical Weighted Average (EWA) splatting to a modern differentiable pipeline, 3DGS enables real-time, high-quality novel view synthesis. Building upon this, R2-Gaussian extended the 3DGS paradigm to tomographic reconstruction by rectifying integration bias, achieving state-of-the-art performance in computed tomography (CT). To enable differentiability, R2-Gaussian adopts a local affine approximation: each 3D Gaussian is locally mapped to a 2D Gaussian on the detector and composed via alpha blending to form projections. However, the affine approximation can degrade reconstruction quantitative accuracy and complicate the incorporation of nonlinear geometric corrections. To address these limitations, we propose a tomographic reconstruction framework based on 3D Gaussian ray tracing. Our approach provides two key advantages over splatting-based models: (i) it computes the line integral through 3D Gaussian primitives analytically, avoiding the local affine collapse and thus yielding a more physically consistent forward projection model; and (ii) the ray-tracing formulation gives explicit control over ray origins and directions, which facilitates the precise application of nonlinear geometric corrections, e.g., arc-correction used in positron emission tomography (PET). These properties extend the applicability of Gaussian-based reconstruction to a wider range of realistic tomography systems while improving projection accuracy.

[435] PDE-Constrained Optimization for Neural Image Segmentation with Physics Priors

Seema K. Poudel, Sunny K. Khadka

Main category: cs.CV

TL;DR: PDE-constrained optimization framework for microscopy image segmentation that integrates physical priors through variational regularization to improve stability and generalization.

Details

Motivation: Microscopy image segmentation faces challenges due to noise, weak boundaries, and limited labeled data. Unconstrained deep learning often leads to unstable solutions and poor generalization, motivating the need for physically-informed regularization.

Method: Formulates segmentation as PDE-constrained optimization with composite objective: data fidelity term + penalty terms from reaction-diffusion equations and phase-field interface energies. Uses differentiable residual losses and evaluates on LIVECell dataset with UNet baseline.

Result: Consistent improvements in segmentation accuracy and boundary fidelity compared to unconstrained baselines. Enhanced stability and better generalization in low-sample regimes, particularly when evaluated on unseen cell types.

Conclusion: PDE-constrained optimization provides principled bridge between variational methods and deep learning, demonstrating value of structured priors for scientific machine learning applications.

Abstract: Segmentation of microscopy images constitutes an ill-posed inverse problem due to measurement noise, weak object boundaries, and limited labeled data. Although deep neural networks provide flexible nonparametric estimators, unconstrained empirical risk minimization often leads to unstable solutions and poor generalization. In this work, image segmentation is formulated as a PDE-constrained optimization problem that integrates physically motivated priors into deep learning models through variational regularization. The proposed framework minimizes a composite objective function consisting of a data fidelity term and penalty terms derived from reaction-diffusion equations and phase-field interface energies, all implemented as differentiable residual losses. Experiments are conducted on the LIVECell dataset, a high-quality, manually annotated collection of phase-contrast microscopy images. Training is performed on two cell types, while evaluation is carried out on a distinct, unseen cell type to assess generalization. A UNet architecture is used as the unconstrained baseline model. Experimental results demonstrate consistent improvements in segmentation accuracy and boundary fidelity compared to unconstrained deep learning baselines. Moreover, the PDE-regularized models exhibit enhanced stability and improved generalization in low-sample regimes, highlighting the advantages of incorporating structured priors. The proposed approach illustrates how PDE-constrained optimization can strengthen data-driven learning frameworks, providing a principled bridge between variational methods, statistical learning, and scientific machine learning.

[436] PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie

Main category: cs.CV

TL;DR: PISA introduces a training-free piecewise sparse attention method for diffusion transformers that maintains exact computation for critical blocks while efficiently approximating non-critical blocks via block-wise Taylor expansion, achieving significant speedups without quality degradation.

Details

Motivation: Current block sparse attention methods for diffusion transformers suffer from quality degradation at high sparsity levels because they discard non-critical context information entirely, creating a trade-off between computational efficiency and generation quality.

Method: PISA uses a novel exact-or-approximate strategy: critical blocks are computed exactly while non-critical blocks are approximated efficiently using block-wise Taylor expansion, leveraging the discovery that attention scores of non-critical blocks exhibit distributional stability.

Result: PISA achieves 1.91× speedup on Wan2.1-14B, 2.57× on Hunyuan-Video, and 1.2× on FLUX for image generation while maintaining the highest quality among sparse attention methods, effectively bridging the speed-quality gap.

Conclusion: PISA demonstrates that approximating rather than discarding non-critical blocks enables efficient attention with sub-quadratic complexity while preserving full attention span, offering a superior alternative to conventional keep-or-drop sparse attention for diffusion transformers.

Abstract: Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value blocks, it suffers from degradation at high sparsity by discarding context. In this work, we discover that attention scores of non-critical blocks exhibit distributional stability, allowing them to be approximated accurately and efficiently rather than discarded, which is essentially important for sparse attention design. Motivated by this key insight, we propose PISA, a training-free Piecewise Sparse Attention that covers the full attention span with sub-quadratic complexity. Unlike the conventional keep-or-drop paradigm that directly drop the non-critical block information, PISA introduces a novel exact-or-approximate strategy: it maintains exact computation for critical blocks while efficiently approximating the remainder through block-wise Taylor expansion. This design allows PISA to serve as a faithful proxy to full attention, effectively bridging the gap between speed and quality. Experimental results demonstrate that PISA achieves 1.91 times and 2.57 times speedups on Wan2.1-14B and Hunyuan-Video, respectively, while consistently maintaining the highest quality among sparse attention methods. Notably, even for image generation on FLUX, PISA achieves a 1.2 times acceleration without compromising visual quality. Code is available at: https://github.com/xie-lab-ml/piecewise-sparse-attention.

[437] MedAD-R1: Eliciting Consistent Reasoning in Interpretible Medical Anomaly Detection via Consistency-Reinforced Policy Optimization

Haitao Zhang, Yingying Wang, Jiaxiang Wang, Haote Xu, Hongyang Zhang, Yirong Chen, Yue Huang, Xinghao Ding

Main category: cs.CV

TL;DR: MedAD-R1 model achieves SOTA on MedAD-38K benchmark using two-stage training with cognitive injection and consistency-optimized policy optimization for transparent medical reasoning.

Details

Motivation: Current medical anomaly detection models rely on simplistic SFT datasets, lacking plausible reasoning and robust multimodal generalization needed for trustworthy clinical decision support.

Method: Two-stage framework: 1) Cognitive Injection via SFT for medical knowledge and structured think-then-answer paradigm; 2) Consistency Group Relative Policy Optimization (Con-GRPO) with consistency reward to ensure reasoning aligns with final diagnosis.

Result: MedAD-R1 achieves SOTA on MedAD-38K benchmark, outperforming baselines by >10%, generating transparent and logically consistent reasoning pathways.

Conclusion: The approach enhances trustworthiness and interpretability of AI for clinical decision support through structured reasoning and consistency optimization.

Abstract: Medical Anomaly Detection (MedAD) presents a significant opportunity to enhance diagnostic accuracy using Large Multimodal Models (LMMs) to interpret and answer questions based on medical images. However, the reliance on Supervised Fine-Tuning (SFT) on simplistic and fragmented datasets has hindered the development of models capable of plausible reasoning and robust multimodal generalization. To overcome this, we introduce MedAD-38K, the first large-scale, multi-modal, and multi-center benchmark for MedAD featuring diagnostic Chain-of-Thought (CoT) annotations alongside structured Visual Question-Answering (VQA) pairs. On this foundation, we propose a two-stage training framework. The first stage, Cognitive Injection, uses SFT to instill foundational medical knowledge and align the model with a structured think-then-answer paradigm. Given that standard policy optimization can produce reasoning that is disconnected from the final answer, the second stage incorporates Consistency Group Relative Policy Optimization (Con-GRPO). This novel algorithm incorporates a crucial consistency reward to ensure the generated reasoning process is relevant and logically coherent with the final diagnosis. Our proposed model, MedAD-R1, achieves state-of-the-art (SOTA) performance on the MedAD-38K benchmark, outperforming strong baselines by more than 10%. This superior performance stems from its ability to generate transparent and logically consistent reasoning pathways, offering a promising approach to enhancing the trustworthiness and interpretability of AI for clinical decision support.

[438] Differential Vector Erasure: Unified Training-Free Concept Erasure for Flow Matching Models

Zhiqi Zhang, Xinhao Zhong, Yi Sun, Shuoyang Sun, Bin Chen, Shu-Tao Xia, Xuan Wang

Main category: cs.CV

TL;DR: Differential Vector Erasure (DVE) is a training-free method for concept erasure in flow matching models that removes undesirable concepts by projecting velocity fields onto differential directions between target and anchor concepts.

Details

Motivation: Text-to-image diffusion models can generate undesirable content (NSFW, copyrighted styles, objects), but existing concept erasure methods focus on DDPM-based models and require fine-tuning. Flow matching models represent a different generative paradigm needing specialized approaches.

Method: DVE constructs a differential vector field capturing directional discrepancy between target and anchor concepts in the velocity field. During inference, it projects the velocity field onto this differential direction to remove concept-specific components without affecting other semantics.

Result: Extensive experiments on FLUX show DVE outperforms existing baselines across NSFW suppression, artistic style removal, and object erasure tasks while preserving image quality and diversity.

Conclusion: DVE provides an effective training-free solution for concept erasure in flow matching models, addressing safety and control concerns in text-to-image generation without compromising performance.

Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating high-quality images, yet their tendency to reproduce undesirable concepts, such as NSFW content, copyrighted styles, or specific objects, poses growing concerns for safe and controllable deployment. While existing concept erasure approaches primarily focus on DDPM-based diffusion models and rely on costly fine-tuning, the recent emergence of flow matching models introduces a fundamentally different generative paradigm for which prior methods are not directly applicable. In this paper, we propose Differential Vector Erasure (DVE), a training-free concept erasure method specifically designed for flow matching models. Our key insight is that semantic concepts are implicitly encoded in the directional structure of the velocity field governing the generative flow. Leveraging this observation, we construct a differential vector field that characterizes the directional discrepancy between a target concept and a carefully chosen anchor concept. During inference, DVE selectively removes concept-specific components by projecting the velocity field onto the differential direction, enabling precise concept suppression without affecting irrelevant semantics. Extensive experiments on FLUX demonstrate that DVE consistently outperforms existing baselines on a wide range of concept erasure tasks, including NSFW suppression, artistic style removal, and object erasure, while preserving image quality and diversity.

[439] PandaPose: 3D Human Pose Lifting from a Single Image via Propagating 2D Pose Prior to 3D Anchor Space

Jinghong Zheng, Changlong Jiang, Yang Xiao, Jiaqi Li, Haohong Kuang, Hang Xu, Ran Wang, Zhiguo Cao, Min Du, Joey Tianyi Zhou

Main category: cs.CV

TL;DR: PandaPose: A 3D human pose lifting method that propagates 2D pose prior to 3D anchor space as intermediate representation to address error propagation and self-occlusion issues in single RGB image 3D pose estimation.

Details

Motivation: Existing methods for 3D human pose lifting from single RGB images suffer from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions, and inherent difficulties in handling self-occlusion cases where joints are not visible in the 2D image.

Method: Proposes PandaPose with a 3D anchor space comprising: (1) Joint-wise 3D anchors in canonical coordinate system for robust priors, (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion, (3) Anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information, which are then used for anchor-to-joint ensemble prediction.

Result: Experiments on Human3.6M, MPI-INF-3DHP and 3DPW benchmarks demonstrate superiority, with substantial reduction in error by 14.7% compared to SOTA methods on challenging Human3.6M conditions. Qualitative comparisons further showcase effectiveness and robustness.

Conclusion: PandaPose effectively addresses error propagation and self-occlusion issues in 3D human pose lifting through its novel 3D anchor space approach, achieving state-of-the-art performance on established benchmarks.

Abstract: 3D human pose lifting from a single RGB image is a challenging task in 3D vision. Existing methods typically establish a direct joint-to-joint mapping from 2D to 3D poses based on 2D features. This formulation suffers from two fundamental limitations: inevitable error propagation from input predicted 2D pose to 3D predictions and inherent difficulties in handling self-occlusion cases. In this paper, we propose PandaPose, a 3D human pose lifting approach via propagating 2D pose prior to 3D anchor space as the unified intermediate representation. Specifically, our 3D anchor space comprises: (1) Joint-wise 3D anchors in the canonical coordinate system, providing accurate and robust priors to mitigate 2D pose estimation inaccuracies. (2) Depth-aware joint-wise feature lifting that hierarchically integrates depth information to resolve self-occlusion ambiguities. (3) The anchor-feature interaction decoder that incorporates 3D anchors with lifted features to generate unified anchor queries encapsulating joint-wise 3D anchor set, visual cues and geometric depth information. The anchor queries are further employed to facilitate anchor-to-joint ensemble prediction. Experiments on three well-established benchmarks (i.e., Human3.6M, MPI-INF-3DHP and 3DPW) demonstrate the superiority of our proposition. The substantial reduction in error by $14.7%$ compared to SOTA methods on the challenging conditions of Human3.6M and qualitative comparisons further showcase the effectiveness and robustness of our approach.

[440] Robust Harmful Meme Detection under Missing Modalities via Shared Representation Learning

Felix Breiteneder, Mohammad Belal, Muhammad Saad Saeed, Shahed Masoudian, Usman Naseem, Kulshrestha Juhi, Markus Schedl, Shah Nawaz

Main category: cs.CV

TL;DR: Proposes a multimodal harmful meme detection method robust to missing text modality, learning shared representations that work when OCR fails or text is absent.

Details

Motivation: Real-world meme detection faces modal-incomplete data (e.g., poor OCR quality missing text), but existing methods rely on complete multimodal data and degrade when modalities are missing.

Method: Learns shared representations for multiple modalities by projecting them independently, enabling use when data is modal-incomplete, reducing text dependence and improving visual feature integration.

Result: Outperforms existing approaches on two benchmark datasets when text is missing, showing better visual feature integration and reduced text dependence.

Conclusion: First comprehensive investigation of harmful meme detection with modal-incomplete data; method enables real-world application where modalities may be absent.

Abstract: Internet memes are powerful tools for communication, capable of spreading political, psychological, and sociocultural ideas. However, they can be harmful and can be used to disseminate hate toward targeted individuals or groups. Although previous studies have focused on designing new detection methods, these often rely on modal-complete data, such as text and images. In real-world settings, however, modalities like text may be missing due to issues like poor OCR quality, making existing methods sensitive to missing information and leading to performance deterioration. To address this gap, in this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of harmful meme detection methods in the presence of modal-incomplete data. Specifically, we propose a new baseline method that learns a shared representation for multiple modalities by projecting them independently. These shared representations can then be leveraged when data is modal-incomplete. Experimental results on two benchmark datasets demonstrate that our method outperforms existing approaches when text is missing. Moreover, these results suggest that our method allows for better integration of visual features, reducing dependence on text and improving robustness in scenarios where textual information is missing. Our work represents a significant step forward in enabling the real-world application of harmful meme detection, particularly in situations where a modality is absent.

[441] LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions

Jingjing Wang, Qirui Hu, Chong Bao, Yuke Zhu, Hujun Bao, Zhaopeng Cui, Guofeng Zhang

Main category: cs.CV

TL;DR: LightCity: A synthetic urban dataset for inverse rendering with diverse illumination conditions, enabling benchmarking of intrinsic decomposition and 3D reconstruction in complex urban scenes.

Details

Motivation: Inverse rendering in urban scenes faces challenges due to complex illumination conditions (multi-illumination, indirect light, shadows), but lacks appropriate datasets to study these effects on intrinsic decomposition and 3D reconstruction.

Method: Created LightCity, a high-quality synthetic urban dataset with over 300 sky maps, controllable illumination, street-level and aerial perspectives (50K+ images), and rich properties including depth, normal, material components, light and indirect light.

Result: Provides a comprehensive dataset enabling benchmarking of three fundamental tasks in urban environments, with analysis of these benchmarks to advance related research.

Conclusion: LightCity addresses the dataset gap for urban inverse rendering research, providing a robust foundation for studying illumination effects on intrinsic decomposition and 3D reconstruction.

Abstract: Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins. Yet, it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction have not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, varying scales with street-level and aerial perspectives over 50K images, and rich properties such as depth, normal, material components, light and indirect light, etc. Besides, we leverage LightCity to benchmark three fundamental tasks in the urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.

[442] Koo-Fu CLIP: Closed-Form Adaptation of Vision-Language Models via Fukunaga-Koontz Linear Discriminant Analysis

Matej Suchanek, Klara Janouskova, Ondrej Vasatko, Jiri Matas

Main category: cs.CV

TL;DR: Koo-Fu CLIP adapts CLIP embeddings using Fukunaga-Koontz Linear Discriminant Analysis to improve class separability and enable dimensionality reduction for supervised classification tasks.

Details

Motivation: Raw CLIP embeddings are not optimized for supervised classification, exhibiting limited class separation and excessive dimensionality, requiring adaptation for better discriminative performance.

Method: Uses Fukunaga-Koontz Linear Discriminant Analysis in whitened embedding space to suppress within-class variation and enhance between-class discrimination, creating a closed-form linear projection.

Result: Improves ImageNet-1K top-1 accuracy from 75.1% to 79.1%, with consistent gains on larger label spaces (14K and 21K classes), and supports 10-12x compression with minimal accuracy loss.

Conclusion: Koo-Fu CLIP provides an efficient, lightweight adaptation method that reshapes CLIP embedding geometry for improved supervised classification and enables efficient large-scale classification and retrieval.

Abstract: Visual-language models such as CLIP provide powerful general-purpose representations, but their raw embeddings are not optimized for supervised classification, often exhibiting limited class separation and excessive dimensionality. We propose Koo-Fu CLIP, a supervised CLIP adaptation method based on Fukunaga-Koontz Linear Discriminant Analysis, which operates in a whitened embedding space to suppress within-class variation and enhance between-class discrimination. The resulting closed-form linear projection reshapes the geometry of CLIP embeddings, improving class separability while performing effective dimensionality reduction, and provides a lightweight and efficient adaptation of CLIP representations. Across large-scale ImageNet benchmarks, nearest visual prototype classification in the Koo-Fu CLIP space improves top-1 accuracy from 75.1% to 79.1% on ImageNet-1K, with consistent gains persisting as the label space expands to 14K and 21K classes. The method supports substantial compression by up to 10-12x with little or no loss in accuracy, enabling efficient large-scale classification and retrieval.

[443] Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs

Daniel Yezid Guarnizo Orjuela, Leonardo Scappatura, Veronica Di Gennaro, Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci

Main category: cs.CV

TL;DR: Vision-Language-Action models are vulnerable to image corruptions like sensor noise, causing catastrophic performance drops. The Corruption Restoration Transformer (CRT) is introduced as a plug-and-play solution to restore clean observations from corrupted inputs without fine-tuning the underlying VLA model.

Details

Motivation: VLA models show fragility to visual disturbances in real-world deployment. While physical occlusions are well-studied, image corruptions (sensor-level artifacts like electronic noise, dead pixels, lens contaminants) remain largely unexplored but critically compromise visual signal integrity before interpretation.

Method: Introduces Corruption Restoration Transformer (CRT), a plug-and-play, model-agnostic vision transformer that uses adversarial training to restore clean observations from corrupted inputs. CRT works without computationally expensive fine-tuning of the underlying VLA model.

Result: State-of-the-art VLAs (π₀.₅ and SmolVLA) drop from 90% to as low as 2% success rates under common signal artifacts. CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates even under severe visual corruption, as demonstrated across LIBERO and Meta-World benchmarks.

Conclusion: Image corruptions pose a critical vulnerability for VLA models in real-world deployment. CRT provides an effective, practical solution to immunize VLAs against sensor disturbances without requiring model retraining, addressing a significant gap in robustness for multimodal robotic systems.

Abstract: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $π_{0.5}$ and SmolVLA, suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.

[444] Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models

Chunliang Hua, Zeyuan Yang, Lei Zhang, Jiayang Sun, Fengwen Chen, Chunlan Zeng, Xiao Hu

Main category: cs.CV

TL;DR: A framework using Remote Sensing imagery and Multimodal LLMs for UAV emergency landing site assessment, combining semantic segmentation with vision-language reasoning to identify complex risks beyond geometric features.

Details

Motivation: Traditional geometric sensors fail to identify semantic risks (crowds, temporary structures) for UAV emergency landing. Need for global context-aware assessment that understands complex hazards invisible to pure geometric analysis.

Method: Coarse-to-fine pipeline: 1) Lightweight semantic segmentation pre-screens candidate areas, 2) Vision-language reasoning agent fuses visual features with Point-of-Interest data to detect subtle hazards using MLLMs.

Result: Framework significantly outperforms geometric baselines in risk identification accuracy. Generates human-like, interpretable justifications. Emergency Landing Site Selection (ELSS) benchmark dataset released publicly.

Conclusion: MLLM-based approach enables comprehensive risk assessment for UAV emergency landing by understanding semantic context, enhancing trust through interpretable reasoning.

Abstract: Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at https://anonymous.4open.science/r/ELSS-dataset-43D7.

[445] EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan, Guangtao Zhai, Xiongkuo Min

Main category: cs.CV

TL;DR: EEmoDB: Largest image-evoked emotion understanding dataset with 5 analysis dimensions and 5 task categories, plus EEmo-Logic MLLM for comprehensive emotion perception and reasoning.

Details

Motivation: Existing models have limited coarse-grained emotion perception and deficient reasoning capabilities for image-evoked emotions, which is crucial for machine empathy and human-computer interaction applications.

Method: Created EEmoDB dataset with 1.2M QA pairs from 125k images (EEmoDB-QA) and 36k dataset from 25k images for fine-grained assessment (EEmoDB-Assess). Developed EEmo-Logic MLLM via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design.

Result: EEmo-Logic achieves robust performance in both in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment tasks.

Conclusion: The proposed dataset and model advance comprehensive image-evoked emotion understanding, enabling better machine empathy and human-computer interaction applications.

Abstract: Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features $5$ analysis dimensions spanning $5$ distinct task categories, facilitating comprehensive interpretation. Specifically, we compile $1.2M$ question-answering (QA) pairs (EEmoDB-QA) from $125k$ images via automated generation, alongside a $36k$ dataset (EEmoDB-Assess) curated from $25k$ images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The code is available at https://anonymous.4open.science/r/EEmoLogic.

[446] Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion

Chunming He, Rihan Zhang, Fengyang Xiao, Dingming Zhang, Zhiwen Cao, Sina Farsiu

Main category: cs.CV

TL;DR: CurriSeg: A dual-phase learning framework combining curriculum and anti-curriculum principles for robust segmentation in context-entangled scenarios like camouflaged object detection.

Details

Motivation: Biological learning progresses from easy to difficult tasks, gradually building perception and robustness. This principle is applied to Context-Entangled Content Segmentation (CECS), where objects share visual patterns with surroundings (e.g., camouflaged objects). Conventional methods focus on architectural improvements but ignore learning dynamics for robustness under entangled data distributions.

Method: CurriSeg uses a dual-phase framework: 1) Curriculum Selection phase dynamically selects training data based on temporal loss statistics to distinguish hard-but-informative samples from noisy ones, enabling stable capability enhancement. 2) Anti-Curriculum Promotion phase uses Spectral-Blindness Fine-Tuning to suppress high-frequency components, enforcing dependence on low-frequency structural and contextual cues to strengthen generalization.

Result: Extensive experiments show CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time.

Conclusion: CurriSeg offers a principled view of how progression and challenge interplay to foster robust and context-aware segmentation, demonstrating that learning dynamics (not just architecture) are crucial for handling entangled visual patterns.

Abstract: Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.

[447] EMFormer: Efficient Multi-Scale Transformer for Accumulative Context Weather Forecasting

Hao Chen, Tao Han, Jie Zhang, Song Guo, Fenghua Ling, Lei Bai

Main category: cs.CV

TL;DR: EMFormer: Efficient Multi-scale Transformer for long-term weather forecasting with novel pretraining-finetuning pipeline addressing catastrophic forgetting and error accumulation

Details

Motivation: Address limitations in long-term weather forecasting including catastrophic forgetting, error accumulation, and high training overhead in existing approaches that use finetuning to extend prediction horizons

Method: Three-part approach: 1) Efficient Multi-scale Transformer (EMFormer) with single convolution for multi-scale feature extraction, 2) Accumulative context finetuning for temporal consistency, 3) Composite loss with sinusoidal weighting for dynamic balance during optimization

Result: Achieves strong performance in weather forecasting and extreme event prediction with substantial long-term forecast accuracy improvements; demonstrates strong generalization on vision benchmarks (ImageNet-1K, ADE20K) with 5.69x speedup over conventional multi-scale modules

Conclusion: Proposed pipeline effectively enhances long-context modeling while reducing computational overhead, addressing key limitations in long-term weather forecasting systems

Abstract: Long-term weather forecasting is critical for socioeconomic planning and disaster preparedness. While recent approaches employ finetuning to extend prediction horizons, they remain constrained by the issues of catastrophic forgetting, error accumulation, and high training overhead. To address these limitations, we present a novel pipeline across pretraining, finetuning and forecasting to enhance long-context modeling while reducing computational overhead. First, we introduce an Efficient Multi-scale Transformer (EMFormer) to extract multi-scale features through a single convolution in both training and inference. Based on the new architecture, we further employ an accumulative context finetuning to improve temporal consistency without degrading short-term accuracy. Additionally, we propose a composite loss that dynamically balances different terms via a sinusoidal weighting, thereby adaptively guiding the optimization trajectory throughout pretraining and finetuning. Experiments show that our approach achieves strong performance in weather forecasting and extreme event prediction, substantially improving long-term forecast accuracy. Moreover, EMFormer demonstrates strong generalization on vision benchmarks (ImageNet-1K and ADE20K) while delivering a 5.69x speedup over conventional multi-scale modules.

[448] Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

Haoran Lai, Zihang Jiang, Kun Zhang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Shaohua Kevin Zhou

Main category: cs.CV

TL;DR: Med3D-R1: A reinforcement learning framework for 3D medical vision-language models with residual alignment and abnormality re-weighting to improve clinical reasoning and diagnostic accuracy on volumetric medical imaging.

Details

Motivation: Addressing challenges in 3D medical vision-language models including complexity of volumetric imaging, overfitting to superficial report patterns, and lack of interpretability-aware reward designs for robust clinical reasoning.

Method: Two-stage RL framework: 1) SFT with residual alignment to bridge 3D features and text embeddings, and abnormality re-weighting to emphasize clinical tokens; 2) RL with redesigned consistency reward to promote step-by-step diagnostic reasoning.

Result: Achieved state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT benchmarks for medical multiple-choice visual question answering, outperforming prior methods.

Conclusion: The approach enhances real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems with improved abnormality diagnosis and clinical reasoning.

Abstract: Developing 3D vision-language models with robust clinical reasoning remains a challenge due to the inherent complexity of volumetric medical imaging, the tendency of models to overfit superficial report patterns, and the lack of interpretability-aware reward designs. In this paper, we propose Med3D-R1, a reinforcement learning framework with a two-stage training process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). During SFT stage, we introduce a residual alignment mechanism to bridge the gap between high-dimensional 3D features and textual embeddings, and an abnormality re-weighting strategy to emphasize clinically informative tokens and reduce structural bias in reports. In RL stage, we redesign the consistency reward to explicitly promote coherent, step-by-step diagnostic reasoning. We evaluate our method on medical multiple-choice visual question answering using two 3D diagnostic benchmarks, CT-RATE and RAD-ChestCT, where our model attains state-of-the-art accuracies of 41.92% on CT-RATE and 44.99% on RAD-ChestCT. These results indicate improved abnormality diagnosis and clinical reasoning and outperform prior methods on both benchmarks. Overall, our approach holds promise for enhancing real-world diagnostic workflows by enabling more reliable and transparent 3D medical vision-language systems.

[449] Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, Qingming Huang

Main category: cs.CV

TL;DR: TRA framework enhances point-supervised temporal action localization by incorporating textual features from video descriptions using text refinement and multimodal alignment modules.

Details

Motivation: Current point-supervised temporal action localization methods only use visual features, missing valuable semantic information from text descriptions that could complement visual understanding.

Method: Proposes Text Refinement and Alignment (TRA) framework with two modules: Point-based Text Refinement (PTR) refines initial video descriptions using point annotations and pre-trained models, and Point-based Multimodal Alignment (PMA) projects features into unified semantic space with point-level multimodal contrastive learning.

Result: Extensive experiments on five benchmarks show favorable performance compared to state-of-the-art methods, with practical implementation on a single 24 GB RTX 3090 GPU.

Conclusion: Incorporating textual features through refinement and alignment significantly improves point-supervised temporal action localization, demonstrating the value of multimodal integration for video understanding tasks.

Abstract: Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple pre-trained models. PMA then projects all features into a unified semantic space and leverages a point-level multimodal feature contrastive learning to reduce the gap between visual and linguistic modalities. Last, the enhanced multi-modal features are fed into the action detector for precise localization. Extensive experimental results on five widely used benchmarks demonstrate the favorable performance of our proposed framework compared to several state-of-the-art methods. Moreover, our computational overhead analysis shows that the framework can run on a single 24 GB RTX 3090 GPU, indicating its practicality and scalability.

[450] OASIS-DC: Generalizable Depth Completion via Output-level Alignment of Sparse-Integrated Monocular Pseudo Depth

Jaehyeon Cho, Jhonghyun An

Main category: cs.CV

TL;DR: Monocular depth estimation foundation models produce relative depth, not metric depth. The paper proposes calibrating relative depth with sparse range measurements to create pseudo-metric depth priors, then refining with a network for accurate metric predictions from few labeled samples.

Details

Motivation: Current monocular foundation models for depth estimation output relative depth rather than metric depth, limiting their practical use in robotics and autonomous driving applications that require absolute scale measurements.

Method: 1) Leverage relative depth from foundation models which preserves global layout and boundaries. 2) Calibrate relative depth with sparse range measurements to create pseudo-metric depth priors. 3) Design a refinement network that follows the prior where reliable and deviates where necessary.

Result: The system enables accurate metric depth predictions from very few labeled samples, sustains stable scale and sharp edges across few-shot regimes, and works effectively even when curated validation data are unavailable.

Conclusion: Coupling foundation priors with sparse anchors provides a practical route to robust, deployment-ready depth completion under real-world label scarcity, bridging the gap between relative depth from foundation models and metric depth requirements for real applications.

Abstract: Recent monocular foundation models excel at zero-shot depth estimation, yet their outputs are inherently relative rather than metric, limiting direct use in robotics and autonomous driving. We leverage the fact that relative depth preserves global layout and boundaries: by calibrating it with sparse range measurements, we transform it into a pseudo metric depth prior. Building on this prior, we design a refinement network that follows the prior where reliable and deviates where necessary, enabling accurate metric predictions from very few labeled samples. The resulting system is particularly effective when curated validation data are unavailable, sustaining stable scale and sharp edges across few-shot regimes. These findings suggest that coupling foundation priors with sparse anchors is a practical route to robust, deployment-ready depth completion under real-world label scarcity.

[451] Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: Q-DiT4SR: A post-training quantization framework specifically designed for Diffusion Transformer-based real-world image super-resolution models, achieving state-of-the-art performance with significant model compression and computational reduction.

Details

Motivation: Diffusion Transformers (DiTs) show promise for real-world image super-resolution but suffer from heavy inference burden. Existing quantization methods are either designed for U-Net architectures or text-to-image tasks, and directly applying them to DiT-based super-resolution models causes severe degradation of local textures.

Method: Proposes Q-DiT4SR with two key components: 1) H-SVD (hierarchical SVD) that integrates global low-rank branch with local block-wise rank-1 branch under matched parameter budget, and 2) Variance-aware Spatio-Temporal Mixed Precision (VaSMP for cross-layer weight bit-width allocation based on rate-distortion theory, and VaTMP for intra-layer activation precision scheduling across diffusion timesteps via dynamic programming).

Result: Achieves state-of-the-art performance on multiple real-world datasets under both W4A6 and W4A4 settings. W4A4 quantization reduces model size by 5.8× and computational operations by over 60× while maintaining high-quality texture generation.

Conclusion: Q-DiT4SR is the first PTQ framework specifically tailored for DiT-based real-world image super-resolution, effectively addressing the inference burden while preserving local texture quality through novel hierarchical decomposition and mixed precision strategies.

Abstract: Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

[452] TF-Lane: Traffic Flow Module for Robust Lane Perception

Yihan Xie, Han Xia, Zhen Yang

Main category: cs.CV

TL;DR: TFM integrates real-time traffic flow information with lane perception algorithms to improve autonomous driving lane detection in challenging scenarios like occlusions and missing lanes.

Details

Motivation: Vision-based lane detection struggles in occluded or lane-missing scenarios, while HD map solutions have high costs and limited real-time performance. Traffic flow offers a cost-free, real-time information source to supplement lane perception.

Method: Proposes TrafficFlow-aware Lane perception Module (TFM) that extracts real-time traffic flow features and integrates them with existing lane perception algorithms, validated on open-source algorithms and datasets.

Result: TFM consistently improves performance across four mainstream models and two public datasets (Nuscenes and OpenLaneV2), achieving up to +4.1% mAP gain on Nuscenes dataset.

Conclusion: Traffic flow is an effective, cost-free information source that can significantly enhance lane perception in autonomous driving systems when integrated with existing algorithms.

Abstract: Autonomous driving systems require robust lane perception capabilities, yet existing vision-based detection methods suffer significant performance degradation when visual sensors provide insufficient cues, such as in occluded or lane-missing scenarios. While some approaches incorporate high-definition maps as supplementary information, these solutions face challenges of high subscription costs and limited real-time performance. To address these limitations, we explore an innovative information source: traffic flow, which offers real-time capabilities without additional costs. This paper proposes a TrafficFlow-aware Lane perception Module (TFM) that effectively extracts real-time traffic flow features and seamlessly integrates them with existing lane perception algorithms. This solution originated from real-world autonomous driving conditions and was subsequently validated on open-source algorithms and datasets. Extensive experiments on four mainstream models and two public datasets (Nuscenes and OpenLaneV2) using standard evaluation metrics show that TFM consistently improves performance, achieving up to +4.1% mAP gain on the Nuscenes dataset.

[453] DSFC-Net: A Dual-Encoder Spatial and Frequency Co-Awareness Network for Rural Road Extraction

Zhengbo Zhang, Yihe Tian, Wanke Xia, Lin Chen, Yue Sun, Kun Ding, Ying Wang, Bing Xu, Shiming Xiang

Main category: cs.CV

TL;DR: DSFC-Net: A dual-encoder framework combining CNN and Spatial-Frequency Hybrid Transformer for rural road extraction from remote sensing imagery, addressing challenges like vegetation occlusions and narrow roads through frequency-aware attention mechanisms.

Details

Motivation: Rural road extraction from high-resolution remote sensing imagery is challenging due to high intra-class variability, vegetation occlusions, narrow road widths, and existing methods being optimized for urban environments. These unique rural characteristics require specialized approaches.

Method: Proposes DSFC-Net with dual encoders: CNN branch for local boundaries and short-range continuity, and Spatial-Frequency Hybrid Transformer (SFT) with Cross-Frequency Interaction Attention (CFIA) module using Laplacian Pyramid to decouple high/low-frequency information. Also includes Channel Feature Fusion Module (CFFM) to integrate features from both branches.

Result: Comprehensive experiments on WHU-RuR+, DeepGlobe, and Massachusetts datasets show DSFC-Net’s superiority over state-of-the-art approaches for rural road extraction.

Conclusion: DSFC-Net effectively addresses rural road extraction challenges by fusing spatial and frequency-domain information, with the SFT’s frequency-aware attention mechanism preserving connectivity of narrow roads against occlusions.

Abstract: Accurate extraction of rural roads from high-resolution remote sensing imagery is essential for infrastructure planning and sustainable development. However, this task presents unique challenges in rural settings due to several factors. These include high intra-class variability and low inter-class separability from diverse surface materials, frequent vegetation occlusions that disrupt spatial continuity, and narrow road widths that exacerbate detection difficulties. Existing methods, primarily optimized for structured urban environments, often underperform in these scenarios as they overlook such distinctive characteristics. To address these challenges, we propose DSFC-Net, a dual-encoder framework that synergistically fuses spatial and frequency-domain information. Specifically, a CNN branch is employed to capture fine-grained local road boundaries and short-range continuity, while a novel Spatial-Frequency Hybrid Transformer (SFT) is introduced to robustly model global topological dependencies against vegetation occlusions. Distinct from standard attention mechanisms that suffer from frequency bias, the SFT incorporates a Cross-Frequency Interaction Attention (CFIA) module that explicitly decouples high- and low-frequency information via a Laplacian Pyramid strategy. This design enables the dynamic interaction between spatial details and frequency-aware global contexts, effectively preserving the connectivity of narrow roads. Furthermore, a Channel Feature Fusion Module (CFFM) is proposed to bridge the two branches by adaptively recalibrating channel-wise feature responses, seamlessly integrating local textures with global semantics for accurate segmentation. Comprehensive experiments on the WHU-RuR+, DeepGlobe, and Massachusetts datasets validate the superiority of DSFC-Net over state-of-the-art approaches.

[454] Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang, Chengyu Xie, Linxia Zhu, Yonghui Yang, Weixiang Zhao, Zifeng Cheng, Cong Wang, Fei Shen, Tat-Seng Chua

Main category: cs.CV

TL;DR: LLMs contain cross-lingual shared safety neurons (SS-Neurons) that regulate safety behavior across languages; targeting these neurons improves multilingual safety alignment, especially for non-high-resource languages.

Details

Motivation: Multilingual safety is imbalanced, with non-high-resource languages being more vulnerable than high-resource ones, and the neural mechanisms behind safety alignment remain unclear despite observed cross-lingual representation transfer.

Method: Identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal; identify SS-Neurons as the subset shared between high-resource and non-high-resource languages; propose neuron-oriented training targeting SS-Neurons based on language resource distribution and model architecture.

Result: Suppressing SS-Neurons causes concurrent safety drops across non-high-resource languages, while reinforcing them improves cross-lingual defensive consistency; fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing non-high-resource language safety while maintaining general capabilities.

Conclusion: SS-Neurons are a critical neuronal subset that jointly regulates safety behavior across languages, and targeting them through neuron-oriented training effectively addresses multilingual safety imbalance while preserving model capabilities.

Abstract: Multilingual safety remains significantly imbalanced, leaving non-high-resource (NHR) languages vulnerable compared to robust high-resource (HR) ones. Moreover, the neural mechanisms driving safety alignment remain unclear despite observed cross-lingual representation transfer. In this paper, we find that LLMs contain a set of cross-lingual shared safety neurons (SS-Neurons), a remarkably small yet critical neuronal subset that jointly regulates safety behavior across languages. We first identify monolingual safety neurons (MS-Neurons) and validate their causal role in safety refusal behavior through targeted activation and suppression. Our cross-lingual analyses then identify SS-Neurons as the subset of MS-Neurons shared between HR and NHR languages, serving as a bridge to transfer safety capabilities from HR to NHR domains. We observe that suppressing these neurons causes concurrent safety drops across NHR languages, whereas reinforcing them improves cross-lingual defensive consistency. Building on these insights, we propose a simple neuron-oriented training strategy that targets SS-Neurons based on language resource distribution and model architecture. Experiments demonstrate that fine-tuning this tiny neuronal subset outperforms state-of-the-art methods, significantly enhancing NHR safety while maintaining the model’s general capabilities. The code and dataset will be available athttps://github.com/1518630367/SS-Neuron-Expansion.

[455] Interacted Planes Reveal 3D Line Mapping

Zeran Ke, Bin Tan, Gui-Song Xia, Yujun Shen, Nan Xue

Main category: cs.CV

TL;DR: LiP-Map is a joint optimization framework for 3D line mapping that explicitly models line and planar primitives together, using planar topology to improve line reconstruction accuracy and efficiency.

Details

Motivation: Current 3D line mapping methods often treat lines in isolation without considering their natural physical context as edges of planar surfaces. The authors aim to develop a more principled approach that leverages the topological relationship between lines and planes for better structured reconstruction in man-made environments.

Method: LiP-Map introduces a line-plane joint optimization framework that explicitly models learnable line and planar primitives together. Instead of imposing pairwise coplanarity constraints, it constructs explicit interactions between plane and line primitives, integrating planar topology into 3D line mapping. The method operates efficiently, typically completing reconstructions in 3-5 minutes per scene.

Result: On over 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks&Temple datasets, LiP-Map improves both accuracy and completeness over state-of-the-art methods. It also significantly advances line-assisted visual localization, establishing strong performance on 7Scenes.

Conclusion: LiP-Map pioneers the integration of planar topology into 3D line mapping through explicit line-plane joint optimization, offering a principled route toward structured reconstruction in man-made environments while maintaining strong efficiency.

Abstract: 3D line mapping from multi-view RGB images provides a compact and structured visual representation of scenes. We study the problem from a physical and topological perspective: a 3D line most naturally emerges as the edge of a finite 3D planar patch. We present LiP-Map, a line-plane joint optimization framework that explicitly models learnable line and planar primitives. This coupling enables accurate and detailed 3D line mapping while maintaining strong efficiency (typically completing a reconstruction in 3 to 5 minutes per scene). LiP-Map pioneers the integration of planar topology into 3D line mapping, not by imposing pairwise coplanarity constraints but by explicitly constructing interactions between plane and line primitives, thus offering a principled route toward structured reconstruction in man-made environments. On more than 100 scenes from ScanNetV2, ScanNet++, Hypersim, 7Scenes, and Tanks&Temple, LiP-Map improves both accuracy and completeness over state-of-the-art methods. Beyond line mapping quality, LiP-Map significantly advances line-assisted visual localization, establishing strong performance on 7Scenes. Our code is released at https://github.com/calmke/LiPMAP for reproducible research.

[456] Interaction-Consistent Object Removal via MLLM-Based Reasoning

Ching-Kai Huang, Wen-Chieh Lin, Yan-Cen Lee

Main category: cs.CV

TL;DR: REORM is a reasoning-enhanced object removal framework that uses MLLMs to identify and remove not just target objects but also associated interaction elements for semantically consistent image editing.

Details

Motivation: Current object removal methods often only erase the named target object, leaving behind interaction evidence that makes results semantically inconsistent. The paper formalizes the problem of Interaction-Consistent Object Removal (ICOR), which requires removing both target objects and associated interaction elements.

Method: Proposes REORM, a reasoning-enhanced object removal framework leveraging multimodal large language models to infer which elements must be jointly removed. Features modular design with MLLM-driven analysis, mask-guided removal, self-correction mechanism, and local-deployment variant for resource-limited scenarios.

Result: Introduces ICOREval benchmark for evaluation. REORM outperforms state-of-the-art image editing systems on ICOREval, demonstrating effectiveness in producing interaction-consistent results.

Conclusion: REORM successfully addresses the ICOR problem by using MLLM reasoning to identify and remove interaction elements along with target objects, producing more semantically consistent image editing results.

Abstract: Image-based object removal often erases only the named target, leaving behind interaction evidence that renders the result semantically inconsistent. We formalize this problem as Interaction-Consistent Object Removal (ICOR), which requires removing not only the target object but also associated interaction elements, such as lighting-dependent effects, physically connected objects, targetproduced elements, and contextually linked objects. To address this task, we propose Reasoning-Enhanced Object Removal with MLLM (REORM), a reasoningenhanced object removal framework that leverages multimodal large language models to infer which elements must be jointly removed. REORM features a modular design that integrates MLLM-driven analysis, mask-guided removal, and a self-correction mechanism, along with a local-deployment variant that supports accurate editing under limited resources. To support evaluation, we introduce ICOREval, a benchmark consisting of instruction-driven removals with rich interaction dependencies. On ICOREval, REORM outperforms state-of-the-art image editing systems, demonstrating its effectiveness in producing interactionconsistent results.

[457] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

Ayushman Sarkar, Zhenyu Yu, Chu Chen, Wei Tang, Kangning Cui, Mohd Yamani Idna Idris

Main category: cs.CV

TL;DR: ReDiStory improves multi-frame visual story generation by reorganizing prompt embeddings to separate identity and frame-specific components, reducing cross-frame interference without training.

Details

Motivation: Existing training-free methods for visual story generation concatenate identity and frame prompts, which causes inter-frame semantic interference that weakens identity preservation in complex stories.

Method: ReDiStory is a training-free framework that decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames to reduce cross-frame interference.

Result: Experiments on ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics while maintaining prompt fidelity, using identical diffusion backbones and inference settings.

Conclusion: ReDiStory improves identity consistency in multi-frame story generation through inference-time prompt embedding reorganization without modifying diffusion parameters or requiring additional supervision.

Abstract: Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory

[458] StoryState: Agent-Based State Control for Consistent and Editable Storybooks

Ayushman Sarkar, Zhenyu Yu, Wei Tang, Chu Chen, Kangning Cui, Mohd Yamani Idna Idris

Main category: cs.CV

TL;DR: StoryState introduces an agent-based orchestration layer with explicit, editable story state for multimodal storybook generation, improving consistency and enabling localized edits.

Details

Motivation: Current multimodal storybook generation lacks explicit story state representation, making edits coarse-grained and breaking visual consistency across pages.

Method: Uses LLM agents to maintain structured story state (character sheet, global settings, per-page constraints) and generate prompts for training-free text-to-image generation.

Result: Enables localized page edits, improves cross-page consistency, reduces unintended changes and editing time compared to baseline, approaching one-shot consistency of Gemini Storybook.

Conclusion: StoryState provides model-agnostic, prompt-based approach for consistent multimodal story generation with editable state representation.

Abstract: Large multimodal models have enabled one-click storybook generation, where users provide a short description and receive a multi-page illustrated story. However, the underlying story state, such as characters, world settings, and page-level objects, remains implicit, making edits coarse-grained and often breaking visual consistency. We present StoryState, an agent-based orchestration layer that introduces an explicit and editable story state on top of training-free text-to-image generation. StoryState represents each story as a structured object composed of a character sheet, global settings, and per-page scene constraints, and employs a small set of LLM agents to maintain this state and derive 1Prompt1Story-style prompts for generation and editing. Operating purely through prompts, StoryState is model-agnostic and compatible with diverse generation backends. System-level experiments on multi-page editing tasks show that StoryState enables localized page edits, improves cross-page consistency, and reduces unintended changes, interaction turns, and editing time compared to 1Prompt1Story, while approaching the one-shot consistency of Gemini Storybook. Code is available at https://github.com/YuZhenyuLindy/StoryState

[459] DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling

Ayushman Sarkar, Zhenyu Yu, Mohd Yamani Idna Idris

Main category: cs.CV

TL;DR: DeCorStory is a training-free inference-time framework that reduces inter-frame semantic interference in text-to-image storytelling by orthogonalizing prompt embeddings and strengthening identity preservation.

Details

Motivation: Existing training-free methods for text-to-image storytelling suffer from embedding correlation issues that cause color leakage, background blending, and identity drift across frames, requiring a solution that maintains visual and semantic consistency without model modifications.

Method: DeCorStory uses Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, singular value reweighting to strengthen prompt-specific information, and identity-preserving cross-attention to stabilize character identity during diffusion inference.

Result: The method achieves consistent improvements in prompt-image alignment, identity consistency, and visual diversity, reaching state-of-the-art performance among training-free baselines without requiring model modifications or fine-tuning.

Conclusion: DeCorStory provides an effective training-free solution for maintaining visual and semantic consistency in text-to-image storytelling by reducing inter-frame semantic interference through embedding decorrelation and identity preservation techniques.

Abstract: Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory

[460] FlowCast: Trajectory Forecasting for Scalable Zero-Cost Speculative Flow Matching

Divya Jyoti Bajpai, Shubham Agarwal, Apoorv Saxena, Kuldeep Kulkarni, Subrata Mitra, Manjesh Kumar Hanawal

Main category: cs.CV

TL;DR: FlowCast is a training-free speculative generation framework that accelerates Flow Matching models by exploiting their constant-velocity property to skip redundant denoising steps without quality degradation.

Details

Motivation: Flow Matching models produce high-quality visual generation but suffer from slow inference due to many denoising steps, limiting real-time applications. Existing acceleration methods degrade quality, require costly retraining, or lack generalization.

Method: FlowCast speculates future velocity by extrapolating current velocity without additional time cost, accepting it if within a mean-squared error threshold. This allows skipping redundant steps in stable regions while maintaining precision in complex ones. It’s plug-and-play, requires no auxiliary networks, and works with any FM model.

Result: Empirical evaluations show FlowCast achieves >2.5× speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss compared to standard full generation.

Conclusion: FlowCast provides an effective training-free acceleration framework for Flow Matching models that maintains quality while significantly improving inference speed, making them more practical for real-time applications.

Abstract: Flow Matching (FM) has recently emerged as a powerful approach for high-quality visual generation. However, their prohibitively slow inference due to a large number of denoising steps limits their potential use in real-time or interactive applications. Existing acceleration methods, like distillation, truncation, or consistency training, either degrade quality, incur costly retraining, or lack generalization. We propose FlowCast, a training-free speculative generation framework that accelerates inference by exploiting the fact that FM models are trained to preserve constant velocity. FlowCast speculates future velocity by extrapolating current velocity without incurring additional time cost, and accepts it if it is within a mean-squared error threshold. This constant-velocity forecasting allows redundant steps in stable regions to be aggressively skipped while retaining precision in complex ones. FlowCast is a plug-and-play framework that integrates seamlessly with any FM model and requires no auxiliary networks. We also present a theoretical analysis and bound the worst-case deviation between speculative and full FM trajectories. Empirical evaluations demonstrate that FlowCast achieves $>2.5\times$ speedup in image generation, video generation, and editing tasks, outperforming existing baselines with no quality loss as compared to standard full generation.

[461] What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu

Main category: cs.CV

TL;DR: MED framework analyzes vision tool-use RL, finding improvements mainly from intrinsic learning rather than tool mastery; tools reduce harm but don’t significantly correct intrinsic failures.

Details

Motivation: To understand whether performance gains in vision tool-use RL come from improved tool use or evolving intrinsic capabilities, and to disentangle these effects.

Method: MED (Measure-Explain-Diagnose) framework: coarse-to-fine analysis that separates intrinsic capability changes from tool-induced effects, decomposes tool-induced performance into gain/harm terms, and probes driving mechanisms.

Result: Across two VLMs with different tool priors and six benchmarks, improvements are dominated by intrinsic learning; tool-use RL mainly reduces tool-induced harm (fewer call-induced errors, weaker tool schema interference) with limited progress in tool-based correction of intrinsic failures.

Conclusion: Current vision tool-use RL learns to coexist safely with tools rather than master them, suggesting tools serve more as safety mechanisms than capability enhancers.

Abstract: Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities.We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.

[462] Beyond Pixels: Visual Metaphor Transfer via Schema-Driven Agentic Reasoning

Yu Xu, Yuxin Zhang, Juan Cao, Lin Gao, Chunyu Wang, Oliver Deussen, Tong-Yee Lee, Fan Tang

Main category: cs.CV

TL;DR: A multi-agent framework for Visual Metaphor Transfer that captures abstract creative logic from reference images and transfers it to target subjects using Conceptual Blending Theory and Schema Grammar.

Details

Motivation: Existing generative AI models focus on pixel-level alignment and surface appearance but fail to capture the abstract logic needed for genuine metaphorical generation, limiting their creative capabilities.

Method: Cognitive-inspired multi-agent framework operationalizing Conceptual Blending Theory through Schema Grammar, with specialized agents for perception, transfer, generation, and hierarchical diagnosis with closed-loop backtracking.

Result: Significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, as demonstrated through extensive experiments and human evaluations.

Conclusion: The approach enables automated high-impact creative applications in advertising and media by bridging the gap between surface-level generation and abstract creative logic transfer.

Abstract: A visual metaphor constitutes a high-order form of human creativity, employing cross-domain semantic fusion to transform abstract concepts into impactful visual rhetoric. Despite the remarkable progress of generative AI, existing models remain largely confined to pixel-level instruction alignment and surface-level appearance preservation, failing to capture the underlying abstract logic necessary for genuine metaphorical generation. To bridge this gap, we introduce the task of Visual Metaphor Transfer (VMT), which challenges models to autonomously decouple the “creative essence” from a reference image and re-materialize that abstract logic onto a user-specified target subject. We propose a cognitive-inspired, multi-agent framework that operationalizes Conceptual Blending Theory (CBT) through a novel Schema Grammar (“G”). This structured representation decouples relational invariants from specific visual entities, providing a rigorous foundation for cross-domain logic re-instantiation. Our pipeline executes VMT through a collaborative system of specialized agents: a perception agent that distills the reference into a schema, a transfer agent that maintains generic space invariance to discover apt carriers, a generation agent for high-fidelity synthesis and a hierarchical diagnostic agent that mimics a professional critic, performing closed-loop backtracking to identify and rectify errors across abstract logic, component selection, and prompt encoding. Extensive experiments and human evaluations demonstrate that our method significantly outperforms SOTA baselines in metaphor consistency, analogy appropriateness, and visual creativity, paving the way for automated high-impact creative applications in advertising and media. Source code will be made publicly available.

[463] MTC-VAE: Multi-Level Temporal Compression with Content Awareness

Yubo Dong, Linchao Zhu

Main category: cs.CV

TL;DR: A technique to convert fixed compression rate VAEs into models supporting multi-level temporal compression for video diffusion models, with minimal fine-tuning to maintain performance at higher compression rates.

Details

Motivation: Latent Video Diffusion Models (LVDMs) use VAEs for video compression, but higher compression rates cause efficiency decline when adding extra sampling layers without expanding hidden channels. There's a need for flexible temporal compression without performance degradation.

Method: Propose a technique to convert fixed compression rate VAEs into models supporting multi-level temporal compression. Use minimal fine-tuning approach to counteract performance decline at elevated compression rates. Investigate integration with diffusion-based generative models (DiT).

Result: Examine how varying compression levels impact model performance across diverse video segments. Provide empirical evidence on effectiveness. Demonstrate successful concurrent training and compatibility with diffusion frameworks.

Conclusion: The approach enables multi-level temporal compression for video VAEs with minimal fine-tuning, maintaining performance at higher compression rates and showing compatibility with diffusion-based generative models.

Abstract: Latent Video Diffusion Models (LVDMs) rely on Variational Autoencoders (VAEs) to compress videos into compact latent representations. For continuous Variational Autoencoders (VAEs), achieving higher compression rates is desirable; yet, the efficiency notably declines when extra sampling layers are added without expanding the dimensions of hidden channels. In this paper, we present a technique to convert fixed compression rate VAEs into models that support multi-level temporal compression, providing a straightforward and minimal fine-tuning approach to counteract performance decline at elevated compression rates.Moreover, we examine how varying compression levels impact model performance over video segments with diverse characteristics, offering empirical evidence on the effectiveness of our proposed approach. We also investigate the integration of our multi-level temporal compression VAE with diffusion-based generative models, DiT, highlighting successful concurrent training and compatibility within these frameworks. This investigation illustrates the potential uses of multi-level temporal compression.

[464] Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis

Yu Zhang, Jingyi Liu, Feng Liu, Duoqian Miao, Qi Zhang, Kexue Fu, Changwei Wang, Longbing Cao

Main category: cs.CV

TL;DR: NOVA is a training-free token reduction framework for Visual AutoRegressive models that uses entropy analysis to adaptively prune low-entropy tokens and accelerate inference while maintaining quality.

Details

Motivation: VAR models suffer from high computational costs due to massive token counts. Existing token reduction methods have limitations: heuristic stage partitioning, non-adaptive schedules, and limited acceleration scope, leaving significant acceleration potential untapped.

Method: Uses entropy analysis to capture modeling dynamics evolution. Adaptively determines acceleration activation scale by identifying inflection points of scale entropy growth. Dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing cache from prior scale residuals.

Result: Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework for VAR models.

Conclusion: NOVA provides an effective training-free acceleration solution for VAR models by leveraging entropy analysis to adaptively reduce tokens while maintaining generation quality.

Abstract: Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.

[465] T2M Mamba: Motion Periodicity-Saliency Coupling Approach for Stable Text-Driven Motion Generation

Xingzu Zhan, Chen Xie, Honghang Chen, Yixun Lin, Xiaochun Mai

Main category: cs.CV

TL;DR: T2M Mamba: A text-to-motion generation model that addresses periodicity-saliency coupling and semantic robustness using Mamba architecture with novel periodicity estimation and cross-modal alignment techniques.

Details

Motivation: Existing text-to-motion models suffer from two limitations: (1) treating motion periodicity and keyframe saliency as independent factors, causing generation drift in long sequences, and (2) fragility to semantically equivalent paraphrases where minor synonym substitutions distort textual embeddings and produce unstable motions.

Method: Proposes T2M Mamba with two key components: (1) Periodicity-Saliency Aware Mamba using novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation, and (2) Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings.

Result: Extensive experiments on HumanML3D and KIT-ML datasets show effectiveness, achieving an FID of 0.068 and consistent gains on all other metrics compared to existing methods.

Conclusion: T2M Mamba successfully addresses the coupling of periodicity and saliency in motion generation while improving robustness to semantic paraphrases, advancing the state-of-the-art in text-to-motion generation.

Abstract: Text-to-motion generation, which converts motion language descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields, such as avatar animation and humanoid robotic interaction. Though existing models have achieved significant fidelity, they still suffer from two core limitations: (i) They treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences. (ii) They are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagating through the decoder and producing unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) proposing Periodicity-Saliency Aware Mamba, which utilizes novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) to enhance robust alignment of textual and motion embeddings. Extensive experiments on HumanML3D and KIT-ML datasets have been conducted, confirming the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.

[466] Exposing and Defending the Achilles’ Heel of Video Mixture-of-Experts

Songping Wang, Qinglong Liu, Yueming Lyu, Ning Li, Ziwen He, Caifeng Shan

Main category: cs.CV

TL;DR: TLGA framework investigates adversarial vulnerabilities in video MoE models, proposing attacks on routers and experts, and defense via joint adversarial training.

Details

Motivation: Mixture-of-Experts (MoE) shows strong video understanding performance but its adversarial robustness is underexplored. Existing attacks treat MoE as unified, overlooking independent vulnerabilities of routers and experts.

Method: Propose Temporal Lipschitz-Guided Attacks (TLGA) to investigate component-level vulnerabilities. First design router attacks, then Joint TLGA (J-TLGA) that collaboratively perturbs routers and experts. Also propose Joint Temporal Lipschitz Adversarial Training (J-TLAT) for defense.

Result: Joint attacks significantly amplify adversarial effects, exposing collaborative weaknesses. J-TLAT enhances component-wise robustness, reduces inference cost by >60% compared to dense models, and improves robustness across diverse datasets/architectures.

Conclusion: The framework effectively investigates and mitigates both independent and collaborative weaknesses in video MoE models through component-level attacks and joint adversarial training.

Abstract: Mixture-of-Experts (MoE) has demonstrated strong performance in video understanding tasks, yet its adversarial robustness remains underexplored. Existing attack methods often treat MoE as a unified architecture, overlooking the independent and collaborative weaknesses of key components such as routers and expert modules. To fill this gap, we propose Temporal Lipschitz-Guided Attacks (TLGA) to thoroughly investigate component-level vulnerabilities in video MoE models. We first design attacks on the router, revealing its independent weaknesses. Building on this, we introduce Joint Temporal Lipschitz-Guided Attacks (J-TLGA), which collaboratively perturb both routers and experts. This joint attack significantly amplifies adversarial effects and exposes the Achilles’ Heel (collaborative weaknesses) of the MoE architecture. Based on these insights, we further propose Joint Temporal Lipschitz Adversarial Training (J-TLAT). J-TLAT performs joint training to further defend against collaborative weaknesses, enhancing component-wise robustness. Our framework is plug-and-play and reduces inference cost by more than 60% compared with dense models. It consistently enhances adversarial robustness across diverse datasets and architectures, effectively mitigating both the independent and collaborative weaknesses of MoE.

[467] PolyGen: Fully Synthetic Vision-Language Training via Multi-Generator Ensembles

Leonardo Brusini, Cristian Sbrolli, Eugenio Lomurno, Toshihiko Yamasaki, Matteo Matteucci

Main category: cs.CV

TL;DR: PolyGen framework improves synthetic vision-language data by using multiple distinct generators instead of scaling up a single one, achieving better feature diversity and performance.

Details

Motivation: Current synthetic data methods for vision-language pre-training rely on scaling up single generative backbones, which introduces generator-specific biases and limits feature diversity. There's a need for better synthetic data construction that prioritizes manifold coverage over simple dataset size.

Method: PolyGen employs a Polylithic approach that trains on the intersection of architecturally distinct generators to marginalize out model-specific artifacts. It also introduces a Programmatic Hard Negative curriculum to enforce fine-grained syntactic understanding. The framework reallocates data budget from unique captions to multi-source variations.

Result: PolyGen outperforms the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and achieves +9.1% improvement on the SugarCrepe++ compositionality benchmark.

Conclusion: Structural diversity through multiple distinct generators is a more data-efficient scaling law than simply increasing the volume of single-source samples for synthetic vision-language data.

Abstract: Synthetic data offers a scalable solution for vision-language pre-training, yet current state-of-the-art methods typically rely on scaling up a single generative backbone, which introduces generator-specific spectral biases and limits feature diversity. In this work, we introduce PolyGen, a framework that redefines synthetic data construction by prioritizing manifold coverage and compositional rigor over simple dataset size. PolyGen employs a Polylithic approach to train on the intersection of architecturally distinct generators, effectively marginalizing out model-specific artifacts. Additionally, we introduce a Programmatic Hard Negative curriculum that enforces fine-grained syntactic understanding. By structurally reallocating the same data budget from unique captions to multi-source variations, PolyGen achieves a more robust feature space, outperforming the leading single-source baseline (SynthCLIP) by +19.0% on aggregate multi-task benchmarks and on the SugarCrepe++ compositionality benchmark (+9.1%). These results demonstrate that structural diversity is a more data-efficient scaling law than simply increasing the volume of single-source samples.

[468] PromptRL: Prompt Matters in RL for Flow-Based Image Generation

Fu-Yun Wang, Han Zhang, Michael Gharbi, Hongsheng Li, Taesung Park

Main category: cs.CV

TL;DR: PromptRL improves RL for flow matching models by using language models as prompt refinement agents to address sample inefficiency and prompt overfitting in text-to-image generation.

Details

Motivation: Current RL pipelines for flow matching models suffer from two key limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting where models memorize specific training formulations and collapse when evaluated on semantically equivalent but stylistically varied prompts.

Method: PromptRL incorporates language models as trainable prompt refinement agents directly within the flow-based RL optimization loop. This creates a synergistic training regime that reshapes optimization dynamics while developing sophisticated prompt rewriting capabilities.

Result: Achieves state-of-the-art performance across multiple benchmarks: 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Also improves EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (1.37) and achieving comparable performance with ReasonNet (1.44).

Conclusion: PromptRL consistently achieves higher performance ceilings while requiring over 2× fewer rollouts compared to naive flow-only RL, demonstrating the effectiveness of integrating language models as prompt refinement agents in RL optimization for flow-based image generation.

Abstract: Flow matching models (FMs) have revolutionized text-to-image (T2I) generation, with reinforcement learning (RL) serving as a critical post-training strategy for alignment with reward objectives. In this research, we show that current RL pipelines for FMs suffer from two underappreciated yet important limitations: sample inefficiency due to insufficient generation diversity, and pronounced prompt overfitting, where models memorize specific training formulations and exhibit dramatic performance collapse when evaluated on semantically equivalent but stylistically varied prompts. We present PromptRL (Prompt Matters in RL for Flow-Based Image Generation), a framework that incorporates language models (LMs) as trainable prompt refinement agents directly within the flow-based RL optimization loop. This design yields two complementary benefits: rapid development of sophisticated prompt rewriting capabilities and, critically, a synergistic training regime that reshapes the optimization dynamics. PromptRL achieves state-of-the-art performance across multiple benchmarks, obtaining scores of 0.97 on GenEval, 0.98 on OCR accuracy, and 24.05 on PickScore. Furthermore, we validate the effectiveness of our RL approach on large-scale image editing models, improving the EditReward of FLUX.1-Kontext from 1.19 to 1.43 with only 0.06 million rollouts, surpassing Gemini 2.5 Flash Image (also known as Nano Banana), which scores 1.37, and achieving comparable performance with ReasonNet (1.44), which relied on fine-grained data annotations along with a complex multi-stage training. Our extensive experiments empirically demonstrate that PromptRL consistently achieves higher performance ceilings while requiring over 2$\times$ fewer rollouts compared to naive flow-only RL. Our code is available at https://github.com/G-U-N/UniRL.

[469] Stronger Semantic Encoders Can Harm Relighting Performance: Probing Visual Priors via Augmented Latent Intrinsics

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

Main category: cs.CV

TL;DR: ALI introduces a method that fuses semantic and photometric features for better image-to-image relighting, especially on challenging materials like metal and glass.

Details

Motivation: Current image-to-image relighting methods using latent intrinsic representations struggle with challenging materials like metal and glass. The paper identifies a fundamental trade-off between semantic abstraction and photometric fidelity in existing approaches.

Method: ALI (Augmented Latent Intrinsics) balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, along with a self-supervised refinement strategy to address data scarcity.

Result: ALI achieves strong improvements in relighting quality, with the largest gains on complex, specular materials, outperforming previous methods that rely solely on semantic encoders.

Conclusion: The paper demonstrates that balancing semantic and photometric features is crucial for high-quality image relighting, especially for challenging materials, and introduces an effective framework to achieve this balance.

Abstract: Image-to-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations but remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context and dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy to mitigate the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs and paired with a dense, pixel-aligned visual prior, ALI achieves strong improvements in relighting, with the largest gains on complex, specular materials. Project page: https:\augmented-latent-intrinsics.github.io

[470] Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

Main category: cs.CV

TL;DR: PaPE is a parabola-based position encoding method for vision modalities in attention architectures that addresses vision-specific characteristics like translation/rotation invariance, distance decay, directionality, and context awareness.

Details

Motivation: Existing position encodings for vision modalities largely extend 1D-sequence encodings from language to nD structures without fully accounting for vision-specific characteristics like translation invariance, rotation invariance, distance decay, directionality, and context awareness.

Method: Designs Parabolic Position Encoding (PaPE) based on principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Evaluates on 8 datasets spanning 4 vision modalities (images, point clouds, videos, event camera streams).

Result: PaPE or PaPE-RI achieves top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show PaPE extrapolates remarkably well, improving by up to 10.5% over next-best position encoding in absolute terms.

Conclusion: PaPE effectively addresses vision-specific characteristics in position encoding and demonstrates superior performance across multiple vision modalities with strong extrapolation capabilities.

Abstract: We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens-such as images, point clouds, videos, or event camera streams-our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

[471] BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images

Soumyaroop Nandi, Prem Natarajan

Main category: cs.CV

TL;DR: BioTamperNet is a framework for detecting duplicated regions in tampered biomedical images using affinity-guided attention inspired by State Space Models.

Details

Motivation: Existing forensic models trained on natural images underperform on biomedical data where subtle manipulations can compromise experimental validity, creating a need for specialized biomedical image tampering detection.

Method: Introduces affinity-guided self-attention to capture intra-image similarities and affinity-guided cross-attention to model cross-image correspondences, integrating lightweight SSM-inspired linear attention mechanisms for efficient fine-grained localization.

Result: Extensive experiments on benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions.

Conclusion: BioTamperNet effectively addresses the unique challenges of biomedical image tampering detection through specialized attention mechanisms and achieves state-of-the-art performance.

Abstract: We propose BioTamperNet, a novel framework for detecting duplicated regions in tampered biomedical images, leveraging affinity-guided attention inspired by State Space Model (SSM) approximations. Existing forensic models, primarily trained on natural images, often underperform on biomedical data where subtle manipulations can compromise experimental validity. To address this, BioTamperNet introduces an affinity-guided self-attention module to capture intra-image similarities and an affinity-guided cross-attention module to model cross-image correspondences. Our design integrates lightweight SSM-inspired linear attention mechanisms to enable efficient, fine-grained localization. Trained end-to-end, BioTamperNet simultaneously identifies tampered regions and their source counterparts. Extensive experiments on the benchmark bio-forensic datasets demonstrate significant improvements over competitive baselines in accurately detecting duplicated regions. Code - https://github.com/SoumyaroopNandi/BioTamperNet

[472] Cross-Paradigm Evaluation of Gaze-Based Semantic Object Identification for Intelligent Vehicles

Penghao Deng, Jidong J. Yang, Jiachen Bian

Main category: cs.CV

TL;DR: Paper investigates gaze behavior analysis in driving using vision-based approaches including object detection, segmentation-assisted classification, and vision-language models to identify where drivers look at road scenes.

Details

Motivation: Understanding driver visual attention is critical for developing advanced driver-assistance systems and improving road safety. The paper aims to analyze gaze behavior by identifying what objects drivers look at in road scenes captured by vehicle cameras.

Method: Three vision-based approaches: 1) Direct object detection using YOLOv13, 2) Segmentation-assisted classification using SAM2 with EfficientNetV2 or YOLOv13, and 3) Query-based Vision-Language Models (Qwen2.5-VL-7b and Qwen2.5-VL-32b). The collocation of gaze points with object semantics is investigated.

Result: YOLOv13 and Qwen2.5-VL-32b significantly outperform other approaches with Macro F1-Scores over 0.84. Qwen2.5-VL-32b showed superior robustness for identifying small safety-critical objects like traffic lights, especially in nighttime conditions. Segmentation-assisted approach suffered from “part-versus-whole” semantic gap leading to poor recall.

Conclusion: There’s a fundamental trade-off between real-time efficiency of traditional detectors and richer contextual understanding/robustness of large VLMs. Findings provide practical guidance for designing human-aware intelligent driver monitoring systems.

Abstract: Understanding where drivers direct their visual attention during driving, as characterized by gaze behavior, is critical for developing next-generation advanced driver-assistance systems and improving road safety. This paper tackles this challenge as a semantic identification task from the road scenes captured by a vehicle’s front-view camera. Specifically, the collocation of gaze points with object semantics is investigated using three distinct vision-based approaches: direct object detection (YOLOv13), segmentation-assisted classification (SAM2 paired with EfficientNetV2 versus YOLOv13), and query-based Vision-Language Models, VLMs (Qwen2.5-VL-7b versus Qwen2.5-VL-32b). The results demonstrate that the direct object detection (YOLOv13) and Qwen2.5-VL-32b significantly outperform other approaches, achieving Macro F1-Scores over 0.84. The large VLM (Qwen2.5-VL-32b), in particular, exhibited superior robustness and performance for identifying small, safety-critical objects such as traffic lights, especially in adverse nighttime conditions. Conversely, the segmentation-assisted paradigm suffers from a “part-versus-whole” semantic gap that led to large failure in recall. The results reveal a fundamental trade-off between the real-time efficiency of traditional detectors and the richer contextual understanding and robustness offered by large VLMs. These findings provide critical insights and practical guidance for the design of future human-aware intelligent driver monitoring systems.

[473] Understanding vision transformer robustness through the lens of out-of-distribution detection

Joey Kuang, Alexander Wong

Main category: cs.CV

TL;DR: Vision transformers (DeiT, DeiT3, ViT) show varying quantization robustness, with large-scale pretraining (ImageNet-22k) harming 4-bit quantization performance in out-of-distribution detection compared to ImageNet-1k pretraining.

Details

Motivation: While vision transformers excel in vision tasks, making them accessible and real-time via quantization is challenging due to performance loss. Current research focuses on in-distribution behavior, but attention mechanisms in OOD scenarios may reveal quantization insights.

Method: Investigates behavior of quantized small-variant vision transformers (DeiT, DeiT3, ViT) on common OOD datasets. Compares ID vs OOD performance, analyzes quantization effects from 4-bit to full precision, and examines impact of pretraining scale (ImageNet-1k vs ImageNet-22k).

Result: 4-bit models show initial instabilities, especially DeiT3 trained on ImageNet-22k dropping 17% from quantization error. ViT shows reasonable ID quantization robustness. OOD detection reveals larger quantization deltas for ImageNet-22k pretrained models (15.0-19.2% AUPR-out delta) vs ImageNet-1k (9.5-12.0% delta).

Conclusion: Pretraining on large-scale datasets may hinder low-bit quantization robustness in OOD detection. Data augmentation may be more beneficial than large-scale pretraining for quantization robustness.

Abstract: Vision transformers have shown remarkable performance in vision tasks, but enabling them for accessible and real-time use is still challenging. Quantization reduces memory and inference costs at the risk of performance loss. Strides have been made to mitigate low precision issues mainly by understanding in-distribution (ID) task behaviour, but the attention mechanism may provide insight on quantization attributes by exploring out-of-distribution (OOD) situations. We investigate the behaviour of quantized small-variant popular vision transformers (DeiT, DeiT3, and ViT) on common OOD datasets. ID analyses show the initial instabilities of 4-bit models, particularly of those trained on the larger ImageNet-22k, as the strongest FP32 model, DeiT3, sharply drop 17% from quantization error to be one of the weakest 4-bit models. While ViT shows reasonable quantization robustness for ID calibration, OOD detection reveals more: ViT and DeiT3 pretrained on ImageNet-22k respectively experienced a 15.0% and 19.2% average quantization delta in AUPR-out between full precision to 4-bit while their ImageNet-1k-only counterparts experienced a 9.5% and 12.0% delta. Overall, our results suggest pretraining on large scale datasets may hinder low-bit quantization robustness in OOD detection and that data augmentation may be a more beneficial option.

[474] Preserving Localized Patch Semantics in VLMs

Parsa Esmaeilkhani, Longin Jan Latecki

Main category: cs.CV

TL;DR: Logit Lens Loss (LLL) improves vision-language models by preserving visual token locality, enabling meaningful object confidence maps and better vision-centric task performance without architectural changes.

Details

Motivation: Logit Lens visualization for VLMs suffers because visual content diffuses to language tokens, destroying locality of visual information and making it unusable for explainability. Need to preserve visual representation in image tokens.

Method: Proposes Logit Lens Loss (LLL) - a complementary loss to next-token prediction that makes visual token embeddings semantically aligned with textual concepts describing their image regions. Constrains mixing of image/text tokens in self-attention to prevent loss of localized visual information.

Result: LLL makes Logit Lens practically relevant by producing meaningful object confidence maps in images. Also improves performance on vision-centric tasks like segmentation without attaching special heads.

Conclusion: LLL effectively preserves visual locality in VLMs, enabling better explainability through Logit Lens visualization and improving vision task performance without architectural modifications.

Abstract: Logit Lens has been proposed for visualizing tokens that contribute most to LLM answers. Recently, Logit Lens was also shown to be applicable in autoregressive Vision-Language Models (VLMs), where it illustrates the conceptual content of image tokens in the form of heatmaps, e.g., which image tokens are likely to depict the concept of cat in a given image. However, the visual content of image tokens often gets diffused to language tokens, and consequently, the locality of visual information gets mostly destroyed, which renders Logit Lens visualization unusable for explainability. To address this issue, we introduce a complementary loss to next-token prediction (NTP) to prevent the visual tokens from losing the visual representation inherited from corresponding image patches. The proposed Logit Lens Loss (LLL) is designed to make visual token embeddings more semantically aligned with the textual concepts that describe their image regions (e.g., patches containing a cat with the word “cat”), without requiring any architectural modification or large-scale training. This way, LLL constrains the mixing of image and text tokens in the self-attention layers in order to prevent image tokens from losing their localized visual information. As our experiments show, LLL not only makes Logit Lens practically relevant by producing meaningful object confidence maps in images, but also improves performance on vision-centric tasks like segmentation without attaching any special heads.

[475] Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units

Zhe Ling, Sicheng Yu, Danyu Yang

Main category: cs.CV

TL;DR: Proposes SW-PS+LRU framework for rotation-invariant online handwritten character recognition using sliding window path signatures and linear recurrent units.

Details

Motivation: Online handwritten character recognition suffers from rotational deformations that disrupt stroke spatial layout, reducing accuracy. Need for rotation-invariant features remains challenging.

Method: Uses Sliding Window Path Signature (SW-PS) to capture local structural features, combined with lightweight Linear Recurrent Units (LRU) classifier that combines RNN’s incremental processing with SSM’s parallel training.

Result: Achieved 99.62% (digits), 96.67% (English upper letters), and 94.33% (Chinese radicals) accuracy on CASIA-OLHWDB1.1 dataset with random rotation up to ±180°. Outperforms competing models in convergence speed and test accuracy.

Conclusion: SW-PS+LRU framework effectively addresses rotation invariance problem in online handwritten character recognition, demonstrating superior performance across different character types.

Abstract: Online handwritten character recognition leverages stroke order and dynamic features, which generally provide higher accuracy and robustness compared with offline recognition. However, in practical applications, rotational deformations can disrupt the spatial layout of strokes, substantially reducing recognition accuracy. Extracting rotation-invariant features therefore remains a challenging open problem. In this work, we employ the Sliding Window Path Signature (SW-PS) to capture local structural features of characters, and introduce the lightweight Linear Recurrent Units (LRU) as the classifier. The LRU combine the fast incremental processing capability of recurrent neural networks (RNN) with the efficient parallel training of state space models (SSM), while reliably modelling dynamic stroke characteristics. We conducted recognition experiments with random rotation angle up to $\pm 180^{\circ}$ on three subsets of the CASIA-OLHWDB1.1 dataset: digits, English upper letters, and Chinese radicals. The accuracies achieved after ensemble learning were $99.62%$, $96.67%$, and $94.33%$, respectively. Experimental results demonstrate that the proposed SW-PS+LRU framework consistently surpasses competing models in both convergence speed and test accuracy.

[476] Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Youliang Zhang, Zhengguang Zhou, Zhentao Yu, Ziyao Huang, Teng Hu, Sen Liang, Guozhen Zhang, Ziqiao Peng, Shunkai Li, Yi Chen, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Xiu Li

Main category: cs.CV

TL;DR: InteractAvatar: A dual-stream framework for generating talking avatars performing grounded human-object interactions with environmental perception and audio-visual synchronization

Details

Motivation: Existing talking avatar generation methods focus on simple human motion but lack the ability to perform grounded human-object interactions (GHOI) that require environmental perception and text-aligned interactions with surrounding objects.

Method: Proposes InteractAvatar with dual-stream framework: 1) Perception and Interaction Module (PIM) for text-aligned interaction motions using detection-enhanced perception, and 2) Audio-Interaction Aware Generation Module (AIM) for synthesizing talking avatars with object interactions. Uses motion-to-video aligner for parallel co-generation.

Result: Establishes GroundedInter benchmark for GHOI video generation evaluation. Extensive experiments demonstrate effectiveness in generating grounded human-object interactions for talking avatars.

Conclusion: InteractAvatar successfully addresses the control-quality dilemma in GHOI generation by decoupling perception/planning from video synthesis, enabling realistic talking avatars with object interactions.

Abstract: Generating talking avatars is a fundamental task in video generation. Although existing methods can generate full-body talking avatars with simple human motion, extending this task to grounded human-object interaction (GHOI) remains an open challenge, requiring the avatar to perform text-aligned interactions with surrounding objects. This challenge stems from the need for environmental perception and the control-quality dilemma in GHOI generation. To address this, we propose a novel dual-stream framework, InteractAvatar, which decouples perception and planning from video synthesis for grounded human-object interaction. Leveraging detection to enhance environmental perception, we introduce a Perception and Interaction Module (PIM) to generate text-aligned interaction motions. Additionally, an Audio-Interaction Aware Generation Module (AIM) is proposed to synthesize vivid talking avatars performing object interactions. With a specially designed motion-to-video aligner, PIM and AIM share a similar network structure and enable parallel co-generation of motions and plausible videos, effectively mitigating the control-quality dilemma. Finally, we establish a benchmark, GroundedInter, for evaluating GHOI video generation. Extensive experiments and comparisons demonstrate the effectiveness of our method in generating grounded human-object interactions for talking avatars. Project page: https://interactavatar.github.io

[477] FSCA-Net: Feature-Separated Cross-Attention Network for Robust Multi-Dataset Training

Yuehai Chen

Main category: cs.CV

TL;DR: FSCA-Net is a crowd counting framework that disentangles domain-invariant and domain-specific features using cross-attention fusion and mutual information optimization to improve cross-dataset generalization.

Details

Motivation: Crowd counting models suffer from performance degradation across diverse environments due to domain discrepancies. Joint training on multiple datasets causes negative transfer as shared and domain-specific representations become entangled, limiting real-world applicability.

Method: Proposes FSCA-Net with explicit feature disentanglement into domain-invariant and domain-specific components. Uses cross-attention fusion module to model interactions between components, and mutual information optimization to maximize consistency among domain-invariant features while minimizing redundancy among domain-specific ones.

Result: Extensive experiments on multiple crowd counting benchmarks show FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization performance.

Conclusion: FSCA-Net provides a robust and scalable solution for real-world crowd analysis by addressing domain discrepancy challenges through explicit feature disentanglement and adaptive fusion mechanisms.

Abstract: Crowd counting plays a vital role in public safety, traffic regulation, and smart city management. However, despite the impressive progress achieved by CNN- and Transformer-based models, their performance often deteriorates when applied across diverse environments due to severe domain discrepancies. Direct joint training on multiple datasets, which intuitively should enhance generalization, instead results in negative transfer, as shared and domain-specific representations become entangled. To address this challenge, we propose the Feature Separation and Cross-Attention Network FSCA-Net, a unified framework that explicitly disentangles feature representations into domain-invariant and domain-specific components. A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer while preserving dataset-specific discriminability. Furthermore, a mutual information optimization objective is introduced to maximize consistency among domain-invariant features and minimize redundancy among domain-specific ones, promoting complementary shared-private representations. Extensive experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization, providing a robust and scalable solution for real-world crowd analysis.

[478] Toward Cognitive Supersensing in Multimodal Large Language Model

Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu, Xinzhuo Li, Zhengyuan Li, Jingyuan Zhu, Yunhan Zhong, Fangzhou Lan, Jianguo Cao, James M. Rehg, Heng Ji, Ismini Lourentzou, Xu Cao

Main category: cs.CV

TL;DR: Cognitive Supersensing trains MLLMs with visual imagery capabilities using latent visual embeddings and reinforcement learning to improve complex cognitive reasoning in visual tasks.

Details

Motivation: Current MLLMs excel at perceptual tasks but struggle with complex cognitive problems requiring visual memory and reasoning. Existing approaches focus on text-based Chain-of-Thought reasoning, neglecting visual reasoning mechanisms analogous to human visuospatial sketchpad and visual imagery.

Method: Introduces Cognitive Supersensing training paradigm with Latent Visual Imagery Prediction (LVIP) head that learns sequences of visual cognitive latent embeddings aligned with answers, forming vision-based internal reasoning chains. Adds reinforcement learning stage to optimize text reasoning paths based on grounded visual latent.

Result: MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench (comprehensive VQA benchmark assessing five cognitive dimensions) and show superior generalization on out-of-domain mathematics and science VQA benchmarks.

Conclusion: Internal visual imagery is key to bridging the gap between perceptual recognition and cognitive understanding in MLLMs. The approach demonstrates that visual reasoning mechanisms are essential for solving complex cognitive problems requiring visual memory.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuospatial sketchpad and visual imagery. To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on this grounded visual latent. To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions. Extensive experiments demonstrate that MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench and exhibit superior generalization on out-of-domain mathematics and science VQA benchmarks, suggesting that internal visual imagery is potentially key to bridging the gap between perceptual recognition and cognitive understanding. We will open-source the CogSense-Bench and our model weights.

[479] Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd

Yejin Son, Saejin Kim, Dongjun Min, Younjae Yu

Main category: cs.CV

TL;DR: Multimodal UNcommonsense (MUN) benchmark evaluates models on surprising visual-language scenarios, with a retrieval-based ICL framework (R-ICL) using Multimodal Ensemble Retriever (MER) to improve reasoning in atypical settings.

Details

Motivation: Commonsense reasoning in multimodal contexts remains challenging, especially for scenarios that deviate from typical visual or contextual expectations. Current models struggle with surprising or unlikely outcomes in visual-language pairs.

Method: Proposes MUN benchmark with surprising visual-language pairs, and R-ICL framework with Multimodal Ensemble Retriever (MER) to identify relevant exemplars even when image-text pairs are discordant, transferring reasoning from larger to smaller models without training.

Result: R-ICL shows average 8.3% improvement over baseline ICL methods on MUN benchmark, demonstrating effectiveness in low-frequency, atypical settings.

Conclusion: MUN enables evaluation of visual-language models’ robustness in non-prototypical scenarios, opening directions for improving adaptability in culturally diverse, real-world contexts.

Abstract: Commonsense reasoning in multimodal contexts remains a foundational challenge in artificial intelligence. We introduce Multimodal UNcommonsense(MUN), a benchmark designed to evaluate models’ ability to handle scenarios that deviate from typical visual or contextual expectations. MUN pairs visual scenes with surprising or unlikely outcomes described in natural language, prompting models to either rationalize seemingly odd images using everyday logic or uncover unexpected interpretations in ordinary scenes. To support this task, we propose a retrieval-based in-context learning (R-ICL) framework that transfers reasoning capabilities from larger models to smaller ones without additional training. Leveraging a novel Multimodal Ensemble Retriever (MER), our method identifies semantically relevant exemplars even when image and text pairs are deliberately discordant. Experiments show an average improvement of 8.3% over baseline ICL methods, highlighting the effectiveness of R-ICL in low-frequency, atypical settings. MUN opens new directions for evaluating and improving visual-language models’ robustness and adaptability in real-world, culturally diverse, and non-prototypical scenarios.

[480] One-Step Diffusion for Perceptual Image Compression

Yiwen Jia, Hao Wei, Yanhui Zhou, Chenyang Ge

Main category: cs.CV

TL;DR: Single-step diffusion image compression method that achieves 46× faster inference while maintaining comparable performance to multi-step diffusion approaches.

Details

Motivation: Diffusion-based image compression methods deliver high perceptual quality but suffer from significant inference latency and computational overhead due to requiring many denoising steps during decoding.

Method: Proposes a diffusion-based image compression method requiring only a single-step diffusion process. Introduces a discriminator operating on compact feature representations instead of raw pixels to enhance perceptual quality, leveraging features’ ability to better capture high-level texture and structural details.

Result: Achieves comparable compression performance while offering 46× faster inference speed compared to recent diffusion-based approaches.

Conclusion: The proposed single-step diffusion compression method significantly reduces inference latency while maintaining perceptual quality, making diffusion-based compression more practical for deployment.

Abstract: Diffusion-based image compression methods have achieved notable progress, delivering high perceptual quality at low bitrates. However, their practical deployment is hindered by significant inference latency and heavy computational overhead, primarily due to the large number of denoising steps required during decoding. To address this problem, we propose a diffusion-based image compression method that requires only a single-step diffusion process, significantly improving inference speed. To enhance the perceptual quality of reconstructed images, we introduce a discriminator that operates on compact feature representations instead of raw pixels, leveraging the fact that features better capture high-level texture and structural details. Experimental results show that our method delivers comparable compression performance while offering a 46$\times$ faster inference speed compared to recent diffusion-based approaches. The source code and models are available at https://github.com/cheesejiang/OSDiff.

[481] SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models

Haobo Wang, Weiqi Luo, Xiaojun Jia, Xiaochun Cao

Main category: cs.CV

TL;DR: SGHA-Attack improves transfer-based adversarial attacks on vision-language models by using multiple semantic references and hierarchical feature alignment across intermediate layers.

Details

Motivation: Current transfer-based adversarial attacks on VLMs overfit to surrogate models by relying on single references and focusing only on final-layer alignment, which limits transferability across heterogeneous VLMs and underutilizes intermediate semantic information.

Method: Proposes SGHA-Attack with semantic-guided hierarchical alignment: 1) Creates reference pool using text-to-image model conditioned on target prompt, selects Top-K most semantically relevant anchors; 2) Aligns intermediate visual representations at multiple depths with global and spatial granularities; 3) Synchronizes intermediate visual and textual features in shared latent subspace for early cross-modal supervision.

Result: Extensive experiments show SGHA-Attack achieves stronger targeted transferability than prior methods on open-source and commercial black-box VLMs, and remains robust against preprocessing and purification defenses.

Conclusion: SGHA-Attack demonstrates that leveraging multiple semantic references and hierarchical alignment across intermediate layers significantly improves transferability of adversarial attacks across diverse vision-language models.

Abstract: Large vision-language models (VLMs) are vulnerable to transfer-based adversarial perturbations, enabling attackers to optimize on surrogate models and manipulate black-box VLM outputs. Prior targeted transfer attacks often overfit surrogate-specific embedding space by relying on a single reference and emphasizing final-layer alignment, which underutilizes intermediate semantics and degrades transfer across heterogeneous VLMs. To address this, we propose SGHA-Attack, a Semantic-Guided Hierarchical Alignment framework that adopts multiple target references and enforces intermediate-layer consistency. Concretely, we generate a visually grounded reference pool by sampling a frozen text-to-image model conditioned on the target prompt, and then carefully select the Top-K most semantically relevant anchors under the surrogate to form a weighted mixture for stable optimization guidance. Building on these anchors, SGHA-Attack injects target semantics throughout the feature hierarchy by aligning intermediate visual representations at both global and spatial granularities across multiple depths, and by synchronizing intermediate visual and textual features in a shared latent subspace to provide early cross-modal supervision before the final projection. Extensive experiments on open-source and commercial black-box VLMs show that SGHA-Attack achieves stronger targeted transferability than prior methods and remains robust under preprocessing and purification defenses.

Wencan Cheng, Gim Hee Lee

Main category: cs.CV

TL;DR: HandMCM: A novel 3D hand pose estimation method using state space models (Mamba) with local information injection/filtering and correspondence modeling to handle occlusions, enhanced by multi-modal image features.

Details

Motivation: 3D hand pose estimation is crucial for human-computer interaction applications like AR, but faces challenges from self-occlusion and object interactions causing severe occlusions.

Method: Proposes HandMCM based on state space models (Mamba) with modules for local information injection/filtering and correspondence modeling to learn dynamic kinematic topology across occlusion scenarios, integrated with multi-modal image features.

Result: Significantly outperforms current state-of-the-art methods on three benchmark datasets, particularly in challenging scenarios with severe occlusions.

Conclusion: Demonstrates potential to advance accuracy and reliability of 3D hand pose estimation in practical applications by effectively handling occlusions through novel Mamba-based architecture with multi-modal features.

Abstract: 3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.

[483] Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages

Zhixiong Yue, Zixuan Ni, Feiyang Ye, Jinshan Zhang, Sheng Shen, Zhenpeng Mi

Main category: cs.CV

TL;DR: TAFS-GRPO: A novel RL framework for training flow matching text-to-image models to become efficient few-step generators aligned with human preferences through temperature annealing and group relative policy optimization.

Details

Motivation: Existing RL approaches for flow matching models require many denoising steps and suffer from sparse, imprecise reward signals leading to suboptimal human preference alignment in text-to-image generation.

Method: Proposes Temperature Annealed Few-step Sampling with Group Relative Policy Optimization (TAFS-GRPO) that iteratively injects adaptive temporal noise onto one-step samples, annealing outputs to introduce stochasticity while preserving semantic integrity. Uses step-aware advantage integration with GRPO to provide dense, step-specific rewards without requiring differentiable reward functions.

Result: Extensive experiments show TAFS-GRPO achieves strong performance in few-step text-to-image generation and significantly improves alignment of generated images with human preferences.

Conclusion: The proposed framework effectively addresses limitations of existing RL approaches for flow matching models, enabling efficient few-step generation with better human preference alignment through temperature annealing and group relative policy optimization.

Abstract: Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few step text to image generators. However, existing RL based approaches for flow matching models typically rely on numerous denoising steps, while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature Annealed Few step Sampling with Group Relative Policy Optimization (TAFS GRPO), a novel framework for training flow matching text to image models into efficient few step generators well aligned with human preferences. Our method iteratively injects adaptive temporal noise onto the results of one step samples. By repeatedly annealing the model’s sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step aware advantage integration mechanism combines the GRPO to avoid the need for the differentiable of reward function and provide dense and step specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS GRPO achieves strong performance in few step text to image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be available to facilitate further research.

[484] Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework

Wenzhuo Zhao, Keren Fu, Jiahao He, Xiaohong Liu, Qijun Zhao, Guangtao Zhai

Main category: cs.CV

TL;DR: Samba is a pure Mamba-based architecture for salient object detection that achieves state-of-the-art performance across multiple SOD tasks with improved computational efficiency.

Details

Motivation: Existing SOD models are limited by CNN's restricted receptive fields and Transformer's quadratic computational complexity. The emerging Mamba architecture offers a promising balance between global receptive fields and computational efficiency for SOD tasks.

Method: Proposes Saliency Mamba (Samba) with saliency-guided Mamba blocks using spatial neighborhood scanning to preserve spatial continuity, and context-aware upsampling for hierarchical feature alignment. Samba+ extends this with multi-task joint training, hub-and-spoke graph attention for cross-modal fusion, and modality-anchored continual learning to handle arbitrary modalities.

Result: Samba outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost. Samba+ achieves even better results using a single versatile model trained in a multi-task manner.

Conclusion: The Mamba-based Samba framework provides an effective solution for various SOD tasks, balancing computational efficiency and performance, with Samba+ offering a unified approach to handle multiple modalities and tasks.

Abstract: Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As one step further, to avoid the “task-specific” problem as in previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle challenges encountered in input of arbitrary modalities and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets by using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.

[485] UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception

Wenzhuo Liu, Qiannan Guo, Zhen Wang, Wenshuo Wang, Lei Yang, Yicheng Qiao, Lening Wang, Zhiwei Li, Chen Lv, Shanghang Zhang, Junqiang Xi, Huaping Liu

Main category: cs.CV

TL;DR: UV-M3TL is a multimodal multi-task learning framework for driver assistance systems that simultaneously handles driver behavior, emotion, vehicle behavior, and traffic context recognition while mitigating inter-task negative transfer through dual-branch feature modeling and adaptive loss mechanisms.

Details

Motivation: ADAS systems need to understand both human driver behavior and navigation context, but jointly learning these heterogeneous tasks causes inter-task negative transfer that impairs system performance. There's a need for a unified framework that can handle multiple perception tasks simultaneously without performance degradation.

Method: Proposes UV-M3TL with two core components: 1) Dual-branch spatial channel multimodal embedding (DB-SCME) that uses a dual-branch structure to explicitly model task-shared and task-specific features, enhancing cross-task knowledge transfer while mitigating conflicts. 2) Adaptive feature-decoupled multi-task loss (AFD-Loss) that introduces adaptive weighting based on learning dynamics and feature decoupling constraints to stabilize joint optimization and learn diverse multi-task representations.

Result: Achieves state-of-the-art performance across all four tasks (driver behavior, driver emotion, vehicle behavior, traffic context) on the AIDE dataset. Demonstrates versatility by achieving strong performance on additional public benchmarks (BDD100K, CityScapes, NYUD-v2, PASCAL-Context) across diverse task combinations, attaining SOTA results on most tasks.

Conclusion: UV-M3TL provides an effective unified framework for multimodal multi-task learning in ADAS applications, successfully mitigating inter-task negative transfer while maintaining strong performance across diverse perception tasks. The framework demonstrates both specialization for driver assistance tasks and generalizability to other multi-task perception benchmarks.

Abstract: Advanced Driver Assistance Systems (ADAS) need to understand human driver behavior while perceiving their navigation context, but jointly learning these heterogeneous tasks would cause inter-task negative transfer and impair system performance. Here, we propose a Unified and Versatile Multimodal Multi-Task Learning (UV-M3TL) framework to simultaneously recognize driver behavior, driver emotion, vehicle behavior, and traffic context, while mitigating inter-task negative transfer. Our framework incorporates two core components: dual-branch spatial channel multimodal embedding (DB-SCME) and adaptive feature-decoupled multi-task loss (AFD-Loss). DB-SCME enhances cross-task knowledge transfer while mitigating task conflicts by employing a dual-branch structure to explicitly model salient task-shared and task-specific features. AFD-Loss improves the stability of joint optimization while guiding the model to learn diverse multi-task representations by introducing an adaptive weighting mechanism based on learning dynamics and feature decoupling constraints. We evaluate our method on the AIDE dataset, and the experimental results demonstrate that UV-M3TL achieves state-of-the-art performance across all four tasks. To further prove the versatility, we evaluate UV-M3TL on additional public multi-task perception benchmarks (BDD100K, CityScapes, NYUD-v2, and PASCAL-Context), where it consistently delivers strong performance across diverse task combinations, attaining state-of-the-art results on most tasks.

[486] Token Pruning for In-Context Generation in Diffusion Transformers

Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun

Main category: cs.CV

TL;DR: ToPi is a training-free token pruning framework for Diffusion Transformers that reduces computational costs in in-context image generation by selectively pruning reference tokens based on their contribution to the generation process.

Details

Motivation: In-context generation in Diffusion Transformers enables controllable image-to-image generation but creates computational bottlenecks due to increased sequence length from concatenating reference examples. Existing token reduction techniques are inadequate because they apply uniform strategies and fail to account for the asymmetric roles of reference contexts versus target latents across spatial, temporal, and functional dimensions.

Method: ToPi uses offline calibration-driven sensitivity analysis to identify pivotal attention layers as a proxy for redundancy estimation. It then derives an influence metric to quantify each context token’s contribution for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. The framework is training-free and specifically tailored for in-context generation in DiTs.

Result: Empirical evaluations show ToPi achieves over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.

Conclusion: ToPi effectively addresses the computational bottleneck in in-context generation for Diffusion Transformers through a principled token pruning approach that preserves generation quality while significantly improving efficiency.

Abstract: In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.

[487] Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

Susan Liang, Chao Huang, Filippos Bellos, Yolo Yunlong Tang, Qianxiang Shen, Jing Bi, Luchuan Song, Zeliang Zhang, Jason Corso, Chenliang Xu

Main category: cs.CV

TL;DR: Omni-Judge: Using omni-modal LLMs as human-aligned judges for evaluating text-to-audio-video generation, achieving competitive correlation with human judgments on semantic tasks but limited on high-temporal-resolution metrics.

Details

Motivation: Current text-to-video models can generate high-fidelity videos with synchronized audio, but evaluating these tri-modal outputs is challenging. Human evaluation is costly and doesn't scale, while traditional automatic metrics focus on isolated modality pairs and lack interpretability.

Method: Introduces Omni-Judge, which leverages omni-modal large language models (omni-LLMs) that naturally process audio, video, and text. The approach uses these models as judges across nine perceptual and alignment metrics for text-conditioned audio-video generation.

Result: Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks like audio-text alignment, video-text alignment, and audio-video-text coherence. However, it underperforms on high-FPS perceptual metrics (video quality and audio-video synchronization) due to limited temporal resolution.

Conclusion: Omni-LLMs show promise as unified evaluators for multi-modal generation, offering interpretable explanations that expose inconsistencies, but have current limitations in temporal resolution. The approach enables practical downstream uses like feedback-based refinement.

Abstract: State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Driven by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.

[488] PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Minh-Quan Le, Gaurav Mittal, Cheng Zhao, David Gu, Dimitris Samaras, Mei Chen

Main category: cs.CV

TL;DR: PISCES introduces an annotation-free post-training method for text-to-video generation using dual optimal transport-aligned rewards to improve video quality and semantic alignment without human preference data.

Details

Motivation: Current reward-based post-training methods for text-to-video generation either require large-scale human preference annotations or use misaligned embeddings from pre-trained vision-language models, limiting scalability and providing suboptimal supervision.

Method: PISCES uses a Dual Optimal Transport-aligned Rewards module with two components: 1) Distributional OT-aligned Quality Reward for visual quality and temporal coherence, and 2) Discrete Token-level OT-aligned Semantic Reward for spatio-temporal correspondence between text and video tokens.

Result: Experiments show PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies validating effectiveness. The method works with multiple optimization paradigms including direct backpropagation and reinforcement learning fine-tuning.

Conclusion: PISCES successfully improves annotation-free reward supervision for text-to-video generation through optimal transport alignment, providing scalable and effective post-training without human annotations.

Abstract: Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present $\texttt{PISCES}$, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, $\texttt{PISCES}$ uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, $\texttt{PISCES}$ is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that $\texttt{PISCES}$ outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

[489] Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks

Bohan Zeng, Kaixin Zhu, Daili Hua, Bozhou Li, Chengzhuo Tong, Yuran Wang, Xinyi Huang, Yifan Dai, Zixiang Zhang, Yifan Yang, Zhou Liu, Hao Liang, Xiaochen Ma, Ruichuan An, Tianyi Bai, Hongcheng Gao, Junbo Niu, Yang Shi, Xinlong Chen, Yue Ding, Minglei Shi, Kai Zeng, Yiwen Tang, Yuanxing Zhang, Pengfei Wan, Xintao Wang, Wentao Zhang

Main category: cs.CV

TL;DR: Proposes a unified design specification for world models to address fragmentation in current AI research, advocating for integration of interaction, perception, reasoning, and spatial representation rather than task-specific approaches.

Details

Motivation: Current world model research is fragmented with approaches focused on injecting world knowledge into isolated tasks (visual prediction, 3D estimation, symbol grounding) rather than establishing unified frameworks. Task-specific integrations yield performance gains but lack systematic coherence for holistic world understanding.

Method: Analyzes limitations of fragmented approaches and proposes a unified design specification for world models. Suggests that robust world models should be normative frameworks that integrally incorporate interaction, perception, symbolic reasoning, and spatial representation.

Result: Provides a structured perspective to guide future research toward more general, robust, and principled models of the world, moving beyond task-specific approaches to holistic world understanding.

Conclusion: World models need unified frameworks that integrate multiple capabilities rather than fragmented task-specific approaches. The proposed design specification aims to advance research toward more comprehensive and principled world understanding.

Abstract: World models have emerged as a critical frontier in AI research, aiming to enhance large models by infusing them with physical dynamics and world knowledge. The core objective is to enable agents to understand, predict, and interact with complex environments. However, current research landscape remains fragmented, with approaches predominantly focused on injecting world knowledge into isolated tasks, such as visual prediction, 3D estimation, or symbol grounding, rather than establishing a unified definition or framework. While these task-specific integrations yield performance gains, they often lack the systematic coherence required for holistic world understanding. In this paper, we analyze the limitations of such fragmented approaches and propose a unified design specification for world models. We suggest that a robust world model should not be a loose collection of capabilities but a normative framework that integrally incorporates interaction, perception, symbolic reasoning, and spatial representation. This work aims to provide a structured perspective to guide future research toward more general, robust, and principled models of the world.

[490] Federated Vision Transformer with Adaptive Focal Loss for Medical Image Classification

Xinyuan Zhao, Yihang Wu, Ahmad Chaddad, Tareef Daqqaq, Reem Kateb

Main category: cs.CV

TL;DR: Federated learning framework for medical image classification using dynamic adaptive focal loss and client-aware aggregation to address data heterogeneity and class imbalance.

Details

Motivation: Address challenges in federated learning for medical imaging where data privacy restricts access to original datasets, and local clients face data heterogeneity and class imbalance issues that hinder model generalization.

Method: Proposes a FL framework with dynamic adaptive focal loss (DAFL) that adjusts based on each client’s sample distribution, and a client-aware weighted aggregation strategy that adapts to data size and characteristics.

Result: Outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet on three medical datasets (ISIC, Ocular Disease, RSNA-ICH) with accuracy improvements from 0.98% to 41.69%.

Conclusion: The proposed FL framework effectively addresses class imbalance and client heterogeneity in federated medical image classification, demonstrating superior performance over various baseline models.

Abstract: While deep learning models like Vision Transformer (ViT) have achieved significant advances, they typically require large datasets. With data privacy regulations, access to many original datasets is restricted, especially medical images. Federated learning (FL) addresses this challenge by enabling global model aggregation without data exchange. However, the heterogeneity of the data and the class imbalance that exist in local clients pose challenges for the generalization of the model. This study proposes a FL framework leveraging a dynamic adaptive focal loss (DAFL) and a client-aware aggregation strategy for local training. Specifically, we design a dynamic class imbalance coefficient that adjusts based on each client’s sample distribution and class data distribution, ensuring minority classes receive sufficient attention and preventing sparse data from being ignored. To address client heterogeneity, a weighted aggregation strategy is adopted, which adapts to data size and characteristics to better capture inter-client variations. The classification results on three public datasets (ISIC, Ocular Disease and RSNA-ICH) show that the proposed framework outperforms DenseNet121, ResNet50, ViT-S/16, ViT-L/32, FedCLIP, Swin Transformer, CoAtNet, and MixNet in most cases, with accuracy improvements ranging from 0.98% to 41.69%. Ablation studies on the imbalanced ISIC dataset validate the effectiveness of the proposed loss function and aggregation strategy compared to traditional loss functions and other FL approaches. The codes can be found at: https://github.com/AIPMLab/ViT-FLDAF.

[491] ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

Tianyu Yang, ChenWei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua

Main category: cs.CV

TL;DR: ReCALL framework addresses capability degradation in MLLM-based composed image retrieval by diagnosing blind spots, generating corrective data, and refining the retriever through continual training.

Details

Motivation: Adapting generative Multimodal Large Language Models (MLLMs) for discriminative retrieval tasks causes paradigm conflict and capability degradation, where the retriever loses fine-grained reasoning abilities after adaptation.

Method: Three-step pipeline: 1) Diagnose cognitive blind spots via self-guided informative instance mining, 2) Generate corrective instructions and triplets using CoT prompting of foundation MLLM with VQA-based consistency filtering, 3) Refine retriever through continual training on triplets with grouped contrastive scheme.

Result: Extensive experiments on CIRR and FashionIQ datasets show ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance in composed image retrieval.

Conclusion: ReCALL effectively addresses capability degradation in MLLM-based retrieval by aligning discriminative embedding space with intrinsic compositional reasoning, improving fine-grained visual-semantic understanding.

Abstract: Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: adapting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline: Firstly, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.

[492] Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

Yinchao Ma, Qiang Zhou, Zhibin Wang, Xianing Chen, Hanqing Yang, Jun Song, Bo Zheng

Main category: cs.CV

TL;DR: CaCoVID: A reinforcement learning-based token compression algorithm for video LLMs that optimizes token selection based on their contribution to correct predictions rather than attention scores.

Details

Motivation: Video LLMs suffer from computational overhead due to redundant video tokens during inference. Existing compression methods prioritize tokens with high attention scores, but the correlation between attention scores and actual contribution to correct answers is unclear.

Method: 1) RL-based framework with policy network to select optimal token combinations based on contribution to correct predictions; 2) Combinatorial policy optimization with online combination space sampling to reduce exploration space and accelerate convergence.

Result: Extensive experiments on diverse video understanding benchmarks demonstrate effectiveness of CaCoVID in compressing video tokens while maintaining performance.

Conclusion: CaCoVID provides a novel contribution-aware approach to video token compression that shifts from passive preservation to active discovery of optimal token combinations, improving computational efficiency for video LLMs.

Abstract: Video large language models have demonstrated remarkable capabilities in video understanding tasks. However, the redundancy of video tokens introduces significant computational overhead during inference, limiting their practical deployment. Many compression algorithms are proposed to prioritize retaining features with the highest attention scores to minimize perturbations in attention computations. However, the correlation between attention scores and their actual contribution to correct answers remains ambiguous. To address the above limitation, we propose a novel \textbf{C}ontribution-\textbf{a}ware token \textbf{Co}mpression algorithm for \textbf{VID}eo understanding (\textbf{CaCoVID}) that explicitly optimizes the token selection policy based on the contribution of tokens to correct predictions. First, we introduce a reinforcement learning-based framework that optimizes a policy network to select video token combinations with the greatest contribution to correct predictions. This paradigm shifts the focus from passive token preservation to active discovery of optimal compressed token combinations. Secondly, we propose a combinatorial policy optimization algorithm with online combination space sampling, which dramatically reduces the exploration space for video token combinations and accelerates the convergence speed of policy optimization. Extensive experiments on diverse video understanding benchmarks demonstrate the effectiveness of CaCoVID. Codes will be released.

[493] From Frames to Sequences: Temporally Consistent Human-Centric Dense Prediction

Xingyu Miao, Junting Dong, Qin Zhao, Yuhang Yang, Junhao Chen, Yang Long

Main category: cs.CV

TL;DR: A unified ViT-based model for temporally consistent human-centric dense prediction across videos, trained with synthetic data and geometric priors.

Details

Motivation: Existing human-centric dense prediction models suffer from temporal flickering under motion, occlusion, and lighting changes, and lack paired video supervision for multiple dense tasks.

Method: 1) Scalable synthetic data pipeline generating photorealistic human frames with motion-aligned sequences and pixel-accurate labels; 2) Unified ViT-based dense predictor with explicit human geometric prior via CSE embeddings; 3) Lightweight channel reweighting module for geometry-feature reliability; 4) Two-stage training combining static pretraining with dynamic sequence supervision.

Result: Achieves state-of-the-art performance on THuman2.1 and Hi4D benchmarks and generalizes effectively to in-the-wild videos.

Conclusion: The proposed approach successfully addresses temporal consistency in human-centric dense prediction through synthetic data generation and geometric prior integration.

Abstract: In this work, we focus on the challenge of temporally consistent human-centric dense prediction across video sequences. Existing models achieve strong per-frame accuracy but often flicker under motion, occlusion, and lighting changes, and they rarely have paired human video supervision for multiple dense tasks. We address this gap with a scalable synthetic data pipeline that generates photorealistic human frames and motion-aligned sequences with pixel-accurate depth, normals, and masks. Unlike prior static data synthetic pipelines, our pipeline provides both frame-level labels for spatial learning and sequence-level supervision for temporal learning. Building on this, we train a unified ViT-based dense predictor that (i) injects an explicit human geometric prior via CSE embeddings and (ii) improves geometry-feature reliability with a lightweight channel reweighting module after feature fusion. Our two-stage training strategy, combining static pretraining with dynamic sequence supervision, enables the model first to acquire robust spatial representations and then refine temporal consistency across motion-aligned sequences. Extensive experiments show that we achieve state-of-the-art performance on THuman2.1 and Hi4D and generalize effectively to in-the-wild videos.

[494] Moonworks Lunara Aesthetic II: An Image Variation Dataset

Yan Wang, Partho Hassan, Samiha Sadeka, Nada Soliman, M M Sayeef Abdullah, Sabit Hassan

Main category: cs.CV

TL;DR: Lunara Aesthetic II is a publicly released image dataset for evaluating contextual consistency in image generation/editing systems, featuring 2,854 anchor-linked variation pairs with identity-preserving contextual transformations.

Details

Motivation: To address the need for controlled evaluation of contextual consistency in modern image generation and editing systems, providing a dataset that enables benchmarking of identity preservation and contextual generalization.

Method: Created a dataset of 2,854 anchor-linked variation pairs from original art and photographs, applying contextual transformations (illumination, weather, viewpoint, composition, color tone, mood) while preserving underlying identity.

Result: Dataset shows high identity stability, strong target attribute realization, and robust aesthetic profile exceeding large-scale web datasets, released under Apache 2.0 license.

Conclusion: Lunara Aesthetic II provides valuable resources for benchmarking, fine-tuning, and analyzing contextual generalization and identity preservation in image generation systems with interpretable supervision.

Abstract: We introduce Lunara Aesthetic II, a publicly released, ethically sourced image dataset designed to support controlled evaluation and learning of contextual consistency in modern image generation and editing systems. The dataset comprises 2,854 anchor-linked variation pairs derived from original art and photographs created by Moonworks. Each variation pair applies contextual transformations, such as illumination, weather, viewpoint, scene composition, color tone, or mood; while preserving a stable underlying identity. Lunara Aesthetic II operationalizes identity-preserving contextual variation as a supervision signal while also retaining Lunara’s signature high aesthetic scores. Results show high identity stability, strong target attribute realization, and a robust aesthetic profile that exceeds large-scale web datasets. Released under the Apache 2.0 license, Lunara Aesthetic II is intended for benchmarking, fine-tuning, and analysis of contextual generalization, identity preservation, and edit robustness in image generation and image-to-image systems with interpretable, relational supervision. The dataset is publicly available at: https://huggingface.co/datasets/moonworks/lunara-aesthetic-image-variations.

[495] Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss

Enguang Fan

Main category: cs.CV

TL;DR: NetVLAD-based loop closure detection outperforms traditional DBoW on KITTI dataset with real-time performance using Faiss acceleration, offering better accuracy and robustness for SLAM systems.

Details

Motivation: Classic bag-of-words approaches like DBoW degrade under appearance change and perceptual aliasing, while deep learning-based VPR descriptors (NetVLAD, Transformers) offer stronger robustness but are considered computationally expensive for real-time SLAM applications.

Method: Empirically evaluate NetVLAD as LCD module vs DBoW on KITTI dataset, introduce Fine-Grained Top-K precision-recall curve for LCD settings, use Faiss-accelerated nearest-neighbor search for real-time query speed.

Result: NetVLAD achieves real-time query speed with Faiss acceleration while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM systems.

Conclusion: Deep learning-based visual place recognition descriptors like NetVLAD can be practically deployed for real-time loop closure detection in SLAM systems, overcoming previous computational barriers while providing superior performance.

Abstract: Loop closure detection (LCD) is a core component of simultaneous localization and mapping (SLAM): it identifies revisited places and enables pose-graph constraints that correct accumulated drift. Classic bag-of-words approaches such as DBoW are efficient but often degrade under appearance change and perceptual aliasing. In parallel, deep learning-based visual place recognition (VPR) descriptors (e.g., NetVLAD and Transformer-based models) offer stronger robustness, but their computational cost is often viewed as a barrier to real-time SLAM. In this paper, we empirically evaluate NetVLAD as an LCD module and compare it against DBoW on the KITTI dataset. We introduce a Fine-Grained Top-K precision-recall curve that better reflects LCD settings where a query may have zero or multiple valid matches. With Faiss-accelerated nearestneighbor search, NetVLAD achieves real-time query speed while improving accuracy and robustness over DBoW, making it a practical drop-in alternative for LCD in SLAM.

[496] VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR

Hail Song, Boram Yoon, Seokhwan Yang, Seoyoung Kang, Hyunjeong Kim, Henning Metzmacher, Woontack Woo

Main category: cs.CV

TL;DR: VRGaussianAvatar enables real-time full-body 3D Gaussian Splatting avatars in VR using only HMD tracking, with a parallel pipeline for pose estimation and stereoscopic rendering optimized via Binocular Batching.

Details

Motivation: To create realistic, real-time full-body avatars in virtual reality using minimal input (just head-mounted display tracking) while maintaining high visual quality and performance for interactive VR applications.

Method: Uses a parallel pipeline with VR Frontend (inverse kinematics for full-body pose estimation) and GA Backend (3D Gaussian Splatting avatar rendering). Introduces Binocular Batching to jointly process left/right eye views for efficient stereo rendering on high-res VR displays.

Result: System sustains interactive VR performance and outperforms image- and video-based mesh avatar baselines in user studies, achieving higher perceived appearance similarity, embodiment, and plausibility.

Conclusion: VRGaussianAvatar demonstrates effective real-time 3D Gaussian Splatting avatars for VR with minimal input requirements, offering improved visual quality and user experience over traditional mesh-based approaches.

Abstract: We present VRGaussianAvatar, an integrated system that enables real-time full-body 3D Gaussian Splatting (3DGS) avatars in virtual reality using only head-mounted display (HMD) tracking signals. The system adopts a parallel pipeline with a VR Frontend and a GA Backend. The VR Frontend uses inverse kinematics to estimate full-body pose and streams the resulting pose along with stereo camera parameters to the backend. The GA Backend stereoscopically renders a 3DGS avatar reconstructed from a single image. To improve stereo rendering efficiency, we introduce Binocular Batching, which jointly processes left and right eye views in a single batched pass to reduce redundant computation and support high-resolution VR displays. We evaluate VRGaussianAvatar with quantitative performance tests and a within-subject user study against image- and video-based mesh avatar baselines. Results show that VRGaussianAvatar sustains interactive VR performance and yields higher perceived appearance similarity, embodiment, and plausibility. Project page and source code are available at https://vrgaussianavatar.github.io.

[497] SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking

Yinchao Ma, Dengqing Yang, Zhangyu He, Wenfei Yang, Tianzhu Zhang

Main category: cs.CV

TL;DR: SMTrack is a novel visual tracking method using state space models (Mamba) for efficient long-range temporal modeling, achieving good performance with low computational cost.

Details

Motivation: Existing CNN and Transformer architectures struggle with long-range temporal dependencies in visual tracking, requiring complex modules or high computational costs. The authors aim to leverage state space models (like Mamba) for more efficient temporal modeling.

Method: Proposes State-aware Mamba Tracker (SMTrack) with a selective state-aware space model using state-wise parameters to capture diverse temporal cues. It enables long-range temporal interactions with linear computational complexity during training and uses hidden state propagation for efficient inference.

Result: Extensive experiments show SMTrack achieves promising tracking performance with low computational costs compared to existing methods.

Conclusion: SMTrack provides an effective and efficient temporal modeling paradigm for visual tracking using state space models, addressing limitations of conventional architectures while maintaining computational efficiency.

Abstract: Visual tracking aims to automatically estimate the state of a target object in a video sequence, which is challenging especially in dynamic scenarios. Thus, numerous methods are proposed to introduce temporal cues to enhance tracking robustness. However, conventional CNN and Transformer architectures exhibit inherent limitations in modeling long-range temporal dependencies in visual tracking, often necessitating either complex customized modules or substantial computational costs to integrate temporal cues. Inspired by the success of the state space model, we propose a novel temporal modeling paradigm for visual tracking, termed State-aware Mamba Tracker (SMTrack), providing a neat pipeline for training and tracking without needing customized modules or substantial computational costs to build long-range temporal dependencies. It enjoys several merits. First, we propose a novel selective state-aware space model with state-wise parameters to capture more diverse temporal cues for robust tracking. Second, SMTrack facilitates long-range temporal interactions with linear computational complexity during training. Third, SMTrack enables each frame to interact with previously tracked frames via hidden state propagation and updating, which releases computational costs of handling temporal cues during tracking. Extensive experimental results demonstrate that SMTrack achieves promising performance with low computational costs.

[498] FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding

Kangcong Li, Peng Ye, Lin Zhang, Chao Wang, Huafeng Qin, Tao Chen

Main category: cs.CV

TL;DR: FreshMem is a training-free memory network for streaming video understanding in MLLMs that combines frequency-space hybrid memory to balance short-term fidelity with long-term coherence.

Details

Motivation: Existing MLLM methods for streaming video understanding lack flexible adaptivity, causing irreversible detail loss and context fragmentation when transitioning from offline to online continuous perception.

Method: Proposes FreshMem with two synergistic modules: 1) Multi-scale Frequency Memory (MFM) projects overflowing frames into frequency coefficients with residual details to reconstruct global historical “gist”; 2) Space Thumbnail Memory (STM) discretizes continuous streams into episodic clusters using adaptive compression to create high-density space thumbnails.

Result: FreshMem significantly boosts Qwen2-VL baseline with gains of 5.20% on StreamingBench, 4.52% on OV-Bench, and 2.34% on OVO-Bench. As a training-free solution, it outperforms several fully fine-tuned methods.

Conclusion: FreshMem offers a highly efficient paradigm for long-horizon streaming video understanding in MLLMs by reconciling short-term fidelity with long-term coherence through frequency-space hybrid memory inspired by brain perception.

Abstract: Transitioning Multimodal Large Language Models (MLLMs) from offline to online streaming video understanding is essential for continuous perception. However, existing methods lack flexible adaptivity, leading to irreversible detail loss and context fragmentation. To resolve this, we propose FreshMem, a Frequency-Space Hybrid Memory network inspired by the brain’s logarithmic perception and memory consolidation. FreshMem reconciles short-term fidelity with long-term coherence through two synergistic modules: Multi-scale Frequency Memory (MFM), which projects overflowing frames into representative frequency coefficients, complemented by residual details to reconstruct a global historical “gist”; and Space Thumbnail Memory (STM), which discretizes the continuous stream into episodic clusters by employing an adaptive compression strategy to distill them into high-density space thumbnails. Extensive experiments show that FreshMem significantly boosts the Qwen2-VL baseline, yielding gains of 5.20%, 4.52%, and 2.34% on StreamingBench, OV-Bench, and OVO-Bench, respectively. As a training-free solution, FreshMem outperforms several fully fine-tuned methods, offering a highly efficient paradigm for long-horizon streaming video understanding.

Jiaming Cui, Shuai Zhou, Wenqiang Li, Ruifeng Qin, Feng Shen

Main category: cs.CV

TL;DR: CMAFNet is a cross-modal fusion network for transmission line defect detection that integrates RGB and depth data through feature purification and contextual integration, achieving state-of-the-art performance on small-scale defects.

Details

Motivation: Transmission line defect detection using UAVs faces challenges from small-scale defects, complex backgrounds, and illumination variations. RGB-based detectors struggle with geometrically subtle defects that lack chromatic contrast, requiring better integration of geometric information from depth data.

Method: Proposes CMAFNet with two main components: 1) Semantic Recomposition Module using dictionary-based feature purification via learned codebook to suppress modality-specific noise, and 2) Contextual Semantic Integration Framework using partial-channel attention to capture global spatial dependencies. Uses position-wise normalization for cross-modal alignment.

Result: Achieves 32.2% mAP@50 and 12.5% APs on TLRGBD benchmark (94.5% small objects), outperforming strongest baseline by 9.8 and 4.0 percentage points. Lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters.

Conclusion: CMAFNet effectively integrates RGB and depth modalities through principled purification and fusion, demonstrating superior performance for small-scale defect detection in complex aerial inspection scenarios with computational efficiency.

Abstract: Transmission line defect detection remains challenging for automated UAV inspection due to the dominance of small-scale defects, complex backgrounds, and illumination variations. Existing RGB-based detectors, despite recent progress, struggle to distinguish geometrically subtle defects from visually similar background structures under limited chromatic contrast. This paper proposes CMAFNet, a Cross-Modal Alignment and Fusion Network that integrates RGB appearance and depth geometry through a principled purify-then-fuse paradigm. CMAFNet consists of a Semantic Recomposition Module that performs dictionary-based feature purification via a learned codebook to suppress modality-specific noise while preserving defect-discriminative information, and a Contextual Semantic Integration Framework that captures global spatial dependencies using partial-channel attention to enhance structural semantic reasoning. Position-wise normalization within the purification stage enforces explicit reconstruction-driven cross-modal alignment, ensuring statistical compatibility between heterogeneous features prior to fusion. Extensive experiments on the TLRGBD benchmark, where 94.5% of instances are small objects, demonstrate that CMAFNet achieves 32.2% mAP@50 and 12.5% APs, outperforming the strongest baseline by 9.8 and 4.0 percentage points, respectively. A lightweight variant reaches 24.8% mAP50 at 228 FPS with only 4.9M parameters, surpassing all YOLO-based detectors while matching transformer-based methods at substantially lower computational cost.

[500] Physics Informed Generative AI Enabling Labour Free Segmentation For Microscopy Analysis

Salma Zahran, Zhou Ao, Zhengyang Zhang, Chen Chi, Chenchen Yuan, Yanming Wang

Main category: cs.CV

TL;DR: A framework using phase-field simulations and CycleGAN to generate realistic SEM images for automated materials segmentation without manual annotation.

Details

Motivation: Automating semantic segmentation of microscopy images is limited by expensive expert annotation. Physics-based simulations provide scalable labeling but suffer from domain gaps when applied to real experimental data with complex textures and noise.

Method: 1) Generate microstructural morphologies with perfect ground-truth masks using phase-field simulations. 2) Use CycleGAN for unpaired image-to-image translation to transform clean simulations into realistic SEM images. 3) Train a U-Net model exclusively on this synthetic data.

Result: U-Net trained on synthetic data achieved mean Boundary F1-Score of 0.90 and IoU of 0.88 on unseen experimental images. t-SNE and Shannon entropy analysis confirmed synthetic images are statistically indistinguishable from real data.

Conclusion: The framework successfully bridges simulation-to-reality gap, transforming data-scarce segmentation into data-abundant problem without manual annotation, enabling automated materials analysis.

Abstract: Semantic segmentation of microscopy images is a critical task for high-throughput materials characterisation, yet its automation is severely constrained by the prohibitive cost, subjectivity, and scarcity of expert-annotated data. While physics-based simulations offer a scalable alternative to manual labelling, models trained on such data historically fail to generalise due to a significant domain gap, lacking the complex textures, noise patterns, and imaging artefacts inherent to experimental data. This paper introduces a novel framework for labour-free segmentation that successfully bridges this simulation-to-reality gap. Our pipeline leverages phase-field simulations to generate an abundant source of microstructural morphologies with perfect, intrinsically-derived ground-truth masks. We then employ a Cycle-Consistent Generative Adversarial Network (CycleGAN) for unpaired image-to-image translation, transforming the clean simulations into a large-scale dataset of high-fidelity, realistic SEM images. A U-Net model, trained exclusively on this synthetic data, demonstrated remarkable generalisation when deployed on unseen experimental images, achieving a mean Boundary F1-Score of 0.90 and an Intersection over Union (IOU) of 0.88. Comprehensive validation using t-SNE feature-space projection and Shannon entropy analysis confirms that our synthetic images are statistically and featurally indistinguishable from the real data manifold. By completely decoupling model training from manual annotation, our generative framework transforms a data-scarce problem into one of data abundance, providing a robust and fully automated solution to accelerate materials discovery and analysis.

[501] Rethinking Genomic Modeling Through Optical Character Recognition

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

Main category: cs.CV

TL;DR: OpticalDNA is a vision-based framework that treats DNA sequences as visual documents for OCR-style understanding, achieving better performance with far fewer tokens than traditional language model approaches to genomics.

Details

Motivation: Current genomic foundation models use language model architectures that treat DNA as 1D token sequences, which is structurally misaligned with sparse and discontinuous genomic semantics. This leads to wasted computation on low-information background and prevents understanding-driven compression for long contexts.

Method: OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and document decoder. The encoder produces compact, reconstructible visual tokens for high-fidelity compression. It defines prompt-conditioned objectives over core genomic primitives: reading, region grounding, subsequence retrieval, and masked span completion.

Result: Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines. On sequences up to 450k bases, it achieves the best overall performance with nearly 20× fewer effective tokens, and surpasses models with up to 985× more activated parameters while tuning only 256k trainable parameters.

Conclusion: Vision-based approaches to genomic modeling through OCR-style document understanding offer significant advantages over traditional language model architectures, enabling more efficient and effective representation of DNA sequences with far fewer computational resources.

Abstract: Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision–language model with a \emph{visual DNA encoder} and a \emph{document decoder}, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives-reading, region grounding, subsequence retrieval, and masked span completion-thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly $20\times$ fewer effective tokens, and surpasses models with up to $985\times$ more activated parameters while tuning only 256k \emph{trainable} parameters.

[502] FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization

Yikun Ma, Yiqing Li, Jingwen Ye, Zhongkai Wu, Weidong Zhang, Lin Gao, Zhi Jin

Main category: cs.CV

TL;DR: FastPhysGS enables fast physics-based dynamic 3D Gaussian Splatting simulation using instance-aware particle filling and bidirectional graph decoupling optimization, achieving high-fidelity results in 1 minute with 7GB memory.

Details

Motivation: Extending 3D Gaussian Splatting to 4D physical simulation is challenging due to manual parameter tuning requirements, distillation from video diffusion models with limited generalization, and perceptual gaps when using LLMs/VLMs that ignore surface structure and yield unstable physics.

Method: Proposes FastPhysGS with two key components: (1) Instance-aware Particle Filling with Monte Carlo Importance Sampling to efficiently populate interior particles while preserving geometric fidelity, and (2) Bidirectional Graph Decoupling Optimization, an adaptive strategy that rapidly optimizes material parameters predicted from a Vision-Language Model.

Result: FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.

Conclusion: FastPhysGS provides a fast and robust framework for physics-based dynamic 3DGS simulation that addresses limitations of existing methods through efficient particle filling and adaptive optimization strategies.

Abstract: Extending 3D Gaussian Splatting (3DGS) to 4D physical simulation remains challenging. Based on the Material Point Method (MPM), existing methods either rely on manual parameter tuning or distill dynamics from video diffusion models, limiting the generalization and optimization efficiency. Recent attempts using LLMs/VLMs suffer from a text/image-to-3D perceptual gap, yielding unstable physics behavior. In addition, they often ignore the surface structure of 3DGS, leading to implausible motion. We propose FastPhysGS, a fast and robust framework for physics-based dynamic 3DGS simulation:(1) Instance-aware Particle Filling (IPF) with Monte Carlo Importance Sampling (MCIS) to efficiently populate interior particles while preserving geometric fidelity; (2) Bidirectional Graph Decoupling Optimization (BGDO), an adaptive strategy that rapidly optimizes material parameters predicted from a VLM. Experiments show FastPhysGS achieves high-fidelity physical simulation in 1 minute using only 7 GB runtime memory, outperforming prior works with broad potential applications.

[503] DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation

Tushar Anand, Maheswar Bora, Antitza Dantcheva, Abhijit Das

Main category: cs.CV

TL;DR: DenVisCoM: A hybrid Mamba-Transformer architecture for joint real-time optical flow and disparity estimation

Details

Motivation: Optical flow and disparity estimation are fundamentally related multi-view geometry and motion tasks that benefit from joint processing. Current approaches often struggle with balancing accuracy, real-time inference, and memory efficiency.

Method: Proposes DenVisCoM, a novel Mamba block combined with Transformer-based attention in a hybrid architecture. This unified framework jointly estimates optical flow and disparity by efficiently addressing real-time inference, memory footprint, and accuracy simultaneously.

Result: Extensive experiments show the model can accurately estimate optical flow and disparity in real time. The architecture achieves a favorable trade-off between accuracy and processing speed across multiple datasets.

Conclusion: The proposed DenVisCoM hybrid architecture successfully enables joint, accurate, real-time estimation of optical flow and disparity, demonstrating the effectiveness of combining Mamba blocks with Transformer attention for multi-view geometry tasks.

Abstract: In this work, we propose a novel Mamba block DenVisCoM, as well as a novel hybrid architecture specifically tailored for accurate and real-time estimation of optical flow and disparity estimation. Given that such multi-view geometry and motion tasks are fundamentally related, we propose a unified architecture to tackle them jointly. Specifically, the proposed hybrid architecture is based on DenVisCoM and a Transformer-based attention block that efficiently addresses real-time inference, memory footprint, and accuracy at the same time for joint estimation of motion and 3D dense perception tasks. We extensively analyze the benchmark trade-off of accuracy and real-time processing on a large number of datasets. Our experimental results and related analysis suggest that our proposed model can accurately estimate optical flow and disparity estimation in real time. All models and associated code are available at https://github.com/vimstereo/DenVisCoM.

[504] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li

Main category: cs.CV

TL;DR: A simple linear classifier on frozen Vision Foundation Model features outperforms specialized AIGI detectors in real-world scenarios, establishing new SOTA through emergent capabilities from large-scale pretraining with synthetic data.

Details

Motivation: Specialized AI-generated image detectors perform well on curated benchmarks but fail dramatically in real-world scenarios, highlighting the need for more robust detection methods that work in the wild.

Method: Train a simple linear classifier on frozen features from modern Vision Foundation Models (Perception Encoder, MetaCLIP 2, DINOv3) without architectural modifications, evaluating across traditional benchmarks, unseen generators, and challenging in-the-wild distributions.

Result: The simple baseline matches specialized detectors on standard benchmarks and decisively outperforms them on in-the-wild datasets by over 30% accuracy, establishing new state-of-the-art. Vision-Language Models learn explicit semantic concepts of forgery while Self-Supervised Learning models acquire implicit forensic features.

Conclusion: Foundation models’ emergent capabilities from massive pretraining data containing synthetic content enable superior real-world detection. A paradigm shift is needed from overfitting on static benchmarks to leveraging foundation models’ evolving world knowledge for reliable AI forensics.

Abstract: While specialized detectors for AI-Generated Images (AIGI) achieve near-perfect accuracy on curated benchmarks, they suffer from a dramatic performance collapse in realistic, in-the-wild scenarios. In this work, we demonstrate that simplicity prevails over complex architectural designs. A simple linear classifier trained on the frozen features of modern Vision Foundation Models , including Perception Encoder, MetaCLIP 2, and DINOv3, establishes a new state-of-the-art. Through a comprehensive evaluation spanning traditional benchmarks, unseen generators, and challenging in-the-wild distributions, we show that this baseline not only matches specialized detectors on standard benchmarks but also decisively outperforms them on in-the-wild datasets, boosting accuracy by striking margins of over 30%. We posit that this superior capability is an emergent property driven by the massive scale of pre-training data containing synthetic content. We trace the source of this capability to two distinct manifestations of data exposure: Vision-Language Models internalize an explicit semantic concept of forgery, while Self-Supervised Learning models implicitly acquire discriminative forensic features from the pretraining data. However, we also reveal persistent limitations: these models suffer from performance degradation under recapture and transmission, remain blind to VAE reconstruction and localized editing. We conclude by advocating for a paradigm shift in AI forensics, moving from overfitting on static benchmarks to harnessing the evolving world knowledge of foundation models for real-world reliability.

[505] Tail-Aware Post-Training Quantization for 3D Geometry Models

Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang, Jiawei Li, Bin Chen, Shu-Tao Xia, Zhi Wang

Main category: cs.CV

TL;DR: TAPTQ is a tail-aware post-training quantization pipeline specifically designed for 3D geometric learning models, addressing challenges in quantization for 3D vision transformers through progressive calibration, efficient search algorithms, and error compensation techniques.

Details

Motivation: 3D geometry models face deployment challenges on resource-constrained platforms. Existing PTQ methods optimized for 2D vision transformers fail for 3D models due to complex feature distributions and high calibration overhead.

Method: Three key contributions: (1) Progressive coarse-to-fine calibration construction for data-scarce 3D datasets, (2) Ternary-search-based solver for quantization interval search reducing complexity from O(N) to O(log N), (3) TRE-Guided Module-wise Compensation using Tail Relative Error metric to address quantization error accumulation from activation outliers.

Result: Extensive experiments on VGGT and Pi3 benchmarks show TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time.

Conclusion: TAPTQ provides an effective quantization solution for 3D geometric learning models, enabling efficient deployment on resource-constrained platforms with improved accuracy and reduced calibration overhead.

Abstract: The burgeoning complexity and scale of 3D geometry models pose significant challenges for deployment on resource-constrained platforms. While Post-Training Quantization (PTQ) enables efficient inference without retraining, conventional methods, primarily optimized for 2D Vision Transformers, fail to transfer effectively to 3D models due to intricate feature distributions and prohibitive calibration overhead. To address these challenges, we propose TAPTQ, a Tail-Aware Post-Training Quantization pipeline specifically engineered for 3D geometric learning. Our contribution is threefold: (1) To overcome the data-scale bottleneck in 3D datasets, we develop a progressive coarse-to-fine calibration construction strategy that constructs a highly compact subset to achieve both statistical purity and geometric representativeness. (2) We reformulate the quantization interval search as an optimization problem and introduce a ternary-search-based solver, reducing the computational complexity from $\mathcal{O}(N)$ to $\mathcal{O}(\log N)$ for accelerated deployment. (3) To mitigate quantization error accumulation, we propose TRE-Guided Module-wise Compensation, which utilizes a Tail Relative Error (TRE) metric to adaptively identify and rectify distortions in modules sensitive to long-tailed activation outliers. Extensive experiments on the VGGT and Pi3 benchmarks demonstrate that TAPTQ consistently outperforms state-of-the-art PTQ methods in accuracy while significantly reducing calibration time. The code will be released soon.

[506] ObjEmbed: Towards Universal Multimodal Object Embeddings

Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

Main category: cs.CV

TL;DR: ObjEmbed is a multimodal embedding model that decomposes images into regional object embeddings with semantic and spatial components, enabling fine-grained alignment between image regions and text phrases for visual grounding and retrieval tasks.

Details

Motivation: Current multimodal embedding models excel at global image-text alignment but struggle with fine-grained alignment between specific image regions and textual phrases, which is crucial for detailed visual understanding tasks.

Method: ObjEmbed decomposes input images into multiple regional embeddings (one per object) plus global embeddings. Each region gets two complementary embeddings: object embedding for semantic matching and IoU embedding for localization quality prediction. The final matching score combines semantic similarity with predicted IoU. All objects and the full image are encoded in a single forward pass.

Result: Superior performance on 18 diverse benchmarks demonstrates strong semantic discrimination capabilities for both region-level and image-level tasks.

Conclusion: ObjEmbed provides an effective solution for fine-grained vision-language alignment with object-oriented representations that capture both semantic and spatial aspects, enabling versatile and efficient visual understanding.

Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.

[507] Spot-Wise Smart Parking: An Edge-Enabled Architecture with YOLOv11 and Digital Twin Integration

Gustavo P. C. P. da Luz, Alvaro M. Aspilcueta Narvaez, Tiago Godoi Bannwart, Gabriel Massuyoshi Sato, Luis Fernando Gomez Gonzalez, Juliana Freitag Borin

Main category: cs.CV

TL;DR: Extends smart parking system from region-level to spot-level monitoring using distance-aware matching with spatial tolerance and adaptive bounding box partitioning, achieving 98.80% accuracy on edge devices.

Details

Motivation: Previous parking monitoring systems only estimated free spaces at region level, lacking spot-level insights needed for advanced applications. Need to overcome this limitation to provide detailed occupancy information.

Method: Spot-wise monitoring strategy using distance-aware matching method with spatial tolerance, enhanced by Adaptive Bounding Box Partitioning for challenging spaces. Uses YOLOv11m model (40.5MB) on edge devices. Introduces Digital Shadow for visual representation and application support server on repurposed TV box.

Result: Achieves 98.80% balanced accuracy with 8-second inference time on resource-constrained edge device. System enables detailed spot occupancy statistics and scalable communication between cloud services, parking totem, and bot.

Conclusion: Successfully extends parking monitoring to spot-level with high accuracy on edge devices, while promoting hardware reuse and sustainability through repurposed TV box server. Digital Shadow serves as foundation for future Digital Twin development.

Abstract: Smart parking systems help reduce congestion and minimize users’ search time, thereby contributing to smart city adoption and enhancing urban mobility. In previous works, we presented a system developed on a university campus to monitor parking availability by estimating the number of free spaces from vehicle counts within a region of interest. Although this approach achieved good accuracy, it restricted the system’s ability to provide spot-level insights and support more advanced applications. To overcome this limitation, we extend the system with a spot-wise monitoring strategy based on a distance-aware matching method with spatial tolerance, enhanced through an Adaptive Bounding Box Partitioning method for challenging spaces. The proposed approach achieves a balanced accuracy of 98.80% while maintaining an inference time of 8 seconds on a resource-constrained edge device, enhancing the capabilities of YOLOv11m, a model that has a size of 40.5 MB. In addition, two new components were introduced: (i) a Digital Shadow that visually represents parking lot entities as a base to evolve to a full Digital Twin, and (ii) an application support server based on a repurposed TV box. The latter not only enables scalable communication among cloud services, the parking totem, and a bot that provides detailed spot occupancy statistics, but also promotes hardware reuse as a step towards greater sustainability.

[508] Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, Weijia Li

Main category: cs.CV

TL;DR: Mind-Brush is a unified agentic framework that transforms text-to-image generation into a dynamic, knowledge-driven workflow using a ’think-research-create’ paradigm with active multimodal evidence retrieval and reasoning tools.

Details

Motivation: Current text-to-image models fail to grasp implicit user intentions and struggle with complex knowledge reasoning. They're constrained by static internal priors and can't adapt to evolving real-world dynamics.

Method: Introduces Mind-Brush, a unified agentic framework that simulates human-like ’think-research-create’ paradigm. It actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints.

Result: Mind-Brush significantly enhances unified models’ capabilities, achieving zero-to-one capability leap for Qwen-Image baseline on Mind-Bench benchmark, while achieving superior results on established benchmarks like WISE and RISE.

Conclusion: The framework bridges gaps in current generation models by making generation dynamic and knowledge-driven, enabling better comprehension of implicit intentions and adaptation to real-world dynamics.

Abstract: While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although emerging unified understanding-generation models have improved intent comprehension, they still struggle to accomplish tasks involving complex knowledge reasoning within a single model. Moreover, constrained by static internal priors, these models remain unable to adapt to the evolving dynamics of the real world. To bridge these gaps, we introduce Mind-Brush, a unified agentic framework that transforms generation into a dynamic, knowledge-driven workflow. Simulating a human-like ’think-research-create’ paradigm, Mind-Brush actively retrieves multimodal evidence to ground out-of-distribution concepts and employs reasoning tools to resolve implicit visual constraints. To rigorously evaluate these capabilities, we propose Mind-Bench, a comprehensive benchmark comprising 500 distinct samples spanning real-time news, emerging concepts, and domains such as mathematical and Geo-Reasoning. Extensive experiments demonstrate that Mind-Brush significantly enhances the capabilities of unified models, realizing a zero-to-one capability leap for the Qwen-Image baseline on Mind-Bench, while achieving superior results on established benchmarks like WISE and RISE.

[509] Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Yu Zeng, Wenxuan Huang, Zhen Fang, Shuang Chen, Yufan Shen, Yishuo Cai, Xiaoman Wang, Zhenfei Yin, Lin Chen, Zehui Chen, Shiting Huang, Yiming Zhao, Yao Hu, Philip Torr, Wanli Ouyang, Shaosheng Cao

Main category: cs.CV

TL;DR: VDR-Bench: A new benchmark for evaluating multimodal deep-research systems with 2,000 VQA instances designed to address limitations in current visual-textual search evaluation, plus a multi-round cropped-search workflow to improve visual retrieval.

Details

Motivation: Current benchmarks for multimodal deep-research systems have two major limitations: 1) they're not visual search-centric (answers can be leaked through textual cues or prior knowledge), and 2) they use overly idealized evaluation scenarios with near-exact image matching and insufficiently challenging text search.

Method: Constructed VDR-Bench with 2,000 VQA instances using a multi-stage curation pipeline and expert review. Also proposed a simple multi-round cropped-search workflow to enhance visual retrieval capabilities of MLLMs in realistic scenarios.

Result: The benchmark provides realistic evaluation conditions for vision-deepresearch systems, and the proposed multi-round cropped-search workflow effectively improves model performance in realistic visual retrieval scenarios.

Conclusion: The work addresses critical gaps in evaluating multimodal deep-research systems and provides practical guidance for future system design, with code to be released publicly.

Abstract: Multimodal Large Language Models (MLLMs) have advanced VQA and now support Vision-DeepResearch systems that use search engines for complex visual-textual fact-finding. However, evaluating these visual and textual search abilities is still difficult, and existing benchmarks have two major limitations. First, existing benchmarks are not visual search-centric: answers that should require visual search are often leaked through cross-textual cues in the text questions or can be inferred from the prior world knowledge in current MLLMs. Second, overly idealized evaluation scenario: On the image-search side, the required information can often be obtained via near-exact matching against the full image, while the text-search side is overly direct and insufficiently challenging. To address these issues, we construct the Vision-DeepResearch benchmark (VDR-Bench) comprising 2,000 VQA instances. All questions are created via a careful, multi-stage curation pipeline and rigorous expert review, designed to assess the behavior of Vision-DeepResearch systems under realistic real-world conditions. Moreover, to address the insufficient visual retrieval capabilities of current MLLMs, we propose a simple multi-round cropped-search workflow. This strategy is shown to effectively improve model performance in realistic visual retrieval scenarios. Overall, our results provide practical guidance for the design of future multimodal deep-research systems. The code will be released in https://github.com/Osilly/Vision-DeepResearch.

[510] MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

Main category: cs.CV

TL;DR: MagicFuse: A single-image fusion framework that generates comprehensive cross-spectral scene representations from single low-quality visible images using diffusion models, achieving performance comparable to multi-modal fusion methods.

Details

Motivation: Address the practical challenge of benefiting from multi-modal image fusion advantages when only visible imaging sensors are available under harsh conditions, extending data-level fusion to knowledge-level fusion.

Method: Proposes MagicFuse with three diffusion-based branches: 1) intra-spectral knowledge reinforcement branch to mine obscured scene information, 2) cross-spectral knowledge generation branch to learn thermal radiation patterns, and 3) multi-domain knowledge fusion branch that integrates probabilistic noise from both branches to obtain cross-spectral representations through successive sampling, with visual and semantic constraints.

Result: Extensive experiments show MagicFuse achieves visual and semantic representation performance comparable to or better than state-of-the-art fusion methods with multi-modal inputs, despite using only single degraded visible images.

Conclusion: MagicFuse successfully extends conventional data-level fusion to knowledge-level fusion, enabling comprehensive cross-spectral scene understanding from single visible images, which is valuable for practical applications where multi-modal sensors are unavailable.

Abstract: This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image.

Dennis Basile, Dennis Sprute, Helene Dörksen, Holger Flatt

Main category: cs.CV

TL;DR: Privacy-compliant person detection using MEMS-LiDAR with hybrid real/synthetic data augmentation improves detection accuracy while reducing annotation effort.

Details

Motivation: Need for reliable unauthorized person detection in industrial spaces while addressing privacy concerns (GDPR compliance) and overcoming limitations of vision-based methods (lighting sensitivity, data annotation burden).

Method: Uses MEMS-LiDAR for anonymized 3D point clouds, augments real LiDAR data with synthetic scenes from CARLA simulation framework to create hybrid training dataset.

Result: Hybrid data improves average precision by 44 percentage points compared to real-data-only model while reducing manual annotation effort by 50%.

Conclusion: Proposed approach provides scalable, cost-efficient alternative that combines high performance in person detection with GDPR compliance in industrial environments.

Abstract: The reliable detection of unauthorized individuals in safety-critical industrial indoor spaces is crucial to avoid plant shutdowns, property damage, and personal hazards. Conventional vision-based methods that use deep-learning approaches for person recognition provide image information but are sensitive to lighting and visibility conditions and often violate privacy regulations, such as the General Data Protection Regulation (GDPR) in the European Union. Typically, detection systems based on deep learning require annotated data for training. Collecting and annotating such data, however, is highly time-consuming and due to manual treatments not necessarily error free. Therefore, this paper presents a privacy-compliant approach based on Micro-Electro-Mechanical Systems LiDAR (MEMS-LiDAR), which exclusively captures anonymized 3D point clouds and avoids personal identification features. To compensate for the large amount of time required to record real LiDAR data and for post-processing and annotation, real recordings are augmented with synthetically generated scenes from the CARLA simulation framework. The results demonstrate that the hybrid data improves the average precision by 44 percentage points compared to a model trained exclusively with real data while reducing the manual annotation effort by 50 %. Thus, the proposed approach provides a scalable, cost-efficient alternative to purely real-data-based methods and systematically shows how synthetic LiDAR data can combine high performance in person detection with GDPR compliance in an industrial environment.

[512] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models

Shicheng Yin, Kaixuan Yin, Weixing Chen, Yang Liu, Guanbin Li, Liang Lin

Main category: cs.CV

TL;DR: DDP-WM is an efficient world model for robotics that disentangles primary dynamics from background updates, achieving 9x speedup and improved planning performance.

Details

Motivation: Existing dense Transformer-based world models have high computational overhead that hinders real-time deployment in robotics. There's a need for more efficient models that maintain high performance for autonomous planning.

Method: Proposes Disentangled Dynamics Prediction (DDP) that decomposes latent state evolution into sparse primary dynamics (physical interactions) and secondary context-driven background updates. Uses efficient historical processing with dynamic localization to isolate primary dynamics and cross-attention for background updates.

Result: Achieves approximately 9x inference speedup on Push-T task and improves MPC success rate from 90% to 98% compared to state-of-the-art dense models. Shows strong performance across navigation, tabletop manipulation, and complex deformable/multi-body interactions.

Conclusion: DDP-WM establishes a promising path for developing efficient, high-fidelity world models for robotics by optimizing resource allocation and providing a smooth optimization landscape for planners.

Abstract: World models are essential for autonomous robotic planning. However, the substantial computational overhead of existing dense Transformerbased models significantly hinders real-time deployment. To address this efficiency-performance bottleneck, we introduce DDP-WM, a novel world model centered on the principle of Disentangled Dynamics Prediction (DDP). We hypothesize that latent state evolution in observed scenes is heterogeneous and can be decomposed into sparse primary dynamics driven by physical interactions and secondary context-driven background updates. DDP-WM realizes this decomposition through an architecture that integrates efficient historical processing with dynamic localization to isolate primary dynamics. By employing a crossattention mechanism for background updates, the framework optimizes resource allocation and provides a smooth optimization landscape for planners. Extensive experiments demonstrate that DDP-WM achieves significant efficiency and performance across diverse tasks, including navigation, precise tabletop manipulation, and complex deformable or multi-body interactions. Specifically, on the challenging Push-T task, DDP-WM achieves an approximately 9 times inference speedup and improves the MPC success rate from 90% to98% compared to state-of-the-art dense models. The results establish a promising path for developing efficient, high-fidelity world models. Codes will be available at https://github.com/HCPLabSYSU/DDP-WM.

[513] Automated Discontinuity Set Characterisation in Enclosed Rock Face Point Clouds Using Single-Shot Filtering and Cyclic Orientation Transformation

Dibyayan Patra, Pasindu Ranasinghe, Bikram Banerjee, Simit Raval

Main category: cs.CV

TL;DR: A novel approach for automatic discontinuity set characterization in rock faces using single-shot filtering, cyclic orientation transformation, and hierarchical clustering.

Details

Motivation: Characterizing structural discontinuity sets in rock faces is crucial for assessing rock-mass stability and excavation safety in underground mines. While UAV and mobile laser scanning provide point cloud data, robust automatic characterization methods for real-world scenarios like enclosed cavities remain an open problem.

Method: Proposes a three-step approach: 1) Single-shot filtering using signal processing to isolate planar regions while suppressing noise and high-curvature artifacts, 2) Cyclic orientation transformation scheme to accurately represent dip angle and dip direction in Cartesian space, 3) Hierarchical clustering technique to characterize transformed orientations into sets without requiring user-defined cluster numbers.

Result: The method outperforms existing automated structure mapping techniques with lowest mean absolute errors: 1.95° in nominal dip angle and 2.20° in dip direction, with dispersion errors below 3°. Validated on real-world mine stope data against ground truth from manually handpicked discontinuity planes.

Conclusion: The proposed approach provides an accurate and robust solution for automatic discontinuity set characterization in underground mine cavities, addressing limitations of existing methods and demonstrating superior performance in real-world scenarios.

Abstract: Characterisation of structural discontinuity sets in exposed rock faces of underground mine cavities is essential for assessing rock-mass stability, excavation safety, and operational efficiency. UAV and other mobile laser-scanning techniques provide efficient means of collecting point clouds from rock faces. However, the development of a robust and efficient approach for automatic characterisation of discontinuity sets in real-world scenarios, like fully enclosed rock faces in cavities, remains an open research problem. In this study, a new approach is proposed for automatic discontinuity set characterisation that uses a single-shot filtering strategy, an innovative cyclic orientation transformation scheme and a hierarchical clustering technique. The single-shot filtering step isolates planar regions while robustly suppressing noise and high-curvature artefacts in one pass using a signal-processing technique. To address the limitations of Cartesian clustering on polar orientation data, a cyclic orientation transformation scheme is developed, enabling accurate representation of dip angle and dip direction in Cartesian space. The transformed orientations are then characterised into sets using a hierarchical clustering technique, which handles varying density distributions and identifies clusters without requiring user-defined set numbers. The accuracy of the method is validated on real-world mine stope and against ground truth obtained using manually handpicked discontinuity planes identified with the Virtual Compass tool, as well as widely used automated structure mapping techniques. The proposed approach outperforms the other techniques by exhibiting the lowest mean absolute error in estimating discontinuity set orientations in real-world stope data with errors of 1.95° and 2.20° in nominal dip angle and dip direction, respectively, and dispersion errors lying below 3°.

[514] Spatio-Temporal Transformers for Long-Term NDVI Forecasting

Ido Faran, Nathan S. Netanyahu, Maxim Shoshany

Main category: cs.CV

TL;DR: STT-LTF is a spatio-temporal transformer framework for long-term satellite image forecasting that integrates spatial context modeling with temporal sequence prediction for heterogeneous Mediterranean landscapes.

Details

Motivation: Address challenges in long-term satellite image time series analysis in heterogeneous landscapes, particularly Mediterranean regions with complex spatial patterns, seasonal variations, and multi-decade environmental changes across different scales.

Method: Extended transformer framework processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through unified architecture, using self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies from 40 years of unlabeled Landsat imagery.

Result: Achieves MAE of 0.0328 and R^2 of 0.8412 for next-year predictions on Landsat data (1984-2024), outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers.

Conclusion: STT-LTF effectively handles irregular temporal sampling and variable prediction horizons, making it suitable for heterogeneous landscapes experiencing rapid ecological transitions, with direct prediction of arbitrary future time points without error accumulation.

Abstract: Long-term satellite image time series (SITS) analysis in heterogeneous landscapes faces significant challenges, particularly in Mediterranean regions where complex spatial patterns, seasonal variations, and multi-decade environmental changes interact across different scales. This paper presents the Spatio-Temporal Transformer for Long Term Forecasting (STT-LTF ), an extended framework that advances beyond purely temporal analysis to integrate spatial context modeling with temporal sequence prediction. STT-LTF processes multi-scale spatial patches alongside temporal sequences (up to 20 years) through a unified transformer architecture, capturing both local neighborhood relationships and regional climate influences. The framework employs comprehensive self-supervised learning with spatial masking, temporal masking, and horizon sampling strategies, enabling robust model training from 40 years of unlabeled Landsat imagery. Unlike autoregressive approaches, STT-LTF directly predicts arbitrary future time points without error accumulation, incorporating spatial patch embeddings, cyclical temporal encoding, and geographic coordinates to learn complex dependencies across heterogeneous Mediterranean ecosystems. Experimental evaluation on Landsat data (1984-2024) demonstrates that STT-LTF achieves a Mean Absolute Error (MAE) of 0.0328 and R^2 of 0.8412 for next-year predictions, outperforming traditional statistical methods, CNN-based approaches, LSTM networks, and standard transformers. The framework’s ability to handle irregular temporal sampling and variable prediction horizons makes it particularly suitable for analysis of heterogeneous landscapes experiencing rapid ecological transitions.

[515] Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik, Rami Ben-Ari

Main category: cs.CV

TL;DR: Training-free attention optimization framework for autoregressive video diffusion models that reduces KV cache growth and computational redundancy through temporal compression and ANN-based token selection.

Details

Motivation: Autoregressive video diffusion models suffer from growing KV cache during inference, causing increasing latency, GPU memory usage, and limiting temporal context, which harms long-range consistency in video generation.

Method: Three training-free modules: TempCache compresses KV cache via temporal correspondence; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using ANN matching; AnnSA sparsifies self-attention by restricting queries to semantically matched keys using lightweight ANN.

Result: Achieves 5-10x end-to-end speedups while preserving visual quality, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, unlike prior methods that progressively slow down.

Conclusion: The proposed attention optimization framework effectively addresses KV cache growth in autoregressive video diffusion, enabling efficient long-form video synthesis with consistent quality and stable resource usage.

Abstract: Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate up to x5–x10 end-to-end speedups while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

[516] FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing

Menglin Han, Zhangkai Ni

Main category: cs.CV

TL;DR: FlowBypass: A training-free image editing framework using Rectified Flow to create bypass trajectories between inversion and reconstruction, improving prompt alignment while preserving fidelity.

Details

Motivation: Existing training-free image editing methods rely on inversion-reconstruction trajectories that face a trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with edit prompts. Previous solutions use backbone-specific feature manipulations that lack general applicability.

Method: FlowBypass uses Rectified Flow to construct a bypass directly connecting inversion and reconstruction trajectories, mitigating error accumulation without feature manipulations. The authors provide formal derivation of two trajectories, obtain approximate bypass formulation and numerical solution for seamless trajectory transitions.

Result: Extensive experiments show FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.

Conclusion: FlowBypass offers a novel analytical framework for training-free image editing that addresses the fundamental trade-off in inversion-reconstruction approaches, providing better prompt alignment and fidelity preservation through bypass trajectory construction.

Abstract: Training-free image editing has attracted increasing attention for its efficiency and independence from training data. However, existing approaches predominantly rely on inversion-reconstruction trajectories, which impose an inherent trade-off: longer trajectories accumulate errors and compromise fidelity, while shorter ones fail to ensure sufficient alignment with the edit prompt. Previous attempts to address this issue typically employ backbone-specific feature manipulations, limiting general applicability. To address these challenges, we propose FlowBypass, a novel and analytical framework grounded in Rectified Flow that constructs a bypass directly connecting inversion and reconstruction trajectories, thereby mitigating error accumulation without relying on feature manipulations. We provide a formal derivation of two trajectories, from which we obtain an approximate bypass formulation and its numerical solution, enabling seamless trajectory transitions. Extensive experiments demonstrate that FlowBypass consistently outperforms state-of-the-art image editing methods, achieving stronger prompt alignment while preserving high-fidelity details in irrelevant regions.

[517] LDRNet: Large Deformation Registration Model for Chest CT Registration

Cheng Wang, Qiyu Gao, Fandong Zhang, Shu Zhang, Yizhou Yu

Main category: cs.CV

TL;DR: LDRNet: Fast unsupervised deep learning method for large deformation chest CT registration using coarse-to-fine refinement with innovative refine and rigid blocks.

Details

Motivation: Most deep learning medical image registration focuses on brain images, but chest CT registration presents greater challenges including larger deformations, complex backgrounds, and region overlap, requiring specialized methods.

Method: Proposes LDRNet with coarse-to-fine refinement: first predicts coarse resolution registration field, then refines it using two novel components: 1) refine block for multi-resolution field refinement, 2) rigid block to learn transformation matrix from high-level features.

Result: Achieves state-of-the-art performance for large deformation image registration on private and public SegTHOR datasets, outperforming traditional methods and deep learning models (VoxelMorph, RCN, LapIRN) while being much faster.

Conclusion: LDRNet effectively addresses the unique challenges of chest CT registration through its coarse-to-fine architecture with specialized refinement components, offering superior performance and speed for large deformation medical image registration.

Abstract: Most of the deep learning based medical image registration algorithms focus on brain image registration tasks.Compared with brain registration, the chest CT registration has larger deformation, more complex background and region over-lap. In this paper, we propose a fast unsupervised deep learning method, LDRNet, for large deformation image registration of chest CT images. We first predict a coarse resolution registration field, then refine it from coarse to fine. We propose two innovative technical components: 1) a refine block that is used to refine the registration field in different resolutions, 2) a rigid block that is used to learn transformation matrix from high-level features. We train and evaluate our model on the private dataset and public dataset SegTHOR. We compare our performance with state-of-the-art traditional registration methods as well as deep learning registration models VoxelMorph, RCN, and LapIRN. The results demonstrate that our model achieves state-of-the-art performance for large deformation images registration and is much faster.

[518] GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation

Xiao Liang, Yunzhu Zhang, Linchao Zhu

Main category: cs.CV

TL;DR: GPD is a distillation framework that accelerates video diffusion models by reducing sampling steps from 48 to 6 while maintaining quality, using progressive teacher guidance and frequency-domain constraints.

Details

Motivation: Diffusion models have high computational costs for video generation due to many denoising steps. Existing acceleration methods often degrade video quality significantly, creating a need for efficient yet high-quality video generation.

Method: Guided Progressive Distillation (GPD) uses a teacher model to progressively guide a student model to operate with larger step sizes. It features: 1) online-generated training targets to reduce optimization difficulty, and 2) frequency-domain constraints in latent space to preserve fine details and temporal dynamics.

Result: Applied to Wan2.1 model, GPD reduces sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Outperforms existing distillation methods in both pipeline simplicity and quality preservation.

Conclusion: GPD provides an effective framework for accelerating video diffusion models with minimal quality degradation, addressing the computational bottleneck in video generation.

Abstract: Diffusion models have achieved remarkable success in video generation; however, the high computational cost of the denoising process remains a major bottleneck. Existing approaches have shown promise in reducing the number of diffusion steps, but they often suffer from significant quality degradation when applied to video generation. We propose Guided Progressive Distillation (GPD), a framework that accelerates the diffusion process for fast and high-quality video generation. GPD introduces a novel training strategy in which a teacher model progressively guides a student model to operate with larger step sizes. The framework consists of two key components: (1) an online-generated training target that reduces optimization difficulty while improving computational efficiency, and (2) frequency-domain constraints in the latent space that promote the preservation of fine-grained details and temporal dynamics. Applied to the Wan2.1 model, GPD reduces the number of sampling steps from 48 to 6 while maintaining competitive visual quality on VBench. Compared with existing distillation methods, GPD demonstrates clear advantages in both pipeline simplicity and quality preservation.

[519] Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies

Wenjin Hou, Wei Liu, Han Hu, Xiaoxiao Sun, Serena Yeung-Levy, Hehe Fan

Main category: cs.CV

TL;DR: VIA-Bench is a challenging benchmark for testing MLLM robustness on visual illusions and anomalies, revealing significant vulnerabilities in current models despite their strong performance on standard benchmarks.

Details

Motivation: Current MLLM evaluations focus on standard in-distribution data, leaving robustness to visual illusions and anomalies largely unexamined. The authors aim to address this gap by creating a benchmark that probes model performance on scenarios that defy common-sense priors.

Method: The authors introduce VIA-Bench with six core categories: color illusions, motion illusions, gestalt illusions, geometric/spatial illusions, general visual illusions, and visual anomalies. They construct over 1K high-quality question-answer pairs through careful human-in-the-loop review and evaluate over 20 state-of-the-art MLLMs.

Result: Extensive evaluation reveals significant vulnerabilities in MLLMs. Chain-of-Thought reasoning offers negligible robustness, often yielding “brittle mirages” where model logic collapses under illusory stimuli. The findings show a fundamental divergence between machine and human perception.

Conclusion: Resolving perceptual bottlenecks related to visual illusions and anomalies is critical for advancing artificial general intelligence. The benchmark exposes important limitations in current MLLMs’ visual reasoning capabilities.

Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable proficiency on general-purpose vision-language benchmarks, reaching or even exceeding human-level performance. However, these evaluations typically rely on standard in-distribution data, leaving the robustness of MLLMs largely unexamined when faced with scenarios that defy common-sense priors. To address this gap, we introduce VIA-Bench, a challenging benchmark designed to probe model performance on visual illusions and anomalies. It includes six core categories: color illusions, motion illusions, gestalt illusions, geometric and spatial illusions, general visual illusions, and visual anomalies. Through careful human-in-the-loop review, we construct over 1K high-quality question-answer pairs that require nuanced visual reasoning. Extensive evaluation of over 20 state-of-the-art MLLMs, including proprietary, open-source, and reasoning-enhanced models, uncovers significant vulnerabilities. Notably, we find that Chain-of-Thought (CoT) reasoning offers negligible robustness, often yielding ``brittle mirages’’ where the model’s logic collapses under illusory stimuli. Our findings reveal a fundamental divergence between machine and human perception, suggesting that resolving such perceptual bottlenecks is critical for the advancement of artificial general intelligence. The benchmark data and code will be released.

[520] Efficient Cross-Country Data Acquisition Strategy for ADAS via Street-View Imagery

Yin Wu, Daniel Slieter, Carl Esselborn, Ahmed Abouelazm, Tsung Yuan Tseng, J. Marius Zöllner

Main category: cs.CV

TL;DR: Street-view-guided data acquisition strategy using public imagery to identify places of interest for cross-country ADAS/ADS adaptation, reducing data collection needs by 50% while maintaining performance.

Details

Motivation: Cross-country deployment of ADAS/ADS faces challenges due to domain shifts from differences in legislation, traffic infrastructure, and visual conventions. Traditional data collection through extensive on-road driving is costly and inefficient for identifying representative locations across countries.

Method: Proposes a street-view-guided data acquisition strategy using publicly available imagery to identify places of interest (POI). Introduces two POI scoring methods: 1) KNN-based feature distance approach using a vision foundation model, and 2) visual-attribution approach using a vision-language model. Uses collect-detect protocol and constructs co-located dataset pairing Zenseact Open Dataset with Mapillary street-view images.

Result: Experiments on traffic sign detection (sensitive to cross-country variations) show approach achieves performance comparable to random sampling while using only half of the target-domain data. Cost estimations demonstrate large-scale street-view processing remains economically feasible for full-country analysis.

Conclusion: Street-view-guided data acquisition offers efficient and cost-effective cross-country model adaptation for ADAS/ADS systems, reducing data collection requirements while maintaining performance through intelligent POI identification using public imagery.

Abstract: Deploying ADAS and ADS across countries remains challenging due to differences in legislation, traffic infrastructure, and visual conventions, which introduce domain shifts that degrade perception performance. Traditional cross-country data collection relies on extensive on-road driving, making it costly and inefficient to identify representative locations. To address this, we propose a street-view-guided data acquisition strategy that leverages publicly available imagery to identify places of interest (POI). Two POI scoring methods are introduced: a KNN-based feature distance approach using a vision foundation model, and a visual-attribution approach using a vision-language model. To enable repeatable evaluation, we adopt a collect-detect protocol and construct a co-located dataset by pairing the Zenseact Open Dataset with Mapillary street-view images. Experiments on traffic sign detection, a task particularly sensitive to cross-country variations in sign appearance, show that our approach achieves performance comparable to random sampling while using only half of the target-domain data. We further provide cost estimations for full-country analysis, demonstrating that large-scale street-view processing remains economically feasible. These results highlight the potential of street-view-guided data acquisition for efficient and cost-effective cross-country model adaptation.

[521] SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection

Qian Xu, Xi Li, Fei Gao, Jie Guo, Haojuan Yuan, Shuaipeng Fan, Mingjin Zhang

Main category: cs.CV

TL;DR: SPIRIT is a unified framework that adapts vision foundation models to infrared small target detection using physics-informed plug-ins for spatial feature refinement and temporal memory attention with spatial priors.

Details

Motivation: Infrared small target detection faces challenges due to weak radiometric signals, limited semantic cues, and modality gaps between infrared and visible-spectrum imagery. Direct use of vision foundation models fails because hierarchical feature aggregation submerges target peaks and appearance-only memory attention leads to spurious clutter associations.

Method: SPIRIT adapts VFMs via lightweight physics-informed plug-ins: 1) PIFR spatially refines features using rank-sparsity decomposition to suppress background and enhance target signals, 2) PGMA temporally injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association.

Result: Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and state-of-the-art performance in both single-frame and video-mode detection.

Conclusion: SPIRIT successfully bridges the modality gap between infrared and visible-spectrum data by incorporating physics-informed priors into vision foundation models, enabling robust infrared small target detection while maintaining compatibility with existing VFM architectures.

Abstract: Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.

[522] CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions

Yuliang Zhan, Jian Li, Wenbing Huang, Wenbing Huang, Yang Liu, Hao Sun

Main category: cs.CV

TL;DR: CloDS: An unsupervised framework for learning cloth dynamics from multi-view visual observations without requiring known physical properties as supervision.

Details

Motivation: Existing deep learning methods for simulating dynamic systems require known physical properties as supervision, limiting applicability under unknown conditions. The paper introduces Cloth Dynamics Grounding (CDG) as a novel scenario for unsupervised learning of cloth dynamics from visual observations.

Method: Cloth Dynamics Splatting (CloDS) uses a three-stage pipeline: 1) video-to-geometry grounding, 2) training dynamics model on grounded meshes. For handling large deformations and self-occlusions, introduces dual-position opacity modulation for bidirectional 2D-3D mapping via mesh-based Gaussian splatting.

Result: Comprehensive experiments show CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations.

Conclusion: CloDS enables unsupervised learning of cloth dynamics from visual observations, addressing limitations of existing methods that require physical supervision.

Abstract: Deep learning has demonstrated remarkable capabilities in simulating complex dynamic systems. However, existing methods require known physical properties as supervision or inputs, limiting their applicability under unknown conditions. To explore this challenge, we introduce Cloth Dynamics Grounding (CDG), a novel scenario for unsupervised learning of cloth dynamics from multi-view visual observations. We further propose Cloth Dynamics Splatting (CloDS), an unsupervised dynamic learning framework designed for CDG. CloDS adopts a three-stage pipeline that first performs video-to-geometry grounding and then trains a dynamics model on the grounded meshes. To cope with large non-linear deformations and severe self-occlusions during grounding, we introduce a dual-position opacity modulation that supports bidirectional mapping between 2D observations and 3D geometry via mesh-based Gaussian splatting in video-to-geometry grounding stage. It jointly considers the absolute and relative position of Gaussian components. Comprehensive experimental evaluations demonstrate that CloDS effectively learns cloth dynamics from visual data while maintaining strong generalization capabilities for unseen configurations. Our code is available at https://github.com/whynot-zyl/CloDS. Visualization results are available at https://github.com/whynot-zyl/CloDS_video}.%\footnote{As in this example.

[523] WS-IMUBench: Can Weakly Supervised Methods from Audio, Image, and Video Be Adapted for IMU-based Temporal Action Localization?

Pei Li, Jiaxi Yin, Lei Ouyang, Shihan Pan, Ge Wang, Han Ding, Fei Wang

Main category: cs.CV

TL;DR: WS-IMUBench is a benchmark study evaluating weakly supervised temporal action localization for IMU data using only sequence-level labels, comparing transfer of methods from audio/image/video domains to IMU-TAL.

Details

Motivation: Current IMU temporal action localization requires costly dense frame-level annotations, creating a scalability bottleneck. The paper addresses this by exploring weakly supervised approaches that only need sequence-level labels.

Method: Systematic benchmark of 7 weakly supervised methods from audio/image/video domains on 7 public IMU datasets, with over 3,540 training runs and 7,080 evaluations. Focuses on transferability, effectiveness, and insights through three research questions.

Result: Temporal-domain methods transfer better than image-derived approaches; weak supervision can be competitive on datasets with longer actions and higher-dimensional sensing; failure modes include short actions, temporal ambiguity, and poor proposal quality.

Conclusion: Establishes WS-IMUBench as a reproducible benchmarking framework and outlines directions for advancing WS-IMU-TAL, including IMU-specific proposal generation and boundary-aware objectives.

Abstract: IMU-based Human Activity Recognition (HAR) has enabled a wide range of ubiquitous computing applications, yet its dominant clip classification paradigm cannot capture the rich temporal structure of real-world behaviors. This motivates a shift toward IMU Temporal Action Localization (IMU-TAL), which predicts both action categories and their start/end times in continuous streams. However, current progress is strongly bottlenecked by the need for dense, frame-level boundary annotations, which are costly and difficult to scale. To address this bottleneck, we introduce WS-IMUBench, a systematic benchmark study of weakly supervised IMU-TAL (WS-IMU-TAL) under only sequence-level labels. Rather than proposing a new localization algorithm, we evaluate how well established weakly supervised localization paradigms from audio, image, and video transfer to IMU-TAL under only sequence-level labels. We benchmark seven representative weakly supervised methods on seven public IMU datasets, resulting in over 3,540 model training runs and 7,080 inference evaluations. Guided by three research questions on transferability, effectiveness, and insights, our findings show that (i) transfer is modality-dependent, with temporal-domain methods generally more stable than image-derived proposal-based approaches; (ii) weak supervision can be competitive on favorable datasets (e.g., with longer actions and higher-dimensional sensing); and (iii) dominant failure modes arise from short actions, temporal ambiguity, and proposal quality. Finally, we outline concrete directions for advancing WS-IMU-TAL (e.g., IMU-specific proposal generation, boundary-aware objectives, and stronger temporal reasoning). Beyond individual results, WS-IMUBench establishes a reproducible benchmarking template, datasets, protocols, and analyses, to accelerate community-wide progress toward scalable WS-IMU-TAL.

[524] How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan

Main category: cs.CV

TL;DR: VIBE is a visual instruction benchmark for image editing that evaluates models on multimodal (sketch-based) instructions across three complexity levels, revealing proprietary models outperform open-source ones but all struggle with harder tasks.

Details

Motivation: Current image editing systems are primarily text-guided, but human communication is multimodal where visual instructions like sketches efficiently convey spatial and structural intent. There's a gap in benchmarks for visual instruction following in image editing.

Method: Introduces VIBE benchmark with three-level interaction hierarchy: deictic grounding, morphological manipulation, and causal reasoning. Curates diverse test cases with increasing complexity. Proposes LMM-as-a-judge evaluation framework with task-specific metrics for scalable assessment.

Result: Evaluated 17 open-source and proprietary image editing models. Found proprietary models show early-stage visual instruction-following capabilities and consistently outperform open-source models. Performance degrades significantly with increasing task difficulty even for strongest systems.

Conclusion: Visual instruction following remains challenging, especially for complex tasks. Proprietary models lead but have limitations. The benchmark highlights promising research directions for improving multimodal instruction understanding in image editing.

Abstract: Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

[525] Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection

A S M Sharifuzzaman Sagar, Mohammed Bennamoun, Farid Boussaid, Naeha Sharif, Lian Xu, Shaaban Sahmoud, Ali Kishk

Main category: cs.CV

TL;DR: Deepfake detectors focusing on pixel-level manipulations provide limited value for multimodal misinformation detection and can actually harm fact-checking performance when integrated into evidence-based systems.

Details

Motivation: Current deepfake detectors focus on pixel-level manipulations but ignore semantic meaning in image-text pairs, raising questions about their usefulness in multimodal misinformation detection and fact-checking pipelines.

Method: Systematic analysis using two benchmarks (MMFakeBench and DGM4) evaluating: 1) image-only deepfake detectors, 2) evidence-driven fact-checking system with tool-guided retrieval (MCTS) and deliberative inference (Multi-Agent Debate), and 3) hybrid systems combining detectors with evidence-based approaches.

Result: Deepfake detectors showed limited standalone performance (F1: 0.26-0.53 on MMFakeBench, 0.33-0.49 on DGM4). When integrated into fact-checking pipelines, they reduced performance by 0.04-0.08 F1. Evidence-centric system achieved best results (F1: ~0.81 on MMFakeBench, 0.55 on DGM4).

Conclusion: Multimodal claim verification is primarily driven by semantic understanding and external evidence, not pixel-level artifact signals. Deepfake detectors introduce misleading authenticity priors that undermine evidence-based reasoning for real-world image-text misinformation.

Abstract: In multimodal misinformation, deception usually arises not just from pixel-level manipulations in an image, but from the semantic and contextual claim jointly expressed by the image-text pair. Yet most deepfake detectors, engineered to detect pixel-level forgeries, do not account for claim-level meaning, despite their growing integration in automated fact-checking (AFC) pipelines. This raises a central scientific and practical question: Do pixel-level detectors contribute useful signal for verifying image-text claims, or do they instead introduce misleading authenticity priors that undermine evidence-based reasoning? We provide the first systematic analysis of deepfake detectors in the context of multimodal misinformation detection. Using two complementary benchmarks, MMFakeBench and DGM4, we evaluate: (1) state-of-the-art image-only deepfake detectors, (2) an evidence-driven fact-checking system that performs tool-guided retrieval via Monte Carlo Tree Search (MCTS) and engages in deliberative inference through Multi-Agent Debate (MAD), and (3) a hybrid fact-checking system that injects detector outputs as auxiliary evidence. Results across both benchmark datasets show that deepfake detectors offer limited standalone value, achieving F1 scores in the range of 0.26-0.53 on MMFakeBench and 0.33-0.49 on DGM4, and that incorporating their predictions into fact-checking pipelines consistently reduces performance by 0.04-0.08 F1 due to non-causal authenticity assumptions. In contrast, the evidence-centric fact-checking system achieves the highest performance, reaching F1 scores of approximately 0.81 on MMFakeBench and 0.55 on DGM4. Overall, our findings demonstrate that multimodal claim verification is driven primarily by semantic understanding and external evidence, and that pixel-level artifact signals do not reliably enhance reasoning over real-world image-text misinformation.

[526] Trust but Verify: Adaptive Conditioning for Reference-Based Diffusion Super-Resolution via Implicit Reference Correlation Modeling

Yuan Wang, Yuhao Wan, Siming Zheng, Bo Li, Qibin Hou, Peng-Tao Jiang

Main category: cs.CV

TL;DR: Ada-RefSR: A diffusion-based reference super-resolution framework with adaptive gating that selectively uses reference information based on reliability, preventing hallucinations from misleading references.

Details

Motivation: Real-world degradations make correspondences between low-quality inputs and reference images unreliable in reference-based super-resolution. Existing methods either ignore correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues.

Method: Proposes Ada-RefSR with Adaptive Implicit Correlation Gating (AICG) that uses learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into attention backbone for lightweight, adaptive regulation of reference guidance.

Result: Experiments on multiple datasets show Ada-RefSR achieves strong balance of fidelity, naturalness, and efficiency while remaining robust under varying reference alignment.

Conclusion: The “Trust but Verify” principle with adaptive gating effectively addresses reference reliability issues in diffusion-based image restoration, providing robust performance against misleading references.

Abstract: Recent works have explored reference-based super-resolution (RefSR) to mitigate hallucinations in diffusion-based image restoration. A key challenge is that real-world degradations make correspondences between low-quality (LQ) inputs and reference (Ref) images unreliable, requiring adaptive control of reference usage. Existing methods either ignore LQ-Ref correlations or rely on brittle explicit matching, leading to over-reliance on misleading references or under-utilization of valuable cues. To address this, we propose Ada-RefSR, a single-step diffusion framework guided by a “Trust but Verify” principle: reference information is leveraged when reliable and suppressed otherwise. Its core component, Adaptive Implicit Correlation Gating (AICG), employs learnable summary tokens to distill dominant reference patterns and capture implicit correlations with LQ features. Integrated into the attention backbone, AICG provides lightweight, adaptive regulation of reference guidance, serving as a built-in safeguard against erroneous fusion. Experiments on multiple datasets demonstrate that Ada-RefSR achieves a strong balance of fidelity, naturalness, and efficiency, while remaining robust under varying reference alignment.

[527] ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding

Ye Chen, Yupeng Zhu, Xiongzhen Zhang, Zhewen Wan, Yingzhe Li, Wenjun Zhang, Bingbing Ni

Main category: cs.CV

TL;DR: Hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes for efficient and controllable image/video editing with physics-driven animation support.

Details

Motivation: Existing image representations (raster images, Gaussian primitives, latent images) suffer from representation redundancy or lack direct mapping from latent variables to semantic instances/parts, hindering efficient and controllable image/video editing.

Method: Hierarchical proxy-based parametric representation that: 1) performs semantic-aware decomposition of input images, 2) constructs hierarchical proxy geometries via adaptive Bezier fitting and iterative internal region subdivision/meshing, 3) embeds multi-scale implicit texture parameters into geometry-aware distributed proxy nodes, and 4) uses locality-adaptive feature indexing for spatial texture coherence.

Result: Achieves state-of-the-art rendering fidelity with significantly fewer parameters on ImageNet, OIR-Bench, and HumanEdit benchmarks; enables intuitive, interactive, physically plausible manipulation; supports real-time physics-driven animation with Position-Based Dynamics and lightweight implicit rendering.

Conclusion: Proposed representation enables efficient, controllable image/video editing with semantic disentanglement, high reconstruction fidelity, and physics-driven animation capabilities superior to generative approaches.

Abstract: Prevailing image representation methods, including explicit representations such as raster images and Gaussian primitives, as well as implicit representations such as latent images, either suffer from representation redundancy that leads to heavy manual editing effort, or lack a direct mapping from latent variables to semantic instances or parts, making fine-grained manipulation difficult. These limitations hinder efficient and controllable image and video editing. To address these issues, we propose a hierarchical proxy-based parametric image representation that disentangles semantic, geometric, and textural attributes into independent and manipulable parameter spaces. Based on a semantic-aware decomposition of the input image, our representation constructs hierarchical proxy geometries through adaptive Bezier fitting and iterative internal region subdivision and meshing. Multi-scale implicit texture parameters are embedded into the resulting geometry-aware distributed proxy nodes, enabling continuous high-fidelity reconstruction in the pixel domain and instance- or part-independent semantic editing. In addition, we introduce a locality-adaptive feature indexing mechanism to ensure spatial texture coherence, which further supports high-quality background completion without relying on generative models. Extensive experiments on image reconstruction and editing benchmarks, including ImageNet, OIR-Bench, and HumanEdit, demonstrate that our method achieves state-of-the-art rendering fidelity with significantly fewer parameters, while enabling intuitive, interactive, and physically plausible manipulation. Moreover, by integrating proxy nodes with Position-Based Dynamics, our framework supports real-time physics-driven animation using lightweight implicit rendering, achieving superior temporal consistency and visual realism compared with generative approaches.

[528] Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model

Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

Main category: cs.CV

TL;DR: Lazy Attention reduces MLLM inference costs by enabling cross-layer sharing of similar attention patterns through a novel Q Cache mechanism, cutting KV cache usage by 35% and improving throughput 1.5x with minimal performance loss.

Details

Motivation: Multimodal LLMs suffer from high inference costs due to excessive visual tokens from vision encoders, creating computational load and KV cache bottlenecks. Existing token pruning methods often compromise KV cache integrity and fail in long-text generation tasks.

Method: Proposes Lazy Attention mechanism that enables cross-layer sharing of similar attention patterns. Develops a novel layer-shared Q Cache for MLLMs that facilitates query reuse across adjacent layers. The method is lightweight, compatible with existing inference frameworks (Flash Attention, KV cache), and orthogonal to token-wise techniques.

Result: Reduces KV cache usage by over 35%, achieves 1.5x throughput improvement, with only ~1% performance loss on various MLLMs. Outperforms state-of-the-art token-wise methods in accuracy preservation across multiple benchmarks.

Conclusion: Lazy Attention provides an efficient attention mechanism for MLLMs that addresses computational bottlenecks while maintaining performance, offering a flexible solution that can be combined with existing token pruning approaches.

Abstract: Multimodal large language models (MLLMs) are plagued by exorbitant inference costs attributable to the profusion of visual tokens within the vision encoder. The redundant visual tokens engenders a substantial computational load and key-value (KV) cache footprint bottleneck. Existing approaches focus on token-wise optimization, leveraging diverse intricate token pruning techniques to eliminate non-crucial visual tokens. Nevertheless, these methods often unavoidably undermine the integrity of the KV cache, resulting in failures in long-text generation tasks. To this end, we conduct an in-depth investigation towards the attention mechanism of the model from a new perspective, and discern that attention within more than half of all decode layers are semantic similar. Upon this finding, we contend that the attention in certain layers can be streamlined by inheriting the attention from their preceding layers. Consequently, we propose Lazy Attention, an efficient attention mechanism that enables cross-layer sharing of similar attention patterns. It ingeniously reduces layer-wise redundant computation in attention. In Lazy Attention, we develop a novel layer-shared cache, Q Cache, tailored for MLLMs, which facilitates the reuse of queries across adjacent layers. In particular, Q Cache is lightweight and fully compatible with existing inference frameworks, including Flash Attention and KV cache. Additionally, our method is highly flexible as it is orthogonal to existing token-wise techniques and can be deployed independently or combined with token pruning approaches. Empirical evaluations on multiple benchmarks demonstrate that our method can reduce KV cache usage by over 35% and achieve 1.5x throughput improvement, while sacrificing only approximately 1% of performance on various MLLMs. Compared with SOTA token-wise methods, our technique achieves superior accuracy preservation.

[529] Learning Sparse Visual Representations via Spatial-Semantic Factorization

Theodore Zhengde Zhao, Sid Kiblawi, Jianwei Yang, Naoto Usuyama, Reuben Tan, Noel C Codella, Tristan Naumann, Hoifung Poon, Mu Wei

Main category: cs.CV

TL;DR: STELLAR resolves the conflict between semantic understanding and image reconstruction in SSL by factorizing visual features into semantic concepts and spatial distributions, enabling both high-quality reconstruction and strong semantic performance with sparse tokens.

Details

Motivation: Self-supervised learning faces a fundamental conflict: semantic SSL methods like DINO discard spatial information needed for reconstruction, while generative SSL methods like MAE preserve spatial details but lack high-level abstractions. There's a need to bridge this gap between discriminative and generative vision.

Method: STELLAR factorizes visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows DINO-style augmentation alignment on semantic tokens while maintaining precise spatial mapping in the localization matrix for pixel-level reconstruction.

Result: With only 16 sparse tokens, STELLAR achieves high-quality reconstruction (2.60 FID) while matching the semantic performance of dense backbones (79.10% ImageNet accuracy). This demonstrates a versatile sparse representation that bridges discriminative and generative vision.

Conclusion: STELLAR successfully resolves the tension between semantic understanding and image reconstruction in SSL by strategically separating semantic identity from spatial geometry, creating a unified framework for both discriminative and generative vision tasks.

Abstract: Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.

[530] DSXFormer: Dual-Pooling Spectral Squeeze-Expansion and Dynamic Context Attention Transformer for Hyperspectral Image Classification

Farhan Ullah, Irfan Ullah, Khalil Khan, Giovanni Pau, JaKeoung Koo

Main category: cs.CV

TL;DR: DSXFormer: A dual-pooling spectral squeeze-expansion transformer with dynamic context attention for hyperspectral image classification, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Hyperspectral image classification faces challenges due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled samples. Existing transformer-based models struggle with spectral discriminability while maintaining computational efficiency.

Method: Proposes DSXFormer with Dual-Pooling Spectral Squeeze-Expansion (DSX) block using global average and max pooling to recalibrate spectral features, and Dynamic Context Attention (DCA) mechanism within window-based transformer to capture local spectral-spatial relationships efficiently. Includes patch extraction, embedding, and merging for multi-scale learning.

Result: Achieves classification accuracies of 99.95% (Salinas), 98.91% (Indian Pines), 99.85% (Pavia University), and 98.52% (Kennedy Space Center), outperforming state-of-the-art methods on four benchmark datasets.

Conclusion: DSXFormer effectively balances spectral emphasis and spatial contextual representation through spectral dual-pooling squeeze-expansion and dynamic context attention, demonstrating superior performance for hyperspectral image classification.

Abstract: Hyperspectral image classification (HSIC) is a challenging task due to high spectral dimensionality, complex spectral-spatial correlations, and limited labeled training samples. Although transformer-based models have shown strong potential for HSIC, existing approaches often struggle to achieve sufficient spectral discriminability while maintaining computational efficiency. To address these limitations, we propose a novel DSXFormer, a novel dual-pooling spectral squeeze-expansion transformer with Dynamic Context Attention for HSIC. The proposed DSXFormer introduces a Dual-Pooling Spectral Squeeze-Expansion (DSX) block, which exploits complementary global average and max pooling to adaptively recalibrate spectral feature channels, thereby enhancing spectral discriminability and inter-band dependency modeling. In addition, DSXFormer incorporates a Dynamic Context Attention (DCA) mechanism within a window-based transformer architecture to dynamically capture local spectral-spatial relationships while significantly reducing computational overhead. The joint integration of spectral dual-pooling squeeze-expansion and DCA enables DSXFormer to achieve an effective balance between spectral emphasis and spatial contextual representation. Furthermore, patch extraction, embedding, and patch merging strategies are employed to facilitate efficient multi-scale feature learning. Extensive experiments conducted on four widely used hyperspectral benchmark datasets, including Salinas (SA), Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), demonstrate that DSXFormer consistently outperforms state-of-the-art methods, achieving classification accuracies of 99.95%, 98.91%, 99.85%, and 98.52%, respectively.

[531] Enabling Progressive Whole-slide Image Analysis with Multi-scale Pyramidal Network

Shuyang Wu, Yifu Qiu, Ines P. Nearchou, Sandrine Prost, Jonathan A Fallowfield, Hakan Bilen, Timothy J Kendall

Main category: cs.CV

TL;DR: MSPN is a plug-and-play multi-scale pyramidal network for attention-based MIL in computational pathology that improves performance across tasks while being lightweight.

Details

Motivation: Existing multi-scale approaches in computational pathology use late feature fusion with arbitrary magnification levels, which loses cross-scale feature relationships and is computationally expensive.

Method: Proposes MSPN with grid-based remapping (using high magnification features to derive coarse features) and coarse guidance network (CGN) that learns coarse contexts, enabling progressive multi-scale analysis on whole slide images.

Result: MSPN consistently improves MIL performance across 4 attention-based frameworks, 4 clinical tasks, and 3 foundation model types, while being lightweight and easy-to-use.

Conclusion: MSPN provides an effective plug-and-play solution for multi-scale analysis in computational pathology that enhances attention-based MIL frameworks across diverse clinical applications.

Abstract: Multiple-instance Learning (MIL) is commonly used to undertake computational pathology (CPath) tasks, and the use of multi-scale patches allows diverse features across scales to be learned. Previous studies using multi-scale features in clinical applications rely on multiple inputs across magnifications with late feature fusion, which does not retain the link between features across scales while the inputs are dependent on arbitrary, manufacturer-defined magnifications, being inflexible and computationally expensive. In this paper, we propose the Multi-scale Pyramidal Network (MSPN), which is plug-and-play over attention-based MIL that introduces progressive multi-scale analysis on WSI. Our MSPN consists of (1) grid-based remapping that uses high magnification features to derive coarse features and (2) the coarse guidance network (CGN) that learns coarse contexts. We benchmark MSPN as an add-on module to 4 attention-based frameworks using 4 clinically relevant tasks across 3 types of foundation model, as well as the pre-trained MIL framework. We show that MSPN consistently improves MIL across the compared configurations and tasks, while being lightweight and easy-to-use.

[532] Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images

Shuai Yang, Ziyue Huang, Jiaxin Chen, Qingjie Liu, Yunhong Wang

Main category: cs.CV

TL;DR: RS-MPOD: A multimodal open-vocabulary detection framework for remote sensing that uses visual prompts from exemplar instances alongside text prompts to improve category specification under semantic ambiguity.

Details

Motivation: Text-only prompting for open-vocabulary object detection in remote sensing often fails due to task-specific category semantics and distribution shifts, leading to unstable category specification under open-vocabulary settings.

Method: Proposes RS-MPOD with visual prompt encoder to extract appearance-based category cues from exemplar instances for text-free specification, and multimodal fusion module to integrate visual and textual information when both are available.

Result: Visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting remains competitive when textual semantics are well aligned.

Conclusion: Multimodal prompting (visual + textual) provides a flexible and robust alternative to text-only prompting for open-vocabulary object detection in remote sensing applications.

Abstract: Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.

[533] Your AI-Generated Image Detector Can Secretly Achieve SOTA Accuracy, If Calibrated

Muli Yang, Gabriel James Goenawan, Henan Wang, Huaiyuan Qin, Chenghao Xu, Yanhua Yang, Fen Fang, Ying Sun, Joo-Hwee Lim, Hongyuan Zhu

Main category: cs.CV

TL;DR: A post-hoc calibration framework for AI-generated image detectors that addresses systematic bias and distributional shift by applying a learnable scalar correction to model logits.

Details

Motivation: Existing AI-generated image detectors trained on balanced datasets show systematic bias at test time, frequently misclassifying fake images as real due to distributional shift in fake samples and implicit priors learned during training.

Method: Proposes a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. Introduces a learnable scalar correction to model logits, optimized on a small validation set from the target distribution while keeping the backbone frozen.

Result: Experiments on challenging benchmarks show the approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection.

Conclusion: The proposed calibration framework effectively addresses distributional shift in AI-generated image detection, providing a practical solution for real-world deployment without requiring extensive retraining.

Abstract: Despite being trained on balanced datasets, existing AI-generated image detectors often exhibit systematic bias at test time, frequently misclassifying fake images as real. We hypothesize that this behavior stems from distributional shift in fake samples and implicit priors learned during training. Specifically, models tend to overfit to superficial artifacts that do not generalize well across different generation methods, leading to a misaligned decision threshold when faced with test-time distribution shift. To address this, we propose a theoretically grounded post-hoc calibration framework based on Bayesian decision theory. In particular, we introduce a learnable scalar correction to the model’s logits, optimized on a small validation set from the target distribution while keeping the backbone frozen. This parametric adjustment compensates for distributional shift in model output, realigning the decision boundary even without requiring ground-truth labels. Experiments on challenging benchmarks show that our approach significantly improves robustness without retraining, offering a lightweight and principled solution for reliable and adaptive AI-generated image detection in the open world. Code is available at https://github.com/muliyangm/AIGI-Det-Calib.

[534] Enhancing Multi-Image Understanding through Delimiter Token Scaling

Minyoung Lee, Yeji Park, Dongjun Hwang, Yejin Kim, Seong Joon Oh, Junsuk Choe

Main category: cs.CV

TL;DR: A method to improve multi-image reasoning in LVLMs by scaling delimiter token hidden states to reduce cross-image information leakage, enhancing image distinction without extra training or inference costs.

Details

Motivation: Large Vision-Language Models perform well on single-image tasks but struggle with multiple images due to cross-image information leakage, where models fail to distinguish information across different images despite using delimiter tokens.

Method: Proposes scaling the hidden states of delimiter tokens to reinforce intra-image interactions and limit undesired cross-image interactions, enhancing the model’s ability to preserve image-specific information and distinguish between images.

Result: Shows performance gains on multi-image benchmarks (Mantis, MuirBench, MIRB, QBench2) and improves performance on text-only multi-document and multi-table understanding benchmarks (TQABench, MultiNews, WCEP-10) without additional training or inference costs.

Conclusion: Simple scaling of delimiter token hidden states effectively reduces cross-image information leakage in LVLMs, improving multi-image reasoning capabilities while maintaining computational efficiency.

Abstract: Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major reason is the cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose a method that scales the hidden states of delimiter tokens. This enhances the model’s ability to preserve image-specific information by reinforcing intra-image interaction and limiting undesired cross-image interactions. Consequently, the model is better able to distinguish between images and reason over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clear distinction. The method improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training or inference cost.

[535] Leveraging Latent Vector Prediction for Localized Control in Image Generation via Diffusion Models

Pablo Domingo-Gregorio, Javier Ruiz-Hidalgo

Main category: cs.CV

TL;DR: A novel diffusion model training framework that enables precise local control over user-defined image regions while allowing the model to autonomously generate remaining areas according to text prompts.

Details

Motivation: While diffusion models excel at text-to-image generation, achieving detailed control solely through text prompts is laborious. Existing methods use image-level controls (edges, segmentation, depth maps) but apply conditions uniformly across entire images, limiting localized control.

Method: Proposes a training framework incorporating masking features and an additional loss term that leverages prediction of the initial latent vector at any diffusion step to enhance correspondence between current step and final sample in latent space.

Result: Extensive experiments demonstrate the method effectively synthesizes high-quality images with controlled local conditions.

Conclusion: The approach enables precise local control over user-defined image regions while maintaining autonomous generation of remaining areas according to text prompts.

Abstract: Diffusion models emerged as a leading approach in text-to-image generation, producing high-quality images from textual descriptions. However, attempting to achieve detailed control to get a desired image solely through text remains a laborious trial-and-error endeavor. Recent methods have introduced image-level controls alongside with text prompts, using prior images to extract conditional information such as edges, segmentation and depth maps. While effective, these methods apply conditions uniformly across the entire image, limiting localized control. In this paper, we propose a novel methodology to enable precise local control over user-defined regions of an image, while leaving to the diffusion model the task of autonomously generating the remaining areas according to the original prompt. Our approach introduces a new training framework that incorporates masking features and an additional loss term, which leverages the prediction of the initial latent vector at any diffusion step to enhance the correspondence between the current step and the final sample in the latent space. Extensive experiments demonstrate that our method effectively synthesizes high-quality images with controlled local conditions.

[536] SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors

Bing He, Jingnan Gao, Yunuo Chen, Ning Cao, Gang Chen, Zhengxue Cheng, Li Song, Wenjun Zhang

Main category: cs.CV

TL;DR: SurfSplat: A feedforward framework using 2D Gaussian Splatting primitives for high-fidelity 3D reconstruction from sparse images, addressing issues of discrete point clouds and color bias in previous 3DGS-based methods.

Details

Motivation: Current 3D reconstruction methods using 3D Gaussian Splatting often produce discrete, color-biased point clouds that lack continuous surfaces and show severe artifacts under close-up views, limiting their practical application for high-fidelity reconstruction.

Method: SurfSplat uses 2D Gaussian Splatting primitives instead of 3DGS, providing stronger anisotropy and higher geometric precision. It incorporates a surface continuity prior and forced alpha blending strategy to reconstruct coherent geometry with faithful textures.

Result: Extensive experiments on RealEstate10K, DL3DV, and ScanNet show SurfSplat consistently outperforms prior methods on both standard metrics and the proposed High-Resolution Rendering Consistency (HRRC) metric.

Conclusion: SurfSplat establishes a robust solution for high-fidelity 3D reconstruction from sparse inputs, producing continuous surfaces with accurate geometry and texture while addressing the limitations of previous 3DGS-based approaches.

Abstract: Reconstructing 3D scenes from sparse images remains a challenging task due to the difficulty of recovering accurate geometry and texture without optimization. Recent approaches leverage generalizable models to generate 3D scenes using 3D Gaussian Splatting (3DGS) primitive. However, they often fail to produce continuous surfaces and instead yield discrete, color-biased point clouds that appear plausible at normal resolution but reveal severe artifacts under close-up views. To address this issue, we present SurfSplat, a feedforward framework based on 2D Gaussian Splatting (2DGS) primitive, which provides stronger anisotropy and higher geometric precision. By incorporating a surface continuity prior and a forced alpha blending strategy, SurfSplat reconstructs coherent geometry together with faithful textures. Furthermore, we introduce High-Resolution Rendering Consistency (HRRC), a new evaluation metric designed to evaluate high-resolution reconstruction quality. Extensive experiments on RealEstate10K, DL3DV, and ScanNet demonstrate that SurfSplat consistently outperforms prior methods on both standard metrics and HRRC, establishing a robust solution for high-fidelity 3D reconstruction from sparse inputs. Project page: https://hebing-sjtu.github.io/SurfSplat-website/

[537] UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving

Guosheng Zhao, Yaozeng Wang, Xiaofeng Wang, Zheng Zhu, Tingdong Yu, Guan Huang, Yongchen Zai, Ji Jiao, Changliang Xue, Xiaole Wang, Zhen Yang, Futang Zhu, Xingang Wang

Main category: cs.CV

TL;DR: UniDriveDreamer: A single-stage unified multimodal world model for autonomous driving that jointly generates multi-camera video and LiDAR sequences using aligned latent representations and diffusion transformers.

Details

Motivation: Existing world models for autonomous driving focus on single-modality generation (either video or LiDAR), lacking unified multimodal synthesis capabilities needed for comprehensive scene understanding.

Method: Proposes a unified framework with: 1) LiDAR-specific VAE and video VAE for encoding inputs, 2) Unified Latent Anchoring (ULA) to align cross-modal latent distributions, 3) Diffusion transformer for joint modeling of geometric correspondence and temporal evolution, 4) Structured scene layout conditioning per modality.

Result: Outperforms previous state-of-the-art methods in both video and LiDAR generation, with measurable improvements in downstream tasks.

Conclusion: UniDriveDreamer demonstrates effective unified multimodal world modeling for autonomous driving, enabling joint synthesis of visual and geometric sensor data without cascaded modules.

Abstract: World models have demonstrated significant promise for data synthesis in autonomous driving. However, existing methods predominantly concentrate on single-modality generation, typically focusing on either multi-camera video or LiDAR sequence synthesis. In this paper, we propose UniDriveDreamer, a single-stage unified multimodal world model for autonomous driving, which directly generates multimodal future observations without relying on intermediate representations or cascaded modules. Our framework introduces a LiDAR-specific variational autoencoder (VAE) designed to encode input LiDAR sequences, alongside a video VAE for multi-camera images. To ensure cross-modal compatibility and training stability, we propose Unified Latent Anchoring (ULA), which explicitly aligns the latent distributions of the two modalities. The aligned features are fused and processed by a diffusion transformer that jointly models their geometric correspondence and temporal evolution. Additionally, structured scene layout information is projected per modality as a conditioning signal to guide the synthesis. Extensive experiments demonstrate that UniDriveDreamer outperforms previous state-of-the-art methods in both video and LiDAR generation, while also yielding measurable improvements in downstream

[538] ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning

Gongli Xi, Kun Wang, Zeming Gao, Huahui Yi, Haolang Lu, Ye Tian, Wendong Wang

Main category: cs.CV

TL;DR: ClueTracer: A training-free method to suppress hallucinations in multimodal reasoning models by tracing clue propagation from question to visual tokens, improving performance on reasoning benchmarks.

Details

Motivation: Multimodal reasoning models suffer from hallucinations where they generate content not supported by input images or questions, due to "reasoning drift" where models over-focus on irrelevant entities during clue gathering.

Method: Introduces ClueTracer, a training-free, parameter-free plugin that traces how key clues propagate along the reasoning pathway (question → outputs → visual tokens) to localize task-relevant patches while suppressing attention to irrelevant regions.

Result: ClueTracer improves all reasoning architectures by 1.21× on reasoning benchmarks without additional training, and yields 1.14× gain when transferred to non-reasoning settings.

Conclusion: The method effectively addresses hallucination in multimodal reasoning by identifying and suppressing reasoning drift through visual clue tracing, working across various architectures without retraining.

Abstract: Large multimodal reasoning models solve challenging visual problems via explicit long-chain inference: they gather visual clues from images and decode clues into textual tokens. Yet this capability also increases hallucinations, where the model generates content that is not supported by the input image or the question. To understand this failure mode, we identify \emph{reasoning drift}: during clue gathering, the model over-focuses on question-irrelevant entities, diluting focus on task-relevant cues and gradually decoupling the reasoning trace from visual grounding. As a consequence, many inference-time localization or intervention methods developed for non-reasoning models fail to pinpoint the true clues in reasoning settings. Motivated by these insights, we introduce ClueRecall, a metric for assessing visual clue retrieval, and present ClueTracer, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. ClueTracer starts from the question and traces how key clues propagate along the model’s reasoning pathway (question $\rightarrow$ outputs $\rightarrow$ visual tokens), thereby localizing task-relevant patches while suppressing spurious attention to irrelevant regions. Remarkably, \textbf{without any additional training}, ClueTracer improves all \textbf{reasoning} architectures (including \texttt{R1-OneVision}, \texttt{Ocean-R1}, \texttt{MM-Eureka}, \emph{etc}.) by $\mathbf{1.21\times}$ on reasoning benchmarks. When transferred to \textbf{non-reasoning} settings, it yields a $\mathbf{1.14\times}$ gain.

[539] One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation

Shuo Lu, Haohan Wang, Wei Feng, Weizhen Wang, Shen Zhang, Yaoyu Li, Ao Ma, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Bing Zhan, Yuan Xu, Huizai Yao, Yongcan Yu, Chenyang Si, Jian Liang

Main category: cs.CV

TL;DR: OSMF is a unified framework for advertising image generation that aligns diverse group-wise click preferences using product-aware adaptive grouping and a Group-aware Multimodal Large Language Model (G-MLLM) fine-tuned with Group-DPO for preference alignment.

Details

Motivation: Existing advertising image generation approaches use a "one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups, leading to suboptimal performance for specific groups and limiting targeted marketing effectiveness.

Method: 1) Product-aware adaptive grouping dynamically organizes users based on attributes and product characteristics; 2) Preference-conditioned image generation uses a Group-aware Multimodal Large Language Model (G-MLLM) pre-trained to comprehend group features and generate images; 3) Fine-tuning with Group-DPO for group-wise preference alignment; 4) Introduction of GAIP dataset with 600K groups from 40M users.

Result: The framework achieves state-of-the-art performance in both offline and online settings, effectively enhancing each group’s CTR on generated images through group-wise preference alignment.

Conclusion: OSMF successfully bridges the gap in advertising image generation by addressing preference diversity among user groups through adaptive grouping and multimodal LLM-based generation with preference alignment, advancing the field with a new dataset and methodology.

Abstract: Advertising image generation has increasingly focused on online metrics like Click-Through Rate (CTR), yet existing approaches adopt a ``one-size-fits-all" strategy that optimizes for overall CTR while neglecting preference diversity among user groups. This leads to suboptimal performance for specific groups, limiting targeted marketing effectiveness. To bridge this gap, we present \textit{One Size, Many Fits} (OSMF), a unified framework that aligns diverse group-wise click preferences in large-scale advertising image generation. OSMF begins with product-aware adaptive grouping, which dynamically organizes users based on their attributes and product characteristics, representing each group with rich collective preference features. Building on these groups, preference-conditioned image generation employs a Group-aware Multimodal Large Language Model (G-MLLM) to generate tailored images for each group. The G-MLLM is pre-trained to simultaneously comprehend group features and generate advertising images. Subsequently, we fine-tune the G-MLLM using our proposed Group-DPO for group-wise preference alignment, which effectively enhances each group’s CTR on the generated images. To further advance this field, we introduce the Grouped Advertising Image Preference Dataset (GAIP), the first large-scale public dataset of group-wise image preferences, including around 600K groups built from 40M users. Extensive experiments demonstrate that our framework achieves the state-of-the-art performance in both offline and online settings. Our code and datasets will be released at https://github.com/JD-GenX/OSMF.

[540] Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models

Cristian Sbrolli, Matteo Matteucci, Toshihiko Yamasaki

Main category: cs.CV

TL;DR: Auto-Comp introduces an automated synthetic pipeline for generating scalable benchmarks to analyze compositional reasoning failures in Vision-Language Models, revealing universal failures in attribute binding and spatial relations.

Details

Motivation: VLMs exhibit critical flaws in compositional reasoning, confusing attributes like "a red cube and a blue sphere" with "a blue cube and a red sphere." There's a need for fine-grained, controllable analysis to disentangle visual and linguistic roots of these failures for robust evaluation.

Method: Auto-Comp is a fully automated synthetic pipeline that generates paired images from Minimal captions (simple descriptions) and LLM-generated Contextual captions (rich scene descriptions). This enables controlled A/B testing to isolate core binding ability from visio-linguistic complexity. The approach includes novel benchmarks for color binding, spatial relations, and a “Confusion Benchmark” with low-entropy distractors.

Result: Evaluation of 20 VLMs (CLIP and SigLIP families) reveals universal compositional failures. The Confusion Benchmark shows models are highly susceptible to low-entropy distractors (repeated objects/colors). A surprising trade-off emerges: visio-linguistic context aids spatial reasoning but hinders local attribute binding by introducing visual clutter.

Conclusion: Auto-Comp provides a powerful framework for analyzing compositional reasoning failures in VLMs, revealing deeper flaws beyond simple attribute swaps. The pipeline enables scalable benchmark generation for future research, with released resources on HuggingFace.

Abstract: Modern Vision-Language Models (VLMs) exhibit a critical flaw in compositional reasoning, often confusing “a red cube and a blue sphere” with “a blue cube and a red sphere”. Disentangling the visual and linguistic roots of these failures is a fundamental challenge for robust evaluation. To enable fine-grained, controllable analysis, we introduce Auto-Comp, a fully automated and synthetic pipeline for generating scalable benchmarks. Its controllable nature is key to dissecting and isolating different reasoning skills. Auto-Comp generates paired images from Minimal (e.g., “a monitor to the left of a bicycle on a white background”) and LLM-generated Contextual captions (e.g., “In a brightly lit photography studio, a monitor is positioned to the left of a bicycle”), allowing a controlled A/B test to disentangle core binding ability from visio-linguistic complexity. Our evaluation of 20 VLMs on novel benchmarks for color binding and spatial relations reveals universal compositional failures in both CLIP and SigLIP model families. Crucially, our novel “Confusion Benchmark” reveals a deeper flaw beyond simple attribute swaps: models are highly susceptible to low-entropy distractors (e.g., repeated objects or colors), demonstrating their compositional failures extend beyond known bag-of-words limitations. we uncover a surprising trade-off: visio-linguistic context, which provides global scene cues, aids spatial reasoning but simultaneously hinders local attribute binding by introducing visual clutter. We release the Auto-Comp pipeline to facilitate future benchmark creation, alongside all our generated benchmarks (https://huggingface.co/AutoComp).

[541] Multi-View Stenosis Classification Leveraging Transformer-Based Multiple-Instance Learning Using Real-World Clinical Data

Nikola Cenikj, Özgün Turgut, Alexander Müller, Alexander Steger, Jan Kehrer, Marcus Brugger, Daniel Rueckert, Eimo Martens, Philip Müller

Main category: cs.CV

TL;DR: SegmentMIL: A transformer-based multi-view multiple-instance learning framework for patient-level coronary stenosis classification without view-level annotations

Details

Motivation: Existing deep learning models for coronary stenosis detection require expensive view-level annotations and fail to capture temporal dynamics and dependencies among multiple angiography views, which are crucial for clinical diagnosis.

Method: Proposes SegmentMIL, a transformer-based multi-view multiple-instance learning framework that uses patient-level supervision without view-level annotations. It jointly predicts stenosis presence and localizes affected anatomical regions (right/left coronary arteries and segments).

Result: SegmentMIL achieves high performance on internal and external evaluations, outperforming both view-level models and classical MIL baselines.

Conclusion: SegmentMIL shows potential as a clinically viable and scalable solution for coronary stenosis diagnosis by leveraging patient-level supervision and capturing multi-view dependencies.

Abstract: Coronary artery stenosis is a leading cause of cardiovascular disease, diagnosed by analyzing the coronary arteries from multiple angiography views. Although numerous deep-learning models have been proposed for stenosis detection from a single angiography view, their performance heavily relies on expensive view-level annotations, which are often not readily available in hospital systems. Moreover, these models fail to capture the temporal dynamics and dependencies among multiple views, which are crucial for clinical diagnosis. To address this, we propose SegmentMIL, a transformer-based multi-view multiple-instance learning framework for patient-level stenosis classification. Trained on a real-world clinical dataset, using patient-level supervision and without any view-level annotations, SegmentMIL jointly predicts the presence of stenosis and localizes the affected anatomical region, distinguishing between the right and left coronary arteries and their respective segments. SegmentMIL obtains high performance on internal and external evaluations and outperforms both view-level models and classical MIL baselines, underscoring its potential as a clinically viable and scalable solution for coronary stenosis diagnosis. Our code is available at https://github.com/NikolaCenic/mil-stenosis.

[542] UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction

Changbai Li, Haodong Zhu, Hanlin Chen, Xiuping Liang, Tongfei Chen, Shuwei Shao, Linlin Yang, Huobin Tan, Baochang Zhang

Main category: cs.CV

TL;DR: UrbanGS: A scalable 3D Gaussian Splatting framework for large-scale urban environments that improves geometric consistency through depth-normal regularization and adaptive pruning, while enhancing memory efficiency and computational scalability.

Details

Motivation: Extending 3D Gaussian Splatting to large-scale urban environments faces challenges in geometric consistency, memory efficiency, and computational scalability. Existing methods struggle with updating position parameters and handling complex large-scale scenes.

Method: 1) Depth-Consistent D-Normal Regularization module integrating D-Normal constraints with external depth supervision for comprehensive geometric parameter updates; 2) Adaptive confidence weighting based on gradient consistency and inverse depth deviation; 3) Spatially Adaptive Gaussian Pruning (SAGP) strategy that dynamically adjusts Gaussian density; 4) Unified partitioning and view assignment scheme to eliminate boundary artifacts.

Result: Extensive experiments on multiple urban datasets demonstrate superior performance in rendering quality, geometric accuracy, and memory efficiency compared to existing approaches.

Conclusion: UrbanGS provides a systematic solution for high-fidelity large-scale scene reconstruction, effectively addressing the challenges of geometric consistency, memory efficiency, and computational scalability in city-scale 3D Gaussian Splatting applications.

Abstract: While 3D Gaussian Splatting (3DGS) enables high-quality, real-time rendering for bounded scenes, its extension to large-scale urban environments gives rise to critical challenges in terms of geometric consistency, memory efficiency, and computational scalability. To address these issues, we present UrbanGS, a scalable reconstruction framework that effectively tackles these challenges for city-scale applications. First, we propose a Depth-Consistent D-Normal Regularization module. Unlike existing approaches that rely solely on monocular normal estimators, which can effectively update rotation parameters yet struggle to update position parameters, our method integrates D-Normal constraints with external depth supervision. This allows for comprehensive updates of all geometric parameters. By further incorporating an adaptive confidence weighting mechanism based on gradient consistency and inverse depth deviation, our approach significantly enhances multi-view depth alignment and geometric coherence, which effectively resolves the issue of geometric accuracy in complex large-scale scenes. To improve scalability, we introduce a Spatially Adaptive Gaussian Pruning (SAGP) strategy, which dynamically adjusts Gaussian density based on local geometric complexity and visibility to reduce redundancy. Additionally, a unified partitioning and view assignment scheme is designed to eliminate boundary artifacts and optimize computational load. Extensive experiments on multiple urban datasets demonstrate that UrbanGS achieves superior performance in rendering quality, geometric accuracy, and memory efficiency, providing a systematic solution for high-fidelity large-scale scene reconstruction.

[543] FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team, Qingyu Chen, Zhiyuan Fang, Haibin Huang, Xinwei Huang, Tong Jin, Minxuan Lin, Bo Liu, Celong Liu, Chongyang Ma, Xing Mei, Xiaohui Shen, Yaojie Shen, Fuwen Tan, Angtian Wang, Xiao Yang, Yiding Yang, Jiamin Yuan, Lingxi Zhang, Yuxin Zhang

Main category: cs.CV

TL;DR: FSVideo: A fast transformer-based image-to-video diffusion framework using highly-compressed latent space, memory-enhanced diffusion transformers, and multi-resolution generation for efficient high-quality video synthesis.

Details

Motivation: To create an efficient image-to-video generation system that achieves competitive quality while being significantly faster than existing open-source models, addressing the computational challenges of video generation.

Method: 1) New video autoencoder with 64×64×4 spatial-temporal compression, 2) Diffusion transformer with layer memory design for better inter-layer information flow, 3) Multi-resolution generation using a few-step DIT upsampler to enhance video fidelity.

Result: Achieves competitive performance against popular open-source models while being an order of magnitude faster, using a 14B DIT base model and 14B DIT upsampler.

Conclusion: FSVideo demonstrates that efficient transformer-based architectures with optimized latent spaces and memory designs can enable fast, high-quality image-to-video generation.

Abstract: We introduce FSVideo, a fast speed transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: 1.) a new video autoencoder with highly-compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio), achieving competitive reconstruction quality; 2.) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within DIT, and 3.) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models, while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.

[544] Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model

Yu Wang, Chuanguang Yang, Zhulin An, Weilun Feng, Jiarui Zhao, Chengqing Yu, Libo Huang, Boyu Diao, Yongjun Xu

Main category: cs.CV

TL;DR: DSKD introduces a novel knowledge distillation method using teacher-guided diffusion to denoise student features, then performs distillation between original and denoised student features using locality-sensitive hashing, eliminating teacher-student feature distribution discrepancies.

Details

Motivation: Existing KD methods suffer from incompatible information transfer due to differences in feature distributions between teacher and student models, leading to suboptimal knowledge transfer.

Method: Proposes DSKD: teacher-guided student Diffusion Self-KD. Uses teacher classifier to guide diffusion model’s sampling process to denoise student features, then performs LSH-guided feature distillation between original and denoised student features.

Result: Significantly outperforms existing KD methods across various models and datasets on visual recognition tasks.

Conclusion: DSKD effectively eliminates teacher-student feature distribution discrepancies while learning meaningful knowledge from teachers, demonstrating superior performance over conventional KD approaches.

Abstract: Existing Knowledge Distillation (KD) methods often align feature information between teacher and student by exploring meaningful feature processing and loss functions. However, due to the difference in feature distributions between the teacher and student, the student model may learn incompatible information from the teacher. To address this problem, we propose teacher-guided student Diffusion Self-KD, dubbed as DSKD. Instead of the direct teacher-student alignment, we leverage the teacher classifier to guide the sampling process of denoising student features through a light-weight diffusion model. We then propose a novel locality-sensitive hashing (LSH)-guided feature distillation method between the original and denoised student features. The denoised student features encapsulate teacher knowledge and could be regarded as a teacher role. In this way, our DSKD method could eliminate discrepancies in mapping manners and feature distributions between the teacher and student, while learning meaningful knowledge from the teacher. Experiments on visual recognition tasks demonstrate that DSKD significantly outperforms existing KD methods across various models and datasets. Our code is attached in supplementary material.

[545] Enhancing Diffusion-Based Quantitatively Controllable Image Generation via Matrix-Form EDM and Adaptive Vicinal Training

Xin Ding, Yun Chen, Sen Zhang, Kao Zhang, Nenglun Chen, Peibei Cao, Yongwei Wang, Fei Wu

Main category: cs.CV

TL;DR: iCCDM improves continuous conditional diffusion models by incorporating EDM framework with matrix formulation and adaptive vicinal training, achieving better quality and efficiency than existing methods including large-scale text-to-image models.

Details

Motivation: CCDM has limitations due to outdated diffusion framework and low sampling efficiency, and has been surpassed by GAN-based methods. Need to improve both generation quality and sampling efficiency for continuous conditional image generation.

Method: Proposes iCCDM framework incorporating Elucidated Diffusion Model (EDM) with novel matrix-form EDM formulation and adaptive vicinal training strategy to enhance both quality and efficiency.

Result: iCCDM consistently outperforms existing methods across four benchmark datasets (64×64 to 256×256 resolution), including state-of-the-art large-scale text-to-image diffusion models (Stable Diffusion 3, FLUX.1, Qwen-Image), achieving higher generation quality with significantly reduced sampling cost.

Conclusion: iCCDM successfully addresses CCDM’s limitations by integrating advanced EDM framework with novel formulations, establishing new state-of-the-art for continuous conditional image generation with improved efficiency.

Abstract: Continuous Conditional Diffusion Model (CCDM) is a diffusion-based framework designed to generate high-quality images conditioned on continuous regression labels. Although CCDM has demonstrated clear advantages over prior approaches across a range of datasets, it still exhibits notable limitations and has recently been surpassed by a GAN-based method, namely CcGAN-AVAR. These limitations mainly arise from its reliance on an outdated diffusion framework and its low sampling efficiency due to long sampling trajectories. To address these issues, we propose an improved CCDM framework, termed iCCDM, which incorporates the more advanced \textit{Elucidated Diffusion Model} (EDM) framework with substantial modifications to improve both generation quality and sampling efficiency. Specifically, iCCDM introduces a novel matrix-form EDM formulation together with an adaptive vicinal training strategy. Extensive experiments on four benchmark datasets, spanning image resolutions from $64\times64$ to $256\times256$, demonstrate that iCCDM consistently outperforms existing methods, including state-of-the-art large-scale text-to-image diffusion models (e.g., Stable Diffusion 3, FLUX.1, and Qwen-Image), achieving higher generation quality while significantly reducing sampling cost.

[546] MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos

Yangyi Cao, Yuanhang Li, Lan Chen, Qi Mao

Main category: cs.CV

TL;DR: MLV-Edit is a training-free, flow-based framework for minute-level video editing that addresses computational overhead and temporal consistency challenges through segment-wise editing with Velocity Blend and Attention Sink modules.

Details

Motivation: Existing video editing techniques excel at short-form manipulation but struggle with minute-level videos due to prohibitive computational costs and difficulty maintaining global temporal consistency across thousands of frames.

Method: MLV-Edit uses a divide-and-conquer strategy for segment-wise editing with two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning flow fields, and Attention Sink anchors local segment features to global reference frames to suppress structural drift.

Result: Extensive experiments show MLV-Edit consistently outperforms state-of-the-art methods in temporal stability and semantic fidelity for minute-level video editing.

Conclusion: MLV-Edit provides an effective training-free solution for long-duration video editing that maintains temporal consistency and handles computational challenges through innovative flow-based techniques.

Abstract: We propose MLV-Edit, a training-free, flow-based framework that address the unique challenges of minute-level video editing. While existing techniques excel in short-form video manipulation, scaling them to long-duration videos remains challenging due to prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address this, MLV-Edit employs a divide-and-conquer strategy for segment-wise editing, facilitated by two core modules: Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, eliminating flickering and boundary artifacts commonly observed in fragmented video processing; and Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in terms of temporal stability and semantic fidelity.

[547] Toxicity Assessment in Preclinical Histopathology via Class-Aware Mahalanobis Distance for Known and Novel Anomalies

Olga Graf, Dhrupal Patel, Peter Groß, Charlotte Lempp, Matthias Hein, Fabian Heinemann

Main category: cs.CV

TL;DR: AI-based anomaly detection framework for histopathological whole-slide images in rodent livers that identifies healthy tissue, known pathologies, and rare pathologies without training data as out-of-distribution findings.

Details

Motivation: Drug-induced toxicity is a major cause of failure in preclinical development, and histopathological evaluation relies heavily on expert pathologists, creating bottlenecks for large-scale screening. Early detection of adverse effects is critical to reduce attrition and accelerate safe drug development.

Method: Created a novel dataset with pixelwise annotations of healthy tissue and known pathologies. Fine-tuned a pre-trained Vision Transformer (DINOv2) using Low-Rank Adaptation (LoRA) for tissue segmentation. Extracted features for out-of-distribution detection using Mahalanobis distance with class-specific thresholds optimized using mean of false negative and false positive rates.

Result: Achieved high accuracy with only 0.16% of pathological tissue classified as healthy and 0.35% of healthy tissue classified as pathological. Framework accurately detects anomalies including rare OOD morphologies in mouse liver WSIs with known toxicological findings.

Conclusion: Demonstrates potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development through automated anomaly detection in histopathological images.

Abstract: Drug-induced toxicity remains a leading cause of failure in preclinical development and early clinical trials. Detecting adverse effects at an early stage is critical to reduce attrition and accelerate the development of safe medicines. Histopathological evaluation remains the gold standard for toxicity assessment, but it relies heavily on expert pathologists, creating a bottleneck for large-scale screening. To address this challenge, we introduce an AI-based anomaly detection framework for histopathological whole-slide images (WSIs) in rodent livers from toxicology studies. The system identifies healthy tissue and known pathologies (anomalies) for which training data is available. In addition, it can detect rare pathologies without training data as out-of-distribution (OOD) findings. We generate a novel dataset of pixelwise annotations of healthy tissue and known pathologies and use this data to fine-tune a pre-trained Vision Transformer (DINOv2) via Low-Rank Adaptation (LoRA) in order to do tissue segmentation. Finally, we extract features for OOD detection using the Mahalanobis distance. To better account for class-dependent variability in histological data, we propose the use of class-specific thresholds. We optimize the thresholds using the mean of the false negative and false positive rates, resulting in only 0.16% of pathological tissue classified as healthy and 0.35% of healthy tissue classified as pathological. Applied to mouse liver WSIs with known toxicological findings, the framework accurately detects anomalies, including rare OOD morphologies. This work demonstrates the potential of AI-driven histopathology to support preclinical workflows, reduce late-stage failures, and improve efficiency in drug development.

[548] Eliminating Registration Bias in Synthetic CT Generation: A Physics-Based Simulation Framework

Lukas Zimmermann, Michael Rauter, Maximilian Schmid, Dietmar Georg, Barbara Knäusl

Main category: cs.CV

TL;DR: Physics-based CBCT simulation generates geometrically aligned training pairs for synthetic CT generation, avoiding registration bias issues in conventional supervised methods.

Details

Motivation: Supervised CT generation from CBCT suffers from registration bias in training pairs, where imperfect alignment between scans corrupts models and evaluation metrics, making benchmark performance misleading.

Method: Proposes physics-based CBCT simulation to create geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics (Normalized Mutual Information) against input CBCT rather than biased ground truth.

Result: Models trained on synthetic data achieved superior geometric alignment (NMI: 0.31 vs 0.22) despite lower conventional intensity scores. NMI consistently predicted observer preference across registration methods, with clinical observers preferring synthetic-trained outputs in 87% of cases.

Conclusion: Geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements for CT generation from CBCT, demonstrating the importance of proper evaluation metrics.

Abstract: Supervised synthetic CT generation from CBCT requires registered training pairs, yet perfect registration between separately acquired scans remains unattainable. This registration bias propagates into trained models and corrupts standard evaluation metrics. This may suggest that superior benchmark performance indicates better reproduction of registration artifacts rather than anatomical fidelity. We propose physics-based CBCT simulation to provide geometrically aligned training pairs by construction, combined with evaluation using geometric alignment metrics against input CBCT rather than biased ground truth. On two independent pelvic datasets, models trained on synthetic data achieved superior geometric alignment (Normalized Mutual Information: 0.31 vs 0.22) despite lower conventional intensity scores. Intensity metrics showed inverted correlations with clinical assessment for deformably registered data, while Normalized Mutual Information consistently predicted observer preference across registration methodologies (rho = 0.31, p < 0.001). Clinical observers preferred synthetic-trained outputs in 87% of cases, demonstrating that geometric fidelity, not intensity agreement with biased ground truth, aligns with clinical requirements.

[549] Deep learning enables urban change profiling through alignment of historical maps

Sidi Wu, Yizi Chen, Maurizio Gribaudi, Konrad Schindler, Clément Mallet, Julien Perret, Lorenz Hurni

Main category: cs.CV

TL;DR: A deep learning framework for automated fine-grained urban change analysis from historical map collections using dense alignment, multi-temporal object detection, and change profiling

Details

Motivation: Historical maps provide unique records of long-term urban transformation but extracting consistent change information is challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches

Method: Fully automated deep learning-based framework with modular design integrating dense map alignment, multi-temporal object detection, and change profiling to enable systematic quantitative characterization of urban change

Result: Demonstrates robust performance of alignment and object detection methods; applied to Paris (1868-1937) reveals spatial and temporal heterogeneity in urban transformation; framework supports adaptation to diverse cartographic contexts

Conclusion: Shifts historical map analysis from ad hoc visual comparison to systematic quantitative characterization, with relevance for social sciences and humanities research

Abstract: Prior to modern Earth observation technologies, historical maps provide a unique record of long-term urban transformation and offer a lens on the evolving identity of cities. However, extracting consistent and fine-grained change information from historical map series remains challenging due to spatial misalignment, cartographic variation, and degrading document quality, limiting most analyses to small-scale or qualitative approaches. We propose a fully automated, deep learning-based framework for fine-grained urban change analysis from large collections of historical maps, built on a modular design that integrates dense map alignment, multi-temporal object detection, and change profiling. This framework shifts the analysis of historical maps from ad hoc visual comparison toward systematic, quantitative characterization of urban change. Experiments demonstrate the robust performance of the proposed alignment and object detection methods. Applied to Paris between 1868 and 1937, the framework reveals the spatial and temporal heterogeneity in urban transformation, highlighting its relevance for research in the social sciences and humanities. The modular design of our framework further supports adaptation to diverse cartographic contexts and downstream applications.

[550] LoopViT: Scaling Visual ARC with Looped Transformers

Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, Harry Yang

Main category: cs.CV

TL;DR: Loop-ViT introduces a recursive vision transformer architecture with weight-tied recurrence and dynamic exit mechanism for more efficient visual reasoning on ARC-AGI benchmark.

Details

Motivation: Current feed-forward vision transformers have computational depth strictly bound to parameter size, which fails to capture the iterative, algorithmic nature of human induction needed for visual reasoning tasks like ARC-AGI.

Method: Proposes Loop-ViT with weight-tied recurrence that decouples reasoning depth from model capacity, using a Hybrid Block combining local convolutions and global attention. Introduces parameter-free Dynamic Exit mechanism based on predictive entropy to halt inference when internal state reaches low-uncertainty attractor.

Result: 18M parameter Loop-ViT achieves 65.8% accuracy on ARC-AGI-1 benchmark, outperforming massive 73M-parameter ensembles, demonstrating adaptive iterative computation as more efficient scaling axis than increasing network width.

Conclusion: Recursive architectures with adaptive computation offer superior efficiency for visual reasoning compared to traditional feed-forward models, providing a new scaling axis through iterative computation rather than parameter growth.

Abstract: Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state ``crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.

[551] Reg4Pru: Regularisation Through Random Token Routing for Token Pruning

Julian Wyatt, Ronald Clark, Irina Voiculescu

Main category: cs.CV

TL;DR: Reg4Pru is a training regularization technique that improves token pruning performance for vision transformers in segmentation tasks, achieving 46% AP improvement with 29% speedup.

Details

Motivation: Vision transformers suffer from quadratic computational scaling with token count. Token pruning methods improve efficiency but degrade performance in deeper layers due to instability from preserved representations, particularly affecting dense prediction tasks like segmentation.

Method: Introduces Reg4Pru, a training regularization technique specifically designed to mitigate performance loss from token pruning. The method likely involves regularization strategies that stabilize preserved token representations during pruning operations.

Result: On the FIVES blood vessel segmentation dataset, Reg4Pru improves average precision by 46% absolute compared to same model trained without routing, while achieving 29% relative speedup in wall-clock time compared to non-pruned baseline.

Conclusion: Reg4Pru is an effective regularizer for token reduction strategies in vision transformers, enabling significant computational efficiency gains without sacrificing segmentation performance.

Abstract: Transformers are widely adopted in modern vision models due to their strong ability to scale with dataset size and generalisability. However, this comes with a major drawback: computation scales quadratically to the total number of tokens. Numerous methods have been proposed to mitigate this. For example, we consider token pruning with reactivating tokens from preserved representations, but the increased computational efficiency of this method results in decreased stability from the preserved representations, leading to poorer dense prediction performance at deeper layers. In this work, we introduce Reg4Pru, a training regularisation technique that mitigates token-pruning performance loss for segmentation. We compare our models on the FIVES blood vessel segmentation dataset and find that Reg4Pru improves average precision by an absolute 46% compared to the same model trained without routing. This increase is observed using a configuration that achieves a 29% relative speedup in wall-clock time compared to the non-pruned baseline. These findings indicate that Reg4Pru is a valuable regulariser for token reduction strategies.

[552] Lung Nodule Image Synthesis Driven by Two-Stage Generative Adversarial Networks

Lu Cao, Xiquan He, Junying Zeng, Chaoyun Mai, Min Luo

Main category: cs.CV

TL;DR: TSGAN: Two-stage GAN for generating diverse, controllable lung nodule CT images by decoupling morphology and texture features to improve detection model performance.

Details

Motivation: Limited sample size and insufficient diversity in lung nodule CT datasets restrict detection model performance and generalization. Existing methods generate images with insufficient diversity and controllability, suffering from monotonous texture features and distorted anatomical structures.

Method: Two-stage GAN: 1) StyleGAN generates semantic segmentation masks to control anatomical structure; 2) DL-Pix2Pix translates masks into CT images using local importance attention and dynamic weight multi-head window attention to enhance texture and background modeling.

Result: Accuracy improved by 4.6% and mAP by 4% on LUNA16 dataset compared to original dataset. TSGAN enhances synthetic image quality and detection model performance.

Conclusion: TSGAN effectively enhances diversity and spatial controllability of synthetic lung nodule CT data by decoupling morphological structure and texture features, leading to improved detection model performance.

Abstract: The limited sample size and insufficient diversity of lung nodule CT datasets severely restrict the performance and generalization ability of detection models. Existing methods generate images with insufficient diversity and controllability, suffering from issues such as monotonous texture features and distorted anatomical structures. Therefore, we propose a two-stage generative adversarial network (TSGAN) to enhance the diversity and spatial controllability of synthetic data by decoupling the morphological structure and texture features of lung nodules. In the first stage, StyleGAN is used to generate semantic segmentation mask images, encoding lung nodules and tissue backgrounds to control the anatomical structure of lung nodule images; The second stage uses the DL-Pix2Pix model to translate the mask map into CT images, employing local importance attention to capture local features, while utilizing dynamic weight multi-head window attention to enhance the modeling capability of lung nodule texture and background. Compared to the original dataset, the accuracy improved by 4.6% and mAP by 4% on the LUNA16 dataset. Experimental results demonstrate that TSGAN can enhance the quality of synthetic images and the performance of detection models.

[553] CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization

Xinquan Yu, Wei Lu, Xiangyang Luo

Main category: cs.CV

TL;DR: CIEC is a weakly-supervised framework for multimodal manipulation localization in image-text pairs using only coarse-grained annotations, achieving results comparable to fully supervised methods.

Details

Motivation: Current multimodal manipulation localization methods require costly fine-grained annotations (patch/token-level). The paper aims to develop a weakly-supervised approach that only needs coarse-grained image/sentence-level annotations to make the process more practical and scalable.

Method: CIEC framework has two branches: 1) Image-based weakly-supervised localization using Textual-guidance Refine Patch Selection (TRPS) module that integrates visual and textual forgery cues with spatial priors, plus background silencing and spatial contrast constraints. 2) Text-based weakly-supervised localization using Visual-deviation Calibrated Token Grounding (VCTG) module that focuses on content words with visual bias, plus asymmetric sparse and semantic consistency constraints.

Result: Extensive experiments show CIEC achieves results comparable to fully supervised methods on several evaluation metrics, demonstrating effectiveness of the weakly-supervised approach.

Conclusion: CIEC successfully addresses the annotation bottleneck in multimodal manipulation localization by proposing a weakly-supervised framework that leverages implicit and explicit cues between modalities, achieving strong performance with only coarse-grained supervision.

Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. Consider that current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which aims to achieve multimodal weakly-supervised manipulation localization for image-text pairs utilizing only coarse-grained image/sentence-level annotations. It comprises two branches, image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module. It integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions aided by spatial priors. Followed by the background silencing and spatial contrast constraints to suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module. It focuses on meaningful content words and leverages relative visual bias to assist token localization. Followed by the asymmetric sparse and semantic consistency constraints to mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.

[554] Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision

Ziqiao Weng, Jiancheng Yang, Kangxian Xie, Bo Zhou, Weidong Cai

Main category: cs.CV

TL;DR: TopoField: A topology-aware implicit modeling framework for repairing incomplete pulmonary trees from CT images, enabling unified multi-task inference including anatomical labeling and lung segment reconstruction.

Details

Motivation: Pulmonary trees extracted from CT images often have topological incompleteness (missing/disconnected branches), degrading anatomical analysis. Current approaches are inefficient and lack robustness under structural corruption.

Method: Uses sparse surface and skeleton point clouds to learn a continuous implicit field for topology repair without complete annotations, trained on synthetically introduced disruptions over already incomplete trees. Jointly infers anatomical labeling and lung segment reconstruction through task-specific implicit functions.

Result: Extensive experiments on Lung3D+ dataset show improved topological completeness and accurate anatomical labeling/segment reconstruction under incomplete scenarios. High computational efficiency (just over one second per case).

Conclusion: TopoField provides an efficient, practical solution for pulmonary tree analysis with topology repair, suitable for large-scale clinical applications.

Abstract: Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing or explicit graph reasoning, leading to limited efficiency and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over \textit{already} incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass.Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications. Code and data will be available at https://github.com/HINTLab/TopoField.

[555] SSI-DM: Singularity Skipping Inversion of Diffusion Models

Chen Min, Enze Jiang, Jishen Peng, Zheng Ma

Main category: cs.CV

TL;DR: SSI-DM addresses the ill-posed nature of diffusion model inversion by skipping singular regions through small noise addition before standard inversion, producing Gaussian noise with good editability.

Details

Motivation: Existing diffusion model inversion methods produce non-Gaussian noise with poor editability due to inaccuracies in early noising steps, caused by a mathematical singularity that makes inversion fundamentally ill-posed.

Method: Singularity Skipping Inversion (SSI-DM) bypasses the singular region by adding small noise before standard inversion, producing inverted noise with natural Gaussian properties while maintaining reconstruction fidelity.

Result: The method achieves superior performance on public image datasets for reconstruction and interpolation tasks, providing a principled and efficient solution to diffusion model inversion.

Conclusion: SSI-DM offers a simple, plug-and-play technique compatible with general diffusion models that solves the fundamental ill-posedness of diffusion inversion while preserving editability.

Abstract: Inverting real images into the noise space is essential for editing tasks using diffusion models, yet existing methods produce non-Gaussian noise with poor editability due to the inaccuracy in early noising steps. We identify the root cause: a mathematical singularity that renders inversion fundamentally ill-posed. We propose Singularity Skipping Inversion of Diffusion Models (SSI-DM), which bypasses this singular region by adding small noise before standard inversion. This simple approach produces inverted noise with natural Gaussian properties while maintaining reconstruction fidelity. As a plug-and-play technique compatible with general diffusion models, our method achieves superior performance on public image datasets for reconstruction and interpolation tasks, providing a principled and efficient solution to diffusion model inversion.

[556] MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models

Zheyuan Zhou, Liang Du, Zixun Sun, Xiaoyu Zhou, Ruimin Ye, Qihao Chen, Yinda Chen, Lemiao Qiu

Main category: cs.CV

TL;DR: MAIN-VLA framework improves VLA decision-making in complex environments by abstracting intentions and environment semantics, enabling efficient attention and state-of-the-art performance in games like Minecraft and Valorant.

Details

Motivation: Existing Visual-Language-Action (VLA) approaches struggle in highly complex, dynamic environments with real-time unpredictable interactions (3D open worlds, large-scale PvP games) due to inefficient extraction of action-critical signals from redundant sensor streams.

Method: Introduces MAIN-VLA framework with two key components: 1) Intention Abstraction (IA) extracts verbose linguistic instructions into compact semantic primitives, 2) Environment Semantics Abstraction (ESA) projects visual streams into structured topological affordance representation. Aligning these modalities enables parameter-free token-pruning for attention concentration.

Result: Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace, Valorant) show MAIN-VLA achieves state-of-the-art performance with superior decision quality, stronger generalization, and cutting-edge inference efficiency.

Conclusion: MAIN-VLA successfully addresses VLA inefficiencies in complex environments through explicit abstraction of intention and environment semantics, enabling deep semantic alignment and efficient attention mechanisms for improved decision-making.

Abstract: Despite significant progress in Visual-Language-Action (VLA), in highly complex and dynamic environments that involve real-time unpredictable interactions (such as 3D open worlds and large-scale PvP games), existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) extracts verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state-of-the-art, which achieves superior decision quality, stronger generalization, and cutting-edge inference efficiency.

[557] Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

Main category: cs.CV

TL;DR: Proposes Causal Forcing method to bridge architectural gap when distilling bidirectional video diffusion models into autoregressive models for real-time interactive video generation, using AR teacher for ODE initialization instead of bidirectional teacher.

Details

Motivation: Current methods for real-time interactive video generation distill pretrained bidirectional video diffusion models into few-step autoregressive models, but face an architectural gap when replacing full attention with causal attention. Existing approaches don't bridge this gap theoretically and use ODE distillation that requires frame-level injectivity, which is violated when distilling AR student from bidirectional teacher, leading to performance degradation.

Method: Proposes Causal Forcing that uses an autoregressive teacher for ODE initialization instead of a bidirectional teacher, thereby bridging the architectural gap. This ensures the frame-level injectivity condition is satisfied and allows proper recovery of the teacher’s flow map.

Result: Outperforms all baselines across all metrics, surpassing state-of-the-art Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following.

Conclusion: Causal Forcing effectively bridges the architectural gap in distilling bidirectional video diffusion models into autoregressive models for real-time interactive video generation, achieving superior performance by using AR teacher for ODE initialization.

Abstract: To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher’s flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing that uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and the code: \href{https://thu-ml.github.io/CausalForcing.github.io/}{https://thu-ml.github.io/CausalForcing.github.io/}

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

Main category: cs.CV

TL;DR: LangMap is a large-scale benchmark for multi-granularity language-driven navigation tasks in 3D indoor environments, featuring comprehensive annotations and evaluation of agents interpreting natural language instructions at four semantic levels.

Details

Motivation: The paper addresses the need for better benchmarks to evaluate language-driven embodied navigation, particularly for understanding relationships between objects and language at multiple semantic granularities (scene, room, region, instance) to advance meaningful human-AI communication and embodied intelligence.

Method: The authors introduce HieraNav as a multi-granularity navigation task and create LangMap benchmark using real-world 3D indoor scans with human-verified annotations. They provide region labels, discriminative region/instance descriptions covering 414 object categories, and over 18K navigation tasks with both concise and detailed descriptions.

Result: LangMap achieves 23.8% higher discriminative accuracy than GOAT-Bench using four times fewer words. Evaluations show richer context and memory improve success rates, while challenges remain with long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion.

Conclusion: HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation, providing high-quality annotations and tasks that reveal current model limitations and opportunities for improvement in multimodal language understanding for embodied AI.

Abstract: The relationships between objects and language are fundamental to meaningful communication between humans and AI, and to practically useful embodied intelligence. We introduce HieraNav, a multi-granularity, open-vocabulary goal navigation task where agents interpret natural language instructions to reach targets at four semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), a large-scale benchmark built on real-world 3D indoor scans with comprehensive human-verified annotations and tasks spanning these levels. LangMap provides region labels, discriminative region descriptions, discriminative instance descriptions covering 414 object categories, and over 18K navigation tasks. Each target features both concise and detailed descriptions, enabling evaluation across different instruction styles. LangMap achieves superior annotation quality, outperforming GOAT-Bench by 23.8% in discriminative accuracy using four times fewer words. Comprehensive evaluations of zero-shot and supervised models on LangMap reveal that richer context and memory improve success, while long-tailed, small, context-dependent, and distant goals, as well as multi-goal completion, remain challenging. HieraNav and LangMap establish a rigorous testbed for advancing language-driven embodied navigation. Project: https://bo-miao.github.io/LangMap

[559] MIRROR: Manifold Ideal Reference ReconstructOR for Generalizable AI-Generated Image Detection

Ruiqi Liu, Manni Cui, Ziheng Qin, Zhiyuan Yan, Ruoxin Chen, Yi Han, Zhiheng Li, Junkai Chen, ZhiJin Chen, Kaiqing Lin, Jialiang Shen, Lubin Weng, Jing Dong, Yan Wang, Shu Wu

Main category: cs.CV

TL;DR: MIRROR reformulates AI-generated image detection as a reference-comparison problem using a learnable memory bank to encode reality priors, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: Existing AI-generated image detectors rely on artifact-based classification and struggle to generalize to evolving generative traces, while human judgment uses stable real-world regularities. The authors aim to create a more generalizable detection method that verifies consistency with the real-image manifold rather than fitting specific forgery cues.

Method: Proposes MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. The method projects an input into a manifold-consistent ideal reference via sparse linear combination and uses the resulting residuals as robust detection signals.

Result: Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On the Human-AIGI benchmark, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts.

Conclusion: MIRROR demonstrates superior performance in AI-generated image detection by leveraging manifold consistency rather than artifact-based classification, approaching human perceptual limits as pretrained backbones scale, and showing potential to replace human experts in media security applications.

Abstract: High-fidelity generative models have narrowed the perceptual gap between synthetic and real images, posing serious threats to media security. Most existing AI-generated image (AIGI) detectors rely on artifact-based classification and struggle to generalize to evolving generative traces. In contrast, human judgment relies on stable real-world regularities, with deviations from the human cognitive manifold serving as a more generalizable signal of forgery. Motivated by this insight, we reformulate AIGI detection as a Reference-Comparison problem that verifies consistency with the real-image manifold rather than fitting specific forgery cues. We propose MIRROR (Manifold Ideal Reference ReconstructOR), a framework that explicitly encodes reality priors using a learnable discrete memory bank. MIRROR projects an input into a manifold-consistent ideal reference via sparse linear combination, and uses the resulting residuals as robust detection signals. To evaluate whether detectors reach the “superhuman crossover” required to replace human experts, we introduce the Human-AIGI benchmark, featuring a psychophysically curated human-imperceptible subset. Across 14 benchmarks, MIRROR consistently outperforms prior methods, achieving gains of 2.1% on six standard benchmarks and 8.1% on seven in-the-wild benchmarks. On Human-AIGI, MIRROR reaches 89.6% accuracy across 27 generators, surpassing both lay users and visual experts, and further approaching the human perceptual limit as pretrained backbones scale. The code is publicly available at: https://github.com/349793927/MIRROR

[560] Evaluating OCR Performance for Assistive Technology: Effects of Walking Speed, Camera Placement, and Camera Type

Junchi Feng, Nikhil Ballem, Mahya Beheshti, Giles Hamilton-Fletcher, Todd Hudson, Maurizio Porfiri, William H. Seiple, John-Ross Rizzo

Main category: cs.CV

TL;DR: Systematic evaluation of OCR performance under static and dynamic conditions for assistive technology, testing distance, viewing angles, walking speeds, and camera positions across multiple OCR engines and devices.

Details

Motivation: Most OCR evaluations use static datasets that don't reflect real-world mobile use challenges for people with blindness and low vision, so this study aims to systematically evaluate OCR performance under both static and dynamic conditions.

Method: Conducted static tests measuring detection range across distances (1-7m) and viewing angles (0-75° horizontally), and dynamic tests varying walking speeds (0.8-1.8 m/s) with three camera positions (head, shoulder, hand-held). Evaluated smartphone and smart glasses with phone’s main/ultra-wide cameras, benchmarking four OCR engines (Google Vision, PaddleOCR 3.0, EasyOCR, Tesseract) for static tests, then using PaddleOCR for dynamic tests. Accuracy computed at character level using Levenshtein ratio against ground truth.

Result: Recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved highest overall accuracy, with PaddleOCR as strongest open-source alternative. Phone’s main camera achieved highest accuracy, and shoulder-mounted placement yielded highest average among body positions, though differences among shoulder, head, and hand were not statistically significant.

Conclusion: Mobile OCR performance is significantly affected by motion and viewing conditions, with Google Vision performing best overall and shoulder-mounted positioning showing practical advantages for real-world assistive technology applications.

Abstract: Optical character recognition (OCR), which converts printed or handwritten text into machine-readable form, is widely used in assistive technology for people with blindness and low vision. Yet, most evaluations rely on static datasets that do not reflect the challenges of mobile use. In this study, we systematically evaluated OCR performance under both static and dynamic conditions. Static tests measured detection range across distances of 1-7 meters and viewing angles of 0-75 degrees horizontally. Dynamic tests examined the impact of motion by varying walking speed from slow (0.8 m/s) to very fast (1.8 m/s) and comparing three camera mounting positions: head-mounted, shoulder-mounted, and hand-held. We evaluated both a smartphone and smart glasses, using the phone’s main and ultra-wide cameras. Four OCR engines were benchmarked to assess accuracy at different distances and viewing angles: Google Vision, PaddleOCR 3.0, EasyOCR, and Tesseract. PaddleOCR 3.0 was then used to evaluate accuracy at different walking speeds. Accuracy was computed at the character level using the Levenshtein ratio against manually defined ground truth. Results showed that recognition accuracy declined with increased walking speed and wider viewing angles. Google Vision achieved the highest overall accuracy, with PaddleOCR close behind as the strongest open-source alternative. Across devices, the phone’s main camera achieved the highest accuracy, and a shoulder-mounted placement yielded the highest average among body positions; however, differences among shoulder, head, and hand were not statistically significant.

[561] Show, Don’t Tell: Morphing Latent Reasoning into Image Generation

Harold Haodong Chen, Xinxiang Yin, Wen-Jie Shu, Hongfei Zhang, Zixin Zhang, Chenfei Liao, Litao Guo, Qifeng Chen, Ying-Cong Chen

Main category: cs.CV

TL;DR: LatentMorph integrates implicit latent reasoning into text-to-image generation using lightweight components for visual memory, guidance translation, prediction steering, and adaptive reasoning invocation, achieving significant performance improvements and efficiency gains.

Details

Motivation: Current text-to-image generation methods lack dynamic reasoning and refinement capabilities during generation, which is a hallmark of human creativity. Existing reasoning-augmented paradigms rely on explicit thought processes with discrete text decoding at fixed steps, leading to inefficiencies, information loss, and cognitive mismatches.

Method: LatentMorph introduces four lightweight components: (1) a condenser for summarizing intermediate generation states into compact visual memory, (2) a translator for converting latent thoughts into actionable guidance, (3) a shaper for dynamically steering next image token predictions, and (4) an RL-trained invoker for adaptively determining when to invoke reasoning. The framework performs reasoning entirely in continuous latent spaces.

Result: LatentMorph enhances base model Janus-Pro by 16% on GenEval and 25% on T2I-CompBench; outperforms explicit paradigms by 15% and 11% on abstract reasoning tasks like WISE and IPV-Txt; reduces inference time by 44% and token consumption by 51%; and exhibits 71% cognitive alignment with human intuition on reasoning invocation.

Conclusion: LatentMorph successfully bridges the gap in dynamic reasoning for text-to-image generation by performing implicit reasoning in continuous latent spaces, achieving significant performance improvements, efficiency gains, and better cognitive alignment compared to explicit reasoning paradigms.

Abstract: Text-to-image (T2I) generation has achieved remarkable progress, yet existing methods often lack the ability to dynamically reason and refine during generation–a hallmark of human creativity. Current reasoning-augmented paradigms most rely on explicit thought processes, where intermediate reasoning is decoded into discrete text at fixed steps with frequent image decoding and re-encoding, leading to inefficiencies, information loss, and cognitive mismatches. To bridge this gap, we introduce LatentMorph, a novel framework that seamlessly integrates implicit latent reasoning into the T2I generation process. At its core, LatentMorph introduces four lightweight components: (i) a condenser for summarizing intermediate generation states into compact visual memory, (ii) a translator for converting latent thoughts into actionable guidance, (iii) a shaper for dynamically steering next image token predictions, and (iv) an RL-trained invoker for adaptively determining when to invoke reasoning. By performing reasoning entirely in continuous latent spaces, LatentMorph avoids the bottlenecks of explicit reasoning and enables more adaptive self-refinement. Extensive experiments demonstrate that LatentMorph (I) enhances the base model Janus-Pro by $16%$ on GenEval and $25%$ on T2I-CompBench; (II) outperforms explicit paradigms (e.g., TwiG) by $15%$ and $11%$ on abstract reasoning tasks like WISE and IPV-Txt, (III) while reducing inference time by $44%$ and token consumption by $51%$; and (IV) exhibits $71%$ cognitive alignment with human intuition on reasoning invocation.

[562] LiFlow: Flow Matching for 3D LiDAR Scene Completion

Andrea Matteazzi, Dietmar Tutsch

Main category: cs.CV

TL;DR: LiFlow: First flow matching framework for 3D LiDAR scene completion that improves upon diffusion methods by ensuring consistent training/inference distributions

Details

Motivation: LiDAR point clouds in autonomous driving suffer from occlusion and long-range sparsity, limiting perception. Existing diffusion-based scene completion methods have distribution mismatch between training and inference.

Method: Proposes flow matching framework with nearest neighbor flow matching loss and Chamfer distance loss to enhance both local structure and global coverage in point cloud alignment.

Result: LiFlow achieves state-of-the-art performance across multiple metrics for 3D LiDAR scene completion.

Conclusion: Flow matching provides superior approach to 3D LiDAR scene completion compared to diffusion methods, with better distribution consistency and performance.

Abstract: In autonomous driving scenarios, the collected LiDAR point clouds can be challenged by occlusion and long-range sparsity, limiting the perception of autonomous driving systems. Scene completion methods can infer the missing parts of incomplete 3D LiDAR scenes. Recent methods adopt local point-level denoising diffusion probabilistic models, which require predicting Gaussian noise, leading to a mismatch between training and inference initial distributions. This paper introduces the first flow matching framework for 3D LiDAR scene completion, improving upon diffusion-based methods by ensuring consistent initial distributions between training and inference. The model employs a nearest neighbor flow matching loss and a Chamfer distance loss to enhance both local structure and global coverage in the alignment of point clouds. LiFlow achieves state-of-the-art performance across multiple metrics. Code: https://github.com/matteandre/LiFlow.

[563] Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation

Xiang Li, Yupeng Zheng, Pengfei Li, Yilun Chen, Ya-Qin Zhang, Wenchao Ding

Main category: cs.CV

TL;DR: DiScene: A sparse query-based framework for efficient occupancy prediction using multi-level knowledge distillation and teacher-guided initialization.

Details

Motivation: Current occupancy prediction methods face efficiency-accuracy trade-offs - dense methods waste computation on empty voxels, while sparse query-based approaches lack robustness in complex indoor scenes.

Method: Proposes DiScene with two key innovations: 1) Multi-level Consistent Knowledge Distillation transferring hierarchical representations from teacher to student models across four levels (encoder, query, prior, anchor), and 2) Teacher-Guided Initialization for optimized parameter warm-up to accelerate convergence.

Result: Achieves 23.2 FPS without depth priors, outperforming baseline OPUS by 36.1% and even better than depth-enhanced OPUS†. With depth integration, DiScene† attains new SOTA, surpassing EmbodiedOcc by 3.7% with 1.62× faster inference speed. Shows versatility across Occ-Scannet, Occ3D-nuScenes benchmarks and in-the-wild scenarios.

Conclusion: DiScene provides an efficient and robust sparse query-based framework for occupancy prediction through multi-level distillation, achieving state-of-the-art performance with faster inference speeds across diverse environments.

Abstract: Occupancy prediction provides critical geometric and semantic understanding for robotics but faces efficiency-accuracy trade-offs. Current dense methods suffer computational waste on empty voxels, while sparse query-based approaches lack robustness in diverse and complex indoor scenes. In this paper, we propose DiScene, a novel sparse query-based framework that leverages multi-level distillation to achieve efficient and robust occupancy prediction. In particular, our method incorporates two key innovations: (1) a Multi-level Consistent Knowledge Distillation strategy, which transfers hierarchical representations from large teacher models to lightweight students through coordinated alignment across four levels, including encoder-level feature alignment, query-level feature matching, prior-level spatial guidance, and anchor-level high-confidence knowledge transfer and (2) a Teacher-Guided Initialization policy, employing optimized parameter warm-up to accelerate model convergence. Validated on the Occ-Scannet benchmark, DiScene achieves 23.2 FPS without depth priors while outperforming our baseline method, OPUS, by 36.1% and even better than the depth-enhanced version, OPUS†. With depth integration, DiScene† attains new SOTA performance, surpassing EmbodiedOcc by 3.7% with 1.62$\times$ faster inference speed. Furthermore, experiments on the Occ3D-nuScenes benchmark and in-the-wild scenarios demonstrate the versatility of our approach in various environments. Code and models can be accessed at https://github.com/getterupper/DiScene.

[564] VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations

Fatemeh Zargarbashi, Dhruv Agrawal, Jakob Buhmann, Martin Guay, Stelian Coros, Robert W. Sumner

Main category: cs.CV

TL;DR: A novel method for disentangling style and content in human motion data using RVQ-VAEs with contrastive learning and information leakage loss, enabling style transfer without fine-tuning via Quantized Code Swapping.

Details

Motivation: Human motion data contains rich semantic content and subtle stylistic features that are challenging to model and disentangle. Current methods struggle to effectively separate style from content for applications like style transfer.

Method: Uses Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn coarse-to-fine motion representations. Integrates contrastive learning and a novel information leakage loss with codebook learning to organize content and style across different codebooks. Employs Quantized Code Swapping for inference-time style transfer without fine-tuning.

Result: The framework demonstrates strong versatility across multiple inference applications including style transfer, style removal, and motion blending. Enables motion style transfer without requiring fine-tuning for unseen styles.

Conclusion: The proposed method effectively disentangles style and content in human motion data, enabling flexible style manipulation through a simple inference-time technique that works without additional training for new styles.

Abstract: Human motion data is inherently rich and complex, containing both semantic content and subtle stylistic features that are challenging to model. We propose a novel method for effective disentanglement of the style and content in human motion data to facilitate style transfer. Our approach is guided by the insight that content corresponds to coarse motion attributes while style captures the finer, expressive details. To model this hierarchy, we employ Residual Vector Quantized Variational Autoencoders (RVQ-VAEs) to learn a coarse-to-fine representation of motion. We further enhance the disentanglement by integrating contrastive learning and a novel information leakage loss with codebook learning to organize the content and the style across different codebooks. We harness this disentangled representation using our simple and effective inference-time technique Quantized Code Swapping, which enables motion style transfer without requiring any fine-tuning for unseen styles. Our framework demonstrates strong versatility across multiple inference applications, including style transfer, style removal, and motion blending.

[565] LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization

Zhenpeng Huang, Jiaqi Li, Zihan Jia, Xinhao Li, Desen Meng, Lingxue Song, Xi Chen, Liang Li, Limin Wang

Main category: cs.CV

TL;DR: LongVPO is a two-stage DPO framework that enables short-context vision-language models to understand ultra-long videos without long-video annotations, using synthetic preference triples and multi-segment reasoning tasks.

Details

Motivation: Existing vision-language models struggle with ultra-long video understanding due to limited context windows and lack of long-video annotations. There's a need for scalable methods that can extend short-context models to handle long videos without costly human labeling.

Method: Two-stage approach: Stage 1 synthesizes preference triples by anchoring questions to short clips with distractors, using visual-similarity and question-specificity filtering. Stage 2 uses recursive captioning on long videos to generate scene-level metadata, then employs LLMs to craft multi-segment reasoning queries and dispreferred responses for alignment.

Result: LongVPO outperforms state-of-the-art open-source models on multiple long-video benchmarks while maintaining strong short-video performance on MVBench, achieving this with only 16K synthetic examples and no human labels.

Conclusion: LongVPO provides a scalable paradigm for efficient long-form video understanding that extends short-context models to handle ultra-long videos without costly annotations, demonstrating strong performance with minimal synthetic data.

Abstract: We present LongVPO, a novel two-stage Direct Preference Optimization framework that enables short-context vision-language models to robustly understand ultra-long videos without any long-video annotations. In Stage 1, we synthesize preference triples by anchoring questions to individual short clips, interleaving them with distractors, and applying visual-similarity and question-specificity filtering to mitigate positional bias and ensure unambiguous supervision. We also approximate the reference model’s scoring over long contexts by evaluating only the anchor clip, reducing computational overhead. In Stage 2, we employ a recursive captioning pipeline on long videos to generate scene-level metadata, then use a large language model to craft multi-segment reasoning queries and dispreferred responses, aligning the model’s preferences through multi-segment reasoning tasks. With only 16K synthetic examples and no costly human labels, LongVPO outperforms the state-of-the-art open-source models on multiple long-video benchmarks, while maintaining strong short-video performance (e.g., on MVBench), offering a scalable paradigm for efficient long-form video understanding.

[566] Implicit neural representation of textures

Albert Kwok, Zheyuan Hu, Dounia Hammou

Main category: cs.CV

TL;DR: The paper explores using implicit neural representations (INRs) as continuous texture representations in UV coordinate space, analyzing performance trade-offs and applications in real-time rendering.

Details

Motivation: To develop more efficient and continuous texture representations using implicit neural networks that operate in UV coordinate space rather than discrete representations, addressing limitations of traditional texture mapping approaches.

Method: Designs different neural networks as texture INRs that operate continuously over UV coordinate space, conducting experiments to evaluate image quality, memory usage, and rendering inference time trade-offs.

Result: Demonstrates that INRs perform well in terms of image quality while offering considerable memory usage and rendering inference time benefits, with analysis of balance between these objectives.

Conclusion: INRs provide an effective continuous texture representation with good performance trade-offs, enabling various applications in real-time rendering and downstream tasks like mipmap fitting and INR-space generation.

Abstract: Implicit neural representation (INR) has proven to be accurate and efficient in various domains. In this work, we explore how different neural networks can be designed as a new texture INR, which operates in a continuous manner rather than a discrete one over the input UV coordinate space. Through thorough experiments, we demonstrate that these INRs perform well in terms of image quality, with considerable memory usage and rendering inference time. We analyze the balance between these objectives. In addition, we investigate various related applications in real-time rendering and down-stream tasks, e.g. mipmap fitting and INR-space generation.

[567] NAB: Neural Adaptive Binning for Sparse-View CT reconstruction

Wangduo Xie, Matthew B. Blaschko

Main category: cs.CV

TL;DR: NAB: Neural Adaptive Binning method for CT reconstruction that integrates rectangular shape priors through a novel binning mechanism with learnable parameters, improving sparse-view reconstruction for industrial objects.

Details

Motivation: Industrial CT reconstruction from sparse views needs to reduce costs while maintaining quality. Existing implicit neural networks lack ability to incorporate shape priors, despite many industrial objects having rectangular structures.

Method: Proposes Neural Adaptive Binning (NAB) that maps coordinate space to binned vector space using shifted hyperbolic tangent functions with learnable position, size, steepness, and rotation parameters. The binned representations are processed by a neural network to predict CT attenuation coefficients, enabling end-to-end optimization.

Result: NAB achieves superior performance on two industrial datasets and maintains robustness on medical datasets when extended to more general expressions. The method effectively integrates shape priors to enhance reconstruction accuracy.

Conclusion: NAB provides a new perspective on integrating shape priors into neural network-based reconstruction, offering improved sparse-view CT reconstruction for industrial applications while maintaining generalization capability.

Abstract: Computed Tomography (CT) plays a vital role in inspecting the internal structures of industrial objects. Furthermore, achieving high-quality CT reconstruction from sparse views is essential for reducing production costs. While classic implicit neural networks have shown promising results for sparse reconstruction, they are unable to leverage shape priors of objects. Motivated by the observation that numerous industrial objects exhibit rectangular structures, we propose a novel \textbf{N}eural \textbf{A}daptive \textbf{B}inning (\textbf{NAB}) method that effectively integrates rectangular priors into the reconstruction process. Specifically, our approach first maps coordinate space into a binned vector space. This mapping relies on an innovative binning mechanism based on differences between shifted hyperbolic tangent functions, with our extension enabling rotations around the input-plane normal vector. The resulting representations are then processed by a neural network to predict CT attenuation coefficients. This design enables end-to-end optimization of the encoding parameters – including position, size, steepness, and rotation – via gradient flow from the projection data, thus enhancing reconstruction accuracy. By adjusting the smoothness of the binning function, NAB can generalize to objects with more complex geometries. This research provides a new perspective on integrating shape priors into neural network-based reconstruction. Extensive experiments demonstrate that NAB achieves superior performance on two industrial datasets. It also maintains robust on medical datasets when the binning function is extended to more general expression. The code will be made available.

[568] Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes

Uma Meleti, Jeffrey J. Nirschl

Main category: cs.CV

TL;DR: SNGP improves uncertainty estimation and OOD detection in digital pathology models while maintaining in-distribution performance.

Details

Motivation: Current deep learning models for digital pathology are overconfident and poorly calibrated in out-of-distribution settings, limiting clinical trust and adoption. Safety-critical medical imaging needs uncertainty-aware properties to accurately reject OOD inputs.

Method: Implement Spectral-normalized Neural Gaussian Process (SNGP) - lightweight modifications applying spectral normalization and replacing final dense layer with Gaussian process layer to improve single-model uncertainty estimation and OOD detection.

Result: SNGP shows comparable in-distribution performance to deterministic and Monte Carlo dropout methods while significantly improving uncertainty estimation and OOD detection across six datasets and three biomedical classification tasks.

Conclusion: SNGP offers a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.

Abstract: Accurate histopathologic interpretation is key for clinical decision-making; however, current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limit trust and clinical adoption. Safety-critical medical imaging workflows benefit from intrinsic uncertainty-aware properties that can accurately reject OOD input. We implement the Spectral-normalized Neural Gaussian Process (SNGP), a set of lightweight modifications that apply spectral normalization and replace the final dense layer with a Gaussian process layer to improve single-model uncertainty estimation and OOD detection. We evaluate SNGP vs. deterministic and MonteCarlo dropout on six datasets across three biomedical classification tasks: white blood cells, amyloid plaques, and colorectal histopathology. SNGP has comparable in-distribution performance while significantly improving uncertainty estimation and OOD detection. Thus, SNGP or related models offer a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.

[569] Unified Personalized Reward Model for Vision Generation

Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang

Main category: cs.CV

TL;DR: UnifiedReward-Flex: A personalized reward model for vision generation that uses context-adaptive reasoning to assess visual content based on semantic intent and visual evidence, addressing limitations of one-size-fits-all reward models.

Details

Motivation: Current multimodal reward models for visual generation follow a one-size-fits-all paradigm that assumes monolithic preference distributions or uses fixed evaluation rubrics, making them insensitive to content-specific visual cues and systematically misaligned with subjective, context-dependent human preferences.

Method: Two-stage training: (1) Distill structured reasoning traces from advanced VLMs for supervised fine-tuning to enable flexible, context-adaptive reasoning; (2) Perform direct preference optimization on curated preference pairs to strengthen reasoning fidelity and discriminative alignment. The model interprets semantic intent, grounds on visual evidence, and dynamically constructs hierarchical assessments with fine-grained criteria.

Result: When integrated into GRPO framework for image and video synthesis, UnifiedReward-Flex demonstrates superiority over existing approaches, showing improved alignment with human preferences through personalized, context-aware assessment.

Conclusion: The proposed unified personalized reward model successfully addresses limitations of current reward models by incorporating flexible, context-adaptive reasoning inspired by human assessment, leading to better alignment with subjective human preferences in visual generation tasks.

Abstract: Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds on visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate the effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.

[570] Personalized Image Generation via Human-in-the-loop Bayesian Optimization

Rajalaxmi Rajagopalan, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury

Main category: cs.CV

TL;DR: MultiBO uses multi-choice preferential Bayesian optimization to refine AI-generated images based on human feedback when language prompts reach their limits, enabling personalized image generation closer to users’ mental images.

Details

Motivation: When users have specific mental images that are difficult to describe with language prompts alone, current generative models struggle to produce exactly what users envision. There's a gap between what can be achieved with language prompts and the user's actual mental image that needs to be bridged.

Method: MultiBO (Multi-Choice Preferential Bayesian Optimization) generates K new images based on an initial prompt-generated image, collects preferential feedback from users on which images are closer to their mental image, uses Bayesian optimization to guide the diffusion model, and iteratively refines the images over B rounds of feedback.

Result: The method shows promising results with qualitative scores from 30 users and quantitative metrics compared across 5 baselines. Users can arrive much closer to their mental images within B rounds of feedback, even though the generative model has no direct information about the target image.

Conclusion: Multi-choice feedback from humans can be effectively harnessed for personalized image generation, bridging the gap between language-prompted results and users’ specific mental images through iterative preferential optimization.

Abstract: Imagine Alice has a specific image $x^\ast$ in her mind, say, the view of the street in which she grew up during her childhood. To generate that exact image, she guides a generative model with multiple rounds of prompting and arrives at an image $x^{p*}$. Although $x^{p*}$ is reasonably close to $x^\ast$, Alice finds it difficult to close that gap using language prompts. This paper aims to narrow this gap by observing that even after language has reached its limits, humans can still tell when a new image $x^+$ is closer to $x^\ast$ than $x^{p*}$. Leveraging this observation, we develop MultiBO (Multi-Choice Preferential Bayesian Optimization) that carefully generates $K$ new images as a function of $x^{p*}$, gets preferential feedback from the user, uses the feedback to guide the diffusion model, and ultimately generates a new set of $K$ images. We show that within $B$ rounds of user feedback, it is possible to arrive much closer to $x^\ast$, even though the generative model has no information about $x^\ast$. Qualitative scores from $30$ users, combined with quantitative metrics compared across $5$ baselines, show promising results, suggesting that multi-choice feedback from humans can be effectively harnessed for personalized image generation.

[571] Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory

Ruiqi Wu, Xuanhua He, Meng Cheng, Tianyu Yang, Yong Zhang, Zhuoliang Kang, Xunliang Cai, Xiaoming Wei, Chunle Guo, Chongyi Li, Ming-Ming Cheng

Main category: cs.CV

TL;DR: Infinite-World is a robust interactive world model that maintains coherent visual memory over 1000+ frames in complex real-world environments, addressing challenges of noisy pose estimations and viewpoint scarcity in real-world video training.

Details

Motivation: Existing world models work well on synthetic data with perfect ground-truth but lack effective training paradigms for real-world videos due to noisy pose estimations and scarcity of viewpoint revisits, limiting their practical applicability.

Method: 1) Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into fixed-budget representation; 2) Uncertainty-aware Action Labeling that discretizes continuous motion into tri-state logic; 3) Revisit-Dense Finetuning Strategy using compact dataset to activate long-range loop-closure capabilities.

Result: Extensive experiments show Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency compared to existing methods, demonstrating robust long-term memory capabilities.

Conclusion: Infinite-World successfully bridges the gap between synthetic and real-world video training for world models, enabling coherent visual memory over extended sequences without explicit geometric priors.

Abstract: We propose Infinite-World, a robust interactive world model capable of maintaining coherent visual memory over 1000+ frames in complex real-world environments. While existing world models can be efficiently optimized on synthetic data with perfect ground-truth, they lack an effective training paradigm for real-world videos due to noisy pose estimations and the scarcity of viewpoint revisits. To bridge this gap, we first introduce a Hierarchical Pose-free Memory Compressor (HPMC) that recursively distills historical latents into a fixed-budget representation. By jointly optimizing the compressor with the generative backbone, HPMC enables the model to autonomously anchor generations in the distant past with bounded computational cost, eliminating the need for explicit geometric priors. Second, we propose an Uncertainty-aware Action Labeling module that discretizes continuous motion into a tri-state logic. This strategy maximizes the utilization of raw video data while shielding the deterministic action space from being corrupted by noisy trajectories, ensuring robust action-response learning. Furthermore, guided by insights from a pilot toy study, we employ a Revisit-Dense Finetuning Strategy using a compact, 30-minute dataset to efficiently activate the model’s long-range loop-closure capabilities. Extensive experiments, including objective metrics and user studies, demonstrate that Infinite-World achieves superior performance in visual quality, action controllability, and spatial consistency.

[572] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Xinshun Wang, Peiming Li, Ziyi Wang, Zhongbin Fang, Zhichao Deng, Songtao Wu, Jason Li, Mengyuan Liu

Main category: cs.CV

TL;DR: Superman is a unified multimodal LLM framework that bridges visual perception with temporal skeleton-based motion generation, addressing fragmentation between perception and generation models in human motion analysis.

Details

Motivation: Current human motion analysis suffers from fragmentation: perception models only output text from video, generation models can't perceive visual input, generative MLLMs are limited to single-frame static poses, and motion vocabularies are built from skeleton data alone, severing visual links.

Method: Two-fold approach: 1) Vision-Guided Motion Tokenizer that leverages geometric alignment between 3D skeletons and visual data for joint learning from both modalities, creating unified cross-modal motion vocabulary; 2) Single unified MLLM architecture trained to handle all tasks, processing diverse temporal inputs for both perception (3D skeleton pose estimation from video) and generation (motion prediction and in-betweening).

Result: Extensive experiments on standard benchmarks including Human3.6M demonstrate state-of-the-art or competitive performance across all motion tasks, showing efficient and scalable path for generative motion analysis using skeletons.

Conclusion: Superman successfully bridges visual perception with temporal skeleton-based motion generation, creating a unified framework that addresses fragmentation in human motion analysis and enables more efficient, scalable generative motion analysis.

Abstract: Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between perception'' models that understand motion from video but only output text, and generation’’ models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

[573] ReasonEdit: Editing Vision-Language Models using Human Reasoning

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

Main category: cs.CV

TL;DR: ReasonEdit is the first vision-language model editor that incorporates human reasoning explanations during editing, using a codebook to store reasoning and novel topology-balanced multimodal embeddings for retrieval.

Details

Motivation: Existing model editing approaches for vision-language models don't address reasoning-heavy tasks that require both humans and models to reason about images. There's a need for editing methods that can incorporate human reasoning explanations to improve edit generalization.

Method: Proposes ReasonEdit which continuously stores human reasoning in a codebook and retrieves relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science.

Result: Achieves state-of-the-art editing performance across four VLMs on multiple rationale-based visual question answering datasets, demonstrating that using human reasoning during editing greatly improves edit generalization.

Conclusion: ReasonEdit introduces a practical model editing setup that successfully incorporates human reasoning, showing significant improvements in edit generalization for vision-language models on reasoning-heavy tasks.

Abstract: Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images.We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

[574] Catalyst: Out-of-Distribution Detection via Elastic Scaling

Abid Hassan, Tuan Ngo, Saad Shafiq, Nenad Medvidovic

Main category: cs.CV

TL;DR: Catalyst is a post-hoc OOD detection framework that uses pre-pooling feature map statistics to compute an input-dependent scaling factor that multiplicatively modulates existing OOD scores, improving separation between ID and OOD distributions.

Details

Motivation: Current post-hoc OOD detection methods rely on logits or penultimate feature vectors via global average pooling, discarding rich channel-wise statistics from pre-pooling feature maps that could provide complementary signals for better OOD detection.

Method: Catalyst computes an input-dependent scaling factor (γ) on-the-fly from raw pre-pooling feature map statistics (mean, standard deviation, maximum activation). This γ is then fused multiplicatively with existing baseline OOD scores (energy, ReAct, SCALE, KNN) to perform “elastic scaling” that pushes ID and OOD distributions further apart.

Result: Catalyst achieves substantial performance gains, reducing average False Positive Rate by 32.87% on CIFAR-10 (ResNet-18), 27.94% on CIFAR-100 (ResNet-18), and 22.25% on ImageNet (ResNet-50). It works with both logit-based and distance-based detectors.

Conclusion: Pre-pooling feature map statistics contain untapped potential for OOD detection, and Catalyst provides a generalizable framework that complements existing approaches by exploiting these under-explored signals.

Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of deep neural networks. State-of-the-art post-hoc methods typically derive OOD scores from the output logits or penultimate feature vector obtained via global average pooling (GAP). We contend that this exclusive reliance on the logit or feature vector discards a rich, complementary signal: the raw channel-wise statistics of the pre-pooling feature map lost in GAP. In this paper, we introduce Catalyst, a post-hoc framework that exploits these under-explored signals. Catalyst computes an input-dependent scaling factor ($γ$) on-the-fly from these raw statistics (e.g., mean, standard deviation, and maximum activation). This $γ$ is then fused with the existing baseline score, multiplicatively modulating it – an ``elastic scaling’’ – to push the ID and OOD distributions further apart. We demonstrate Catalyst is a generalizable framework: it seamlessly integrates with logit-based methods (e.g., Energy, ReAct, SCALE) and also provides a significant boost to distance-based detectors like KNN. As a result, Catalyst achieves substantial and consistent performance gains, reducing the average False Positive Rate by 32.87 on CIFAR-10 (ResNet-18), 27.94% on CIFAR-100 (ResNet-18), and 22.25% on ImageNet (ResNet-50). Our results highlight the untapped potential of pre-pooling statistics and demonstrate that Catalyst is complementary to existing OOD detection approaches.

[575] SelvaMask: Segmenting Trees in Tropical Forests and Beyond

Simon-Olivier Duguay, Hugo Baudchon, Etienne Laliberté, Helene Muller-Landau, Gonzalo Rivas-Torres, Arthur Ouaknine

Main category: cs.CV

TL;DR: SelvaMask introduces a new tropical forest dataset with 8,800 manually delineated tree crowns and a modular detection-segmentation pipeline using vision foundation models for improved individual tree crown segmentation in dense tropical forests.

Details

Motivation: Tropical forests are critical for biodiversity and carbon storage, but accurate individual tree crown segmentation remains challenging, especially in dense tropical forests where current transformer-based models perform poorly.

Method: Created SelvaMask dataset with 8,800 manually delineated tree crowns across three Neotropical sites. Developed modular detection-segmentation pipeline that adapts vision foundation models using domain-specific detection-prompter.

Result: Achieved state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. Validated gains on external tropical and temperate datasets.

Conclusion: SelvaMask serves as both a challenging benchmark and key enabler for generalized forest monitoring, with the dataset and code to be released publicly.

Abstract: Tropical forests harbor most of the planet’s tree biodiversity and are critical to global ecological balance. Canopy trees in particular play a disproportionate role in carbon storage and functioning of these ecosystems. Studying canopy trees at scale requires accurate delineation of individual tree crowns, typically performed using high-resolution aerial imagery. Despite advances in transformer-based models for individual tree crown segmentation, performance remains low in most forests, especially tropical ones. To this end, we introduce SelvaMask, a new tropical dataset containing over 8,800 manually delineated tree crowns across three Neotropical forest sites in Panama, Brazil, and Ecuador. SelvaMask features comprehensive annotations, including an inter-annotator agreement evaluation, capturing the dense structure of tropical forests and highlighting the difficulty of the task. Leveraging this benchmark, we propose a modular detection-segmentation pipeline that adapts vision foundation models (VFMs), using domain-specific detection-prompter. Our approach reaches state-of-the-art performance, outperforming both zero-shot generalist models and fully supervised end-to-end methods in dense tropical forests. We validate these gains on external tropical and temperate datasets, demonstrating that SelvaMask serves as both a challenging benchmark and a key enabler for generalized forest monitoring. Our code and dataset will be released publicly.

[576] UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Dianyi Wang, Chaofan Ma, Feng Han, Size Wu, Wei Song, Yibin Wang, Zhixiong Zhang, Tianhang Wang, Siyuan Wang, Zhongyu Wei, Jiaqi Wang

Main category: cs.CV

TL;DR: UniReason is a unified multimodal framework that integrates text-to-image generation and image editing through a dual reasoning paradigm, treating them as interconnected steps rather than isolated capabilities.

Details

Motivation: Current unified multimodal models struggle with complex synthesis tasks requiring deep reasoning and treat text-to-image generation and image editing as separate capabilities rather than interconnected reasoning steps.

Method: Proposes a dual reasoning paradigm: 1) generation as world knowledge-enhanced planning with implicit constraints, and 2) editing for fine-grained visual refinement via self-reflection. Unifies both tasks within a shared representation, mirroring human cognitive process of planning followed by refinement. Constructs a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains for planning, plus agent-generated corpus for visual self-correction.

Result: Extensive experiments show UniReason achieves advanced performance on reasoning-intensive benchmarks (WISE, KrisBench, UniREditBench) while maintaining superior general synthesis capabilities.

Conclusion: UniReason successfully unifies generation and editing through a reasoning-centric approach, demonstrating improved performance on complex multimodal synthesis tasks that require deep reasoning.

Abstract: Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning to inject implicit constraints, and leverage editing capabilities for fine-grained visual refinement to further correct visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense, physics, etc.) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench and UniREditBench, while maintaining superior general synthesis capabilities.

[577] Multi-head automated segmentation by incorporating detection head into the contextual layer neural network

Edwin Kys, Febian Febian

Main category: cs.CV

TL;DR: A gated multi-head Transformer architecture based on Swin U-Net for medical image segmentation that uses slice-level structure detection to gate segmentation predictions, suppressing false positives in anatomically invalid slices.

Details

Motivation: Conventional deep learning segmentation models often produce anatomically implausible false positives (hallucinations) in slices lacking target structures, which is problematic for clinical radiotherapy applications.

Method: Proposes a gated multi-head Transformer architecture based on Swin U-Net with inter-slice context integration and a parallel detection head. The model jointly performs slice-level structure detection via MLP and pixel-level segmentation through a context-enhanced stream, using detection outputs to gate segmentation predictions.

Result: The gated model substantially outperforms non-gated baseline, achieving mean Dice loss of 0.013 ± 0.036 vs 0.732 ± 0.314 on Prostate-Anatomical-Edge-Cases dataset. Detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations.

Conclusion: Detection-based gating enhances robustness and anatomical plausibility in automated segmentation, reducing hallucinated predictions without compromising segmentation quality in valid slices, offering promise for improving clinical radiotherapy auto-contouring reliability.

Abstract: Deep learning based auto segmentation is increasingly used in radiotherapy, but conventional models often produce anatomically implausible false positives, or hallucinations, in slices lacking target structures. We propose a gated multi-head Transformer architecture based on Swin U-Net, augmented with inter-slice context integration and a parallel detection head, which jointly performs slice-level structure detection via a multi-layer perceptron and pixel-level segmentation through a context-enhanced stream. Detection outputs gate the segmentation predictions to suppress false positives in anatomically invalid slices, and training uses slice-wise Tversky loss to address class imbalance. Experiments on the Prostate-Anatomical-Edge-Cases dataset from The Cancer Imaging Archive demonstrate that the gated model substantially outperforms a non-gated segmentation-only baseline, achieving a mean Dice loss of $0.013 \pm 0.036$ versus $0.732 \pm 0.314$, with detection probabilities strongly correlated with anatomical presence, effectively eliminating spurious segmentations. In contrast, the non-gated model exhibited higher variability and persistent false positives across all slices. These results indicate that detection-based gating enhances robustness and anatomical plausibility in automated segmentation applications, reducing hallucinated predictions without compromising segmentation quality in valid slices, and offers a promising approach for improving the reliability of clinical radiotherapy auto-contouring workflows.

[578] PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Zehong Ma, Ruihan Xu, Shiliang Zhang

Main category: cs.CV

TL;DR: PixelGen is a pixel diffusion framework with perceptual supervision that outperforms latent diffusion models by using LPIPS and DINO-based losses to learn meaningful perceptual manifolds instead of full image manifolds.

Details

Motivation: Pixel diffusion avoids artifacts from VAEs in latent diffusion but struggles with high-dimensional pixel manifolds containing perceptually irrelevant signals. Existing pixel diffusion methods lag behind latent diffusion models due to optimization challenges in pixel space.

Method: Introduces two complementary perceptual losses: LPIPS loss for better local patterns and DINO-based perceptual loss for stronger global semantics. This guides diffusion models to learn meaningful perceptual manifolds rather than full image manifolds.

Result: Achieves FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs. Demonstrates favorable scaling on large-scale text-to-image generation with GenEval score of 0.79. Surpasses strong latent diffusion baselines.

Conclusion: PixelGen provides a simpler yet more powerful generative paradigm that requires no VAEs, latent representations, or auxiliary stages, offering an end-to-end pixel diffusion approach with perceptual supervision.

Abstract: Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to optimize high-dimensional pixel manifolds that contain many perceptually irrelevant signals, leaving existing pixel diffusion methods lagging behind latent diffusion models. We propose PixelGen, a simple pixel diffusion framework with perceptual supervision. Instead of modeling the full image manifold, PixelGen introduces two complementary perceptual losses to guide diffusion model towards learning a more meaningful perceptual manifold. An LPIPS loss facilitates learning better local patterns, while a DINO-based perceptual loss strengthens global semantics. With perceptual supervision, PixelGen surpasses strong latent diffusion baselines. It achieves an FID of 5.11 on ImageNet-256 without classifier-free guidance using only 80 training epochs, and demonstrates favorable scaling performance on large-scale text-to-image generation with a GenEval score of 0.79. PixelGen requires no VAEs, no latent representations, and no auxiliary stages, providing a simpler yet more powerful generative paradigm. Codes are publicly available at https://github.com/Zehong-Ma/PixelGen.

[579] Towards Artwork Explanation in Large-scale Vision Language Models

Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe

Main category: cs.CV

TL;DR: Proposes artwork explanation generation task to evaluate LVLMs’ knowledge integration from vision and language, revealing limitations in multimodal understanding.

Details

Motivation: To clarify the extent to which Large-scale Vision-Language Models (LVLMs) can understand and integrate knowledge for explaining images, particularly focusing on complex relationships between different knowledge pieces and how they incorporate these into explanations.

Method: Proposes a new artwork explanation generation task with evaluation dataset and metrics. The task has two parts: 1) generating explanations from images and titles, 2) generating explanations using only images. Also releases training dataset for LVLMs to learn artwork explanations.

Result: LVLMs struggle with integrating language and visual information, and show even more pronounced limitations in acquiring knowledge from images alone.

Conclusion: Artwork explanation generation is a suitable task for evaluating LVLMs’ multimodal knowledge integration capabilities, revealing significant current limitations in both vision-language integration and visual knowledge acquisition.

Abstract: Large-scale Vision-Language Models (LVLMs) output text from images and instructions, demonstrating capabilities in text generation and comprehension. However, it has not been clarified to what extent LVLMs possess the ability to understand the knowledge necessary for explaining images, the complex relationships between various pieces of knowledge, and how they integrate these understandings into their explanations. To address this issue, we propose a new task: the artwork explanation generation task, along with its evaluation dataset and metrics for quantitatively assessing the understanding and utilization of knowledge about artworks. This task is apt for image description based on the premise that LVLMs are expected to have pre-existing knowledge of artworks, which are often subjects of wide recognition and documented information. It consists of two parts: generating explanations from images and titles of artworks, and generating explanations using only images, thus evaluating the LVLMs’ language-based and vision-based knowledge. Alongside, we release a training dataset for LVLMs to learn explanations that incorporate knowledge about artworks. Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone.

[580] InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction

Sirui Xu, Ziyin Wang, Yu-Xiong Wang, Liang-Yan Gui

Main category: cs.CV

TL;DR: InterDreamer: A zero-shot framework for generating 3D human-object interactions from text without training on text-interaction pairs, using decoupled semantics from LLMs and dynamics from a physics-aware world model.

Details

Motivation: Current text-conditioned human motion generation works well but struggles with 3D dynamic human-object interactions due to lack of large-scale interaction data with comprehensive text descriptions. The paper aims to generate realistic HOI sequences without direct training on text-interaction pairs.

Method: Decouples interaction semantics and dynamics. Uses pre-trained LLMs for high-level semantic control and text-to-motion models for human motion. Introduces a world model to understand simple physics and object motion influenced by human actions. Integrates these into InterDreamer framework for zero-shot generation.

Result: Applied to BEHAVE and CHAIRS datasets, InterDreamer generates realistic and coherent 3D HOI sequences that align with text directives in a zero-shot manner, demonstrating capability without direct text-interaction training data.

Conclusion: Shows potential for generating human-object interactions without text-interaction pair data by decoupling semantics and dynamics, leveraging pre-trained models for semantics and introducing physics-aware world model for dynamics.

Abstract: Text-conditioned human motion generation has experienced significant advancements with diffusion models trained on extensive motion capture data and corresponding textual annotations. However, extending such success to 3D dynamic human-object interaction (HOI) generation faces notable challenges, primarily due to the lack of large-scale interaction data and comprehensive descriptions that align with these interactions. This paper takes the initiative and showcases the potential of generating human-object interactions without direct training on text-interaction pair data. Our key insight in achieving this is that interaction semantics and dynamics can be decoupled. Being unable to learn interaction semantics through supervised training, we instead leverage pre-trained large models, synergizing knowledge from a large language model and a text-to-motion model. While such knowledge offers high-level control over interaction semantics, it cannot grasp the intricacies of low-level interaction dynamics. To overcome this issue, we further introduce a world model designed to comprehend simple physics, modeling how human actions influence object motion. By integrating these components, our novel framework, InterDreamer, is able to generate text-aligned 3D HOI sequences in a zero-shot manner. We apply InterDreamer to the BEHAVE and CHAIRS datasets, and our comprehensive experimental analysis demonstrates its capability to generate realistic and coherent interaction sequences that seamlessly align with the text directives.

[581] Efficient Transformer Encoders for Mask2Former-style models

Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

Main category: cs.CV

TL;DR: ECO-M2F introduces efficient transformer encoders for Mask2Former-style models that dynamically select the number of encoder layers based on input image complexity to reduce computational cost while maintaining performance.

Details

Motivation: Vision transformers offer powerful segmentation capabilities but are computationally expensive for deployed devices. Current approaches use fixed computation regardless of input complexity, leading to inefficient resource usage.

Method: Three-step approach: 1) Train parent architecture for early exiting from encoder, 2) Create dataset of ideal encoder layers per training example, 3) Train gating network to predict optimal layer count conditioned on input image.

Result: Reduces expected encoder computational cost while maintaining performance, adapts to various compute resources, offers architectural flexibility, and extends beyond segmentation to object detection.

Conclusion: ECO-M2F provides an efficient solution for transformer-based segmentation models by enabling input-adaptive computation, balancing performance and efficiency for practical deployment.

Abstract: Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incur high resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability for providing a balance between performance and computational efficiency, we present a three step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create an derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computational-accuracy tradeoff, only steps two and three need to be repeated which significantly reduces retraining time. Experiments on the public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.

[582] Staircase Cascaded Fusion of Lightweight Local Pattern Recognition and Long-Range Dependencies for Structural Crack Segmentation

Hui Liu, Chen Jia, Fan Shi, Xu Cheng, Mianzhao Wang, Shengyong Chen, Yang Lv

Main category: cs.CV

TL;DR: CrackSCF: Lightweight cascaded fusion network for pixel-level crack segmentation that combines local texture patterns with global dependencies for robust segmentation on resource-constrained edge devices.

Details

Motivation: Existing crack segmentation methods fail to integrate local textures with pixel dependencies, leading to fragmented predictions. High parameter counts and computational demands hinder practical deployment on edge devices.

Method: Proposes CrackSCF with: 1) Lightweight convolutional block (LRDS) to replace standard convolutions for local pattern capture, 2) Long-range Dependency Extractor (LDE) for global dependencies, 3) Staircase Cascaded Fusion Module (SCFM) to intelligently unify local and global features.

Result: Outperforms existing methods across six datasets including new TUT benchmark. Achieves 0.8382 F1 score and 0.8473 mIoU on TUT dataset with only 4.79M parameters. Shows robustness against complex background noise.

Conclusion: CrackSCF provides robust crack segmentation with exceptional computational efficiency, making it suitable for practical deployment on resource-constrained edge devices while maintaining high accuracy.

Abstract: Accurately segmenting structural cracks at the pixel level remains a major hurdle, as existing methods fail to integrate local textures with pixel dependencies, often leading to fragmented and incomplete predictions. Moreover, their high parameter counts and substantial computational demands hinder practical deployment on resource-constrained edge devices. To address these challenges, we propose CrackSCF, a Lightweight Cascaded Fusion Crack Segmentation Network designed to achieve robust crack segmentation with exceptional computational efficiency. We design a lightweight convolutional block (LRDS) to replace all standard convolutions. This approach efficiently captures local patterns while operating with a minimal computational footprint. For a holistic perception of crack structures, a lightweight Long-range Dependency Extractor (LDE) captures global dependencies. These are then intelligently unified with local patterns by our Staircase Cascaded Fusion Module (SCFM), ensuring the final segmentation maps are both seamless in continuity and rich in fine-grained detail. To comprehensively evaluate our method, this paper created the challenging TUT benchmark dataset and evaluated it alongside five other public datasets. The experimental results show that the CrackSCF method consistently outperforms the existing methods, and it demonstrates greater robustness in dealing with complex background noise. On the TUT dataset, CrackSCF achieved 0.8382 on F1 score and 0.8473 on mIoU, and it only required 4.79M parameters.

[583] MCTR: Multi Camera Tracking Transformer

Alexandru Niculescu-Mizil, Deep Patel, Iain Melvin

Main category: cs.CV

TL;DR: MCTR is an end-to-end transformer-based approach for multi-camera multi-object tracking that uses track embeddings to maintain global object information and probabilistically associates them with view-specific detections.

Details

Motivation: Multi-camera tracking remains reliant on heuristic techniques despite end-to-end methods gaining traction in single-camera tracking, creating a gap for unified end-to-end solutions.

Method: Uses DETR-like detectors for per-camera detections, maintains global track embeddings that integrate local detection information, and employs probabilistic association with differentiable losses for end-to-end training.

Result: Validated on MMPTrack and AI City Challenge datasets, demonstrating effectiveness for multi-camera multi-object tracking with overlapping fields of view.

Conclusion: MCTR provides a novel end-to-end transformer framework that addresses the limitations of heuristic methods in multi-camera tracking through differentiable probabilistic association.

Abstract: Multi-camera tracking plays a pivotal role in various real-world applications. While end-to-end methods have gained significant interest in single-camera tracking, multi-camera tracking remains predominantly reliant on heuristic techniques. In response to this gap, this paper introduces Multi-Camera Tracking tRansformer (MCTR), a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras with overlapping fields of view. MCTR leverages end-to-end detectors like DEtector TRansformer (DETR) to produce detections and detection embeddings independently for each camera view. The framework maintains set of track embeddings that encaplusate global information about the tracked objects, and updates them at every frame by integrating the local information from the view-specific detection embeddings. The track embeddings are probabilistically associated with detections in every camera view and frame to generate consistent object tracks. The soft probabilistic association facilitates the design of differentiable losses that enable end-to-end training of the entire system. To validate our approach, we conduct experiments on MMPTrack and AI City Challenge, two recently introduced large-scale multi-camera multi-object tracking datasets.

Haisheng Su, Wei Wu, Zhenjie Yang, Isabel Guan

Main category: cs.CV

TL;DR: EgoFSD: An ego-centric fully sparse paradigm for end-to-end autonomous driving using sparse perception, hierarchical interaction, and iterative motion planning with diffusion models.

Details

Motivation: Current end-to-end autonomous driving methods suffer from unsatisfactory performance and inferior efficiency due to rasterized scene representation learning and redundant information transmission, lacking ego-centric designs.

Method: Proposes EgoFSD with three components: 1) Sparse perception module for detection and online mapping using sparse representations, 2) Hierarchical interaction module for selecting closest in-path vehicles/stationary objects with geometric priors, 3) Iterative motion planner with joint motion prediction and multi-modal trajectory optimization using position-level motion diffusion and trajectory-level planning denoising.

Result: Significantly reduces average L2 error by 59% and collision rate by 92% compared to UniAD while achieving 6.9x faster running efficiency on nuScenes and Bench2Drive datasets.

Conclusion: EgoFSD demonstrates superior performance and efficiency for end-to-end autonomous driving through ego-centric sparse representation and diffusion-based uncertainty modeling.

Abstract: Current End-to-End Autonomous Driving (E2E-AD) methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized with a fully differentiable framework in a planning-oriented manner, existing end-to-end driving systems lacking ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, due to rasterized scene representation learning and redundant information transmission. In this paper, we propose an ego-centric fully sparse paradigm, named EgoFSD, for end-to-end self-driving. Specifically, EgoFSD consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. In addition, position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thereby enhancing the training stability and convergence speed. Extensive experiments are conducted on nuScenes and Bench2Drive datasets, which significantly reduces the average L2 error by 59% and collision rate by 92% than UniAD while achieves 6.9x faster running efficiency.

[585] Advances in Photoacoustic Imaging Reconstruction and Quantitative Analysis for Biomedical Applications

Lei Wang, Weiming Zeng, Kai Long, Hongyu Chen, Rongfeng Lan, Li Liu, Wai Ting Siok, Nizhuan Wang

Main category: cs.CV

TL;DR: Review paper on photoacoustic imaging covering fundamentals, implementations (PACT, PAM, PAE), deep learning for reconstruction/artifact reduction, quantitative analysis, and future directions.

Details

Motivation: Photoacoustic imaging combines optical resolution with acoustic penetration depth but faces clinical challenges including depth-resolution trade-offs and speed limitations. The paper aims to review fundamentals, implementations, and recent advances including deep learning approaches.

Method: Comprehensive review paper analyzing three main PAI implementations: photoacoustic computed tomography (PACT), photoacoustic microscopy (PAM), and photoacoustic endoscopy (PAE). Examines conventional and deep learning methodologies for image reconstruction and artifact mitigation, plus quantitative analysis techniques.

Result: The review demonstrates that deep learning approaches show considerable potential for enhancing image quality and accelerating imaging processes in PAI. Recent developments in quantitative analysis enable measurement of physiological parameters like hemoglobin concentration and oxygen saturation.

Conclusion: Photoacoustic imaging has promising clinical potential but faces implementation challenges. Deep learning is transformative for advancing PAI capabilities in reconstruction, artifact reduction, and quantitative analysis. Future research should focus on addressing current limitations and expanding clinical applications.

Abstract: Photoacoustic imaging (PAI) represents an innovative biomedical imaging modality that harnesses the advantages of optical resolution and acoustic penetration depth while ensuring enhanced safety. Despite its promising potential across a diverse array of preclinical and clinical applications, the clinical implementation of PAI faces significant challenges, including the trade-off between penetration depth and spatial resolution, as well as the demand for faster imaging speeds. This paper explores the fundamental principles underlying PAI, with a particular emphasis on three primary implementations: photoacoustic computed tomography (PACT), photoacoustic microscopy (PAM), and photoacoustic endoscopy (PAE). We undertake a critical assessment of their respective strengths and practical limitations. Furthermore, recent developments in utilizing conventional or deep learning (DL) methodologies for image reconstruction and artefact mitigation across PACT, PAM, and PAE are outlined, demonstrating considerable potential to enhance image quality and accelerate imaging processes. Furthermore, this paper examines the recent developments in quantitative analysis within PAI, including the quantification of haemoglobin concentration, oxygen saturation, and other physiological parameters within tissues. Finally, our discussion encompasses current trends and future directions in PAI research while emphasizing the transformative impact of deep learning on advancing PAI.

[586] Edge Weight Prediction For Category-Agnostic Pose Estimation

Or Hirschorn, Shai Avidan

Main category: cs.CV

TL;DR: EdgeCape improves category-agnostic pose estimation by predicting dynamic edge weights in pose graphs and using Markovian structural bias to enhance spatial reasoning, achieving state-of-the-art results on MP-100 benchmark.

Details

Motivation: Existing category-agnostic pose estimation methods use static pose graphs with equal-weight edges, which leads to suboptimal results. The authors aim to overcome these limitations by introducing dynamic edge weight prediction and better structural priors.

Method: EdgeCape introduces two key innovations: 1) predicting edge weights in the pose graph to optimize localization, and 2) Markovian Structural Bias that modulates self-attention between nodes based on hop distance to capture global spatial dependencies.

Result: Evaluated on MP-100 benchmark (100 categories, 20K+ images), EdgeCape achieves state-of-the-art results in 1-shot setting and leads among similar-sized methods in 5-shot setting, significantly improving keypoint localization accuracy.

Conclusion: Dynamic edge weight prediction and Markovian structural bias effectively improve category-agnostic pose estimation by better handling occlusions and capturing spatial dependencies, advancing the state-of-the-art in few-shot pose estimation.

Abstract: Category-Agnostic Pose Estimation (CAPE) localizes keypoints across diverse object categories with a single model, using one or a few annotated support images. Recent works have shown that using a pose graph (i.e., treating keypoints as nodes in a graph rather than isolated points) helps handle occlusions and break symmetry. However, these methods assume a static pose graph with equal-weight edges, leading to suboptimal results. We introduce EdgeCape, a novel framework that overcomes these limitations by predicting the graph’s edge weights which optimizes localization. To further leverage structural priors, we propose integrating Markovian Structural Bias, which modulates the self-attention interaction between nodes based on the number of hops between them. We show that this improves the model’s ability to capture global spatial dependencies. Evaluated on the MP-100 benchmark, which includes 100 categories and over 20K images, EdgeCape achieves state-of-the-art results in the 1-shot setting and leads among similar-sized methods in the 5-shot setting, significantly improving keypoint localization accuracy. Our code is publicly available.

[587] Feat2GS: Probing Visual Foundation Models with Gaussian Splatting

Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, Yuliang Xiu

Main category: cs.CV

TL;DR: Feat2GS is a framework that probes 3D awareness of visual foundation models by reconstructing 3D Gaussians from VFM features extracted from unposed images, enabling novel view synthesis without 3D ground truth data.

Details

Motivation: Visual foundation models are trained on extensive 2D image datasets but their 3D world understanding remains unclear. Existing 3D probing methods have limitations: they focus on single-view 2.5D estimation or two-view sparse 2D correspondence, ignore texture awareness, and require 3D ground truth data which limits evaluation scale and diversity.

Method: Feat2GS reads out 3D Gaussian attributes (geometry: position, opacity, covariance; texture: color) from VFM features extracted from unposed images. The disentanglement of 3DGS parameters enables separate analysis of texture and geometry awareness. The framework allows probing 3D awareness via novel view synthesis without requiring 3D data.

Result: Extensive experiments probe 3D awareness of several VFMs and investigate ingredients that lead to 3D-aware VFMs. Building on findings, several variants achieve state-of-the-art performance across diverse datasets.

Conclusion: Feat2GS provides a unified framework to fairly and comprehensively probe VFM 3D awareness for both geometry and texture via novel view synthesis. It serves as both a probing tool for VFMs and a simple-yet-effective baseline for novel-view synthesis.

Abstract: Given that visual foundation models (VFMs) are trained on extensive datasets but often limited to 2D images, a natural question arises: how well do they understand the 3D world? With the differences in architecture and training protocols (i.e., objectives, proxy tasks), a unified framework to fairly and comprehensively probe their 3D awareness is urgently needed. Existing works on 3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately, these tasks ignore texture awareness, and require 3D data as ground-truth, which limits the scale and diversity of their evaluation set. To address these issues, we introduce Feat2GS, which readout 3D Gaussians attributes from VFM features extracted from unposed images. This allows us to probe 3D awareness for geometry and texture via novel view synthesis, without requiring 3D data. Additionally, the disentanglement of 3DGS parameters - geometry ($\boldsymbol{x}$, $α$, $Σ$) and texture ($\boldsymbol{c}$) - enables separate analysis of texture and geometry awareness. Under Feat2GS, we conduct extensive experiments to probe the 3D awareness of several VFMs, and investigate the ingredients that lead to a 3D aware VFM. Building on these findings, we develop several variants that achieve state-of-the-art across diverse datasets. This makes Feat2GS useful for probing VFMs, and as a simple-yet-effective baseline for novel-view synthesis. Code and data are available at https://fanegg.github.io/Feat2GS/.

Zhong Peng, Yishi Xu, Gerong Wang, Wenchao Chen, Bo Chen, Jing Zhang, Hongwei Liu

Main category: cs.CV

TL;DR: Duplex framework for compositional zero-shot learning uses dual-prototype learning with dynamic local-graph refinement to improve recognition of unseen state-object pairs by addressing weak discriminative text prototypes and passive optimization of unseen pairs.

Details

Motivation: Current CZSL methods using vision-language models have limitations: (1) text-driven semantic prototypes are weakly discriminative in visual feature space, and (2) unseen pairs are optimized passively, leading to seen bias.

Method: Duplex maintains dual prototypes per composition: semantic prototype via prompt learning and visual prototype constructed by recombining disentangled state/object primitives from seen images. Visual prototypes are dynamically updated through lightweight aggregation on mini-batch local graphs, incorporating unseen compositions during training without labels.

Result: Experiments on MIT-States, UT-Zappos, and CGQA datasets in both closed-world and open-world settings achieve competitive performance and consistent compositional generalization.

Conclusion: Duplex addresses key limitations in CZSL by introducing fine-grained visual evidence while preserving semantic structure, better disambiguating semantically similar yet visually distinct pairs, and mitigating seen bias.

Abstract: Compositional Zero-Shot Learning (CZSL) seeks to recognize unseen state-object pairs by recombining primitives learned from seen compositions. Despite recent progress with vision-language models (VLMs), two limitations remain: (i) text-driven semantic prototypes are weakly discriminative in the visual feature space; and (ii) unseen pairs are optimized passively, thereby inducing seen bias. To address these limitations, we present Duplex, a framework that couples dual-prototype learning with dynamic local-graph refinement of visual prototypes. For each composition, Duplex maintains a semantic prototype via prompt learning and a visual prototype for unseen pairs constructed by recombining disentangled state and object primitives from seen images. The visual prototypes are updated dynamically through lightweight aggregation on mini-batch local graphs, which incorporates unseen compositions during training without labels. This design introduces fine-grained visual evidence while preserving semantic structure. It enriches class prototypes, better disambiguates semantically similar yet visually distinct pairs, and mitigates seen bias. Experiments on MIT-States, UT-Zappos, and CGQA in closed-world and open-world settings achieve competitive performance and consistent compositional generalization. Our source code is available at https://github.com/ISPZ/Duplex-CZSL.

[589] 3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight

Yuxin He, Ruihao Zhang, Xianzu Wu, Zhiyuan Zhang, Cheng Ding, Qiang Nie

Main category: cs.CV

TL;DR: 3D dynamics-aware manipulation framework that integrates 3D world modeling and policy learning through self-supervised tasks for improved manipulation performance with depth-wise movements.

Details

Motivation: Existing manipulation policies only model 2D visual dynamics, which is insufficient for robust manipulation when tasks involve prominent depth-wise movement. There's a need for 3D-aware world modeling to handle complex manipulation tasks.

Method: Proposes a 3D dynamics-aware manipulation framework with three self-supervised learning tasks: current depth estimation, future RGB-D prediction, and 3D flow prediction. These tasks complement each other to endow the policy model with 3D foresight.

Result: Extensive experiments on simulation and real-world show that 3D foresight greatly boosts manipulation policy performance without sacrificing inference speed.

Conclusion: 3D world modeling integrated with policy learning significantly improves manipulation performance for tasks involving depth-wise movements, demonstrating the importance of 3D foresight in manipulation policies.

Abstract: The incorporation of world modeling into manipulation policy learning has pushed the boundary of manipulation performance. However, existing efforts simply model the 2D visual dynamics, which is insufficient for robust manipulation when target tasks involve prominent depth-wise movement. To address this, we present a 3D dynamics-aware manipulation framework that seamlessly integrates 3D world modeling and policy learning. Three self-supervised learning tasks (current depth estimation, future RGB-D prediction, 3D flow prediction) are introduced within our framework, which complement each other and endow the policy model with 3D foresight. Extensive experiments on simulation and the real world show that 3D foresight can greatly boost the performance of manipulation policies without sacrificing inference speed. Code is available at https://github.com/Stardust-hyx/3D-Foresight.

[590] The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey

Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, Xiang Bai

Main category: cs.CV

TL;DR: A comprehensive survey of Driving World Models (DWMs) that predict scene evolution for autonomous driving, covering ecosystem components, modality-based categorization, performance evaluation, and future directions.

Details

Motivation: Driving World Models have emerged as a promising paradigm for autonomous driving by enabling systems to better perceive, understand, and interact with dynamic driving environments through scene evolution prediction.

Method: The survey reviews the DWM ecosystem including simulators, datasets, and evaluation metrics, categorizes approaches by predicted scene modalities (video, point cloud, occupancy, latent feature, traffic map), and summarizes applications in autonomous driving research.

Result: The survey provides a comprehensive overview of current DWM research, presents performance comparisons of representative approaches across generating and driving tasks, and identifies relevant papers in a curated collection.

Conclusion: DWMs are valuable for autonomous driving development, with the survey offering insights into their broader adoption while discussing current limitations and proposing future research directions.

Abstract: The Driving World Model (DWM), which focuses on predicting scene evolution during the driving process, has emerged as a promising paradigm in the pursuit of autonomous driving (AD). DWMs enable AD systems to better perceive, understand, and interact with dynamic driving environments. In this survey, we provide a comprehensive overview of the latest progress in DWM. First, we review the DWM ecosystem, which is constructed using mainstream simulators, high-impact datasets, and various metrics that evaluate DWMs across multiple dimensions. We then categorize existing approaches based on the modalities of the predicted scenes, including video, point cloud, occupancy, latent feature, and traffic map, and summarize their specific applications in AD research. In addition, the performance of representative approaches across generating and driving tasks is presented. Finally, we discuss the potential limitations of current research and propose future directions. This survey provides valuable insights into the development and application of DWM, fostering its broader adoption in AD. The relevant papers are collected at https://github.com/LMD0311/Awesome-World-Model.

[591] InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions

Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui

Main category: cs.CV

TL;DR: InterMimic is a framework for learning robust human-object interaction policies from imperfect motion capture data using a curriculum strategy of training subject-specific teachers first, then distilling them into a student policy with RL fine-tuning.

Details

Motivation: Realistic simulation of human-object interactions is challenging due to complex coupling, object geometry variability, and artifacts in motion capture data like inaccurate contacts and limited hand detail.

Method: Uses curriculum strategy: 1) Train subject-specific teacher policies to mimic, retarget, and refine MoCap data, 2) Distill teachers into student policy with teachers providing supervision and references, 3) RL fine-tuning on student policy to surpass demonstration replication.

Result: Produces realistic and diverse interactions across multiple HOI datasets, generalizes zero-shot, and integrates with kinematic generators, elevating from imitation to generative modeling of complex human-object interactions.

Conclusion: InterMimic enables robust learning from imperfect MoCap data for diverse full-body human-object interactions, achieving realistic simulations that generalize well and support generative modeling.

Abstract: Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy – perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.

[592] SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou

Main category: cs.CV

TL;DR: SuperCarver is a 3D geometry super-resolution pipeline that enhances coarse meshes by adding texture-consistent surface details using a prior-guided normal diffusion model and noise-resistant inverse rendering.

Details

Motivation: Manual 3D sculpting is labor-intensive, and while AI has advanced 3D content creation, synthesizing realistic surface details remains challenging. There's a need for tools to enhance geometry fidelity of existing low-quality 3D meshes rather than just generating new ones from scratch.

Method: 1) Render coarse textured mesh from multiple viewpoints; 2) Fine-tune deterministic prior-guided normal diffusion model on paired detail-lacking/detail-rich normal maps; 3) Use noise-resistant inverse rendering via deformable distance field to update mesh surfaces from predicted normal maps.

Result: SuperCarver generates realistic and expressive surface details that align with actual texture appearance, effectively upgrading historical low-quality 3D assets and reducing workload for sculpting high-poly meshes.

Conclusion: SuperCarver provides a powerful tool for 3D geometry super-resolution, addressing the challenge of enhancing existing mesh quality with texture-consistent surface details through a novel diffusion-based approach.

Abstract: Conventional production workflow of high-precision mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.

[593] DenseFormer: Learning Dense Depth Map from Sparse Depth and Image via Conditional Diffusion Model

Ming Yuan, Chuang Zhang, Lei He, Qing Xu, Jianqiang Wang

Main category: cs.CV

TL;DR: DenseFormer integrates diffusion models into depth completion, using a feature pyramid with deformable attention and iterative refinement to generate dense depth maps from sparse depth and RGB images.

Details

Motivation: Depth completion is critical for autonomous driving, but existing methods rely on spatial propagation networks. The authors aim to leverage diffusion models' denoising capabilities for better depth completion by progressively refining random depth distributions.

Method: Proposes DenseFormer with: 1) Feature extraction module using feature pyramid structure and multi-layer deformable attention to integrate sparse depth and RGB features as diffusion guidance, 2) Diffusion process that generates dense depth by refining random distributions, 3) Depth refinement module with multi-step iterative refinement using multi-scale image features and sparse depth input.

Result: Extensive experiments on KITTI outdoor scene dataset show DenseFormer outperforms classical depth completion methods.

Conclusion: DenseFormer successfully integrates diffusion models into depth completion, demonstrating superior performance through its feature extraction and refinement modules.

Abstract: The depth completion task is a critical problem in autonomous driving, involving the generation of dense depth maps from sparse depth maps and RGB images. Most existing methods employ a spatial propagation network to iteratively refine the depth map after obtaining an initial dense depth. In this paper, we propose DenseFormer, a novel method that integrates the diffusion model into the depth completion task. By incorporating the denoising mechanism of the diffusion model, DenseFormer generates the dense depth map by progressively refining an initial random depth distribution through multiple iterations. We propose a feature extraction module that leverages a feature pyramid structure, along with multi-layer deformable attention, to effectively extract and integrate features from sparse depth maps and RGB images, which serve as the guiding condition for the diffusion process. Additionally, this paper presents a depth refinement module that applies multi-step iterative refinement across various ranges to the dense depth results generated by the diffusion process. The module utilizes image features enriched with multi-scale information and sparse depth input to further enhance the accuracy of the predicted depth map. Extensive experiments on the KITTI outdoor scene dataset demonstrate that DenseFormer outperforms classical depth completion methods.

[594] SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

Sen Fang, Yalin Feng, Chunyu Sui, Hongbin Zhong, Hongwei Yi, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: SignX is a novel framework for continuous sign language recognition that operates in a compact pose-rich latent space, encoding multiple pose formats into unified representations and achieving state-of-the-art accuracy with reduced computational consumption.

Details

Motivation: Sign language data processing is complex, and current approaches translate RGB videos through pose information into English-based ID Glosses. The authors aim to create a more efficient and accurate continuous sign language recognition system by operating in a compact pose-rich latent space rather than raw video data.

Method: 1) Construct unified latent representation encoding heterogeneous pose formats (SMPLer-X, DWPose, Mediapipe, PrimeDepth, Sapiens Segmentation) into compact information-dense space. 2) Train ViT-based Video2Pose module to extract latent representation directly from raw videos. 3) Develop temporal modeling and sequence refinement method operating entirely in latent space.

Result: SignX achieves state-of-the-art accuracy on continuous sign language recognition while significantly reducing computational consumption through its multi-stage design that operates in compact pose-rich latent space.

Conclusion: The proposed SignX framework successfully demonstrates that operating in a compact pose-rich latent space enables efficient and accurate continuous sign language recognition, representing an advancement in sign language processing technology.

Abstract: The complexity of sign language data processing brings many challenges. The current approach to recognition of ASL signs aims to translate RGB sign language videos through pose information into English-based ID Glosses, which serve to uniquely identify ASL signs. This paper proposes SignX, a novel framework for continuous sign language recognition in compact pose-rich latent space. First, we construct a unified latent representation that encodes heterogeneous pose formats (SMPLer-X, DWPose, Mediapipe, PrimeDepth, and Sapiens Segmentation) into a compact, information-dense space. Second, we train a ViT-based Video2Pose module to extract this latent representation directly from raw videos. Finally, we develop a temporal modeling and sequence refinement method that operates entirely in this latent space. This multi-stage design achieves end-to-end sign language recognition while significantly reducing computational consumption. Experimental results demonstrate that SignX achieves state-of-the-art accuracy on continuous sign language recognition.

[595] Facial Recognition Leveraging Generative Adversarial Networks

Zhongwen Li, Zongwei Li, Xiaoqi Li

Main category: cs.CV

TL;DR: Proposes GAN-based data augmentation with residual generator and FaceNet discriminator to improve face recognition with limited training data.

Details

Motivation: Deep learning face recognition requires large datasets that are often unavailable in practice, creating a need for effective data augmentation methods to overcome data scarcity.

Method: GAN-based approach with three components: 1) residual-embedded generator to prevent gradient issues, 2) Inception ResNet-V1 based FaceNet discriminator for better adversarial training, and 3) end-to-end framework jointly optimizing data generation and recognition.

Result: Achieves stable training and improves face recognition accuracy by 12.7% on LFW benchmark compared to baselines, with good generalization using limited samples.

Conclusion: The proposed GAN-based data augmentation method effectively addresses data scarcity in face recognition, demonstrating significant performance improvements and stable training dynamics.

Abstract: Face recognition performance based on deep learning heavily relies on large-scale training data, which is often difficult to acquire in practical applications. To address this challenge, this paper proposes a GAN-based data augmentation method with three key contributions: (1) a residual-embedded generator to alleviate gradient vanishing/exploding problems, (2) an Inception ResNet-V1 based FaceNet discriminator for improved adversarial training, and (3) an end-to-end framework that jointly optimizes data generation and recognition performance. Experimental results demonstrate that our approach achieves stable training dynamics and significantly improves face recognition accuracy by 12.7% on the LFW benchmark compared to baseline methods, while maintaining good generalization capability with limited training samples.

[596] MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Jinhua Zhang, Wei Long, Minghao Han, Weiyi You, Shuhang Gu

Main category: cs.CV

TL;DR: MVAR introduces scale and spatial Markov assumptions to reduce redundancy in visual autoregressive modeling, achieving efficient image generation with lower memory and computational costs.

Details

Motivation: Current next-scale prediction methods for visual generation suffer from scale and spatial redundancy by conditioning each scale on all previous scales and requiring each token to attend to all preceding tokens, leading to high computational complexity and memory consumption.

Method: Proposes Markovian Visual AutoRegressive modeling (MVAR) with two key innovations: 1) scale-Markov trajectory that only uses adjacent preceding scale features for next-scale prediction, enabling parallel training, and 2) spatial-Markov attention that restricts attention to localized neighborhoods (size k) at corresponding positions on adjacent scales.

Result: Reduces attention complexity from O(N²) to O(Nk), enables training with just 8 RTX 4090 GPUs, eliminates KV cache during inference, reduces average GPU memory footprint by 3.0x, and achieves comparable or superior performance on ImageNet with both small from-scratch and large fine-tuned models.

Conclusion: MVAR effectively mitigates redundancy in visual autoregressive modeling through Markov assumptions, providing an efficient framework for visual generation with significantly reduced computational and memory requirements while maintaining strong performance.

Abstract: Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features of adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, for the pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small model trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.

[597] U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo

Main category: cs.CV

TL;DR: U2-BENCH is the first comprehensive benchmark for evaluating large vision-language models on ultrasound understanding across multiple clinical tasks and anatomical regions.

Details

Motivation: Ultrasound interpretation is challenging due to varying image quality, noise, and anatomical complexity. While LVLMs show promise in medical domains, their performance on ultrasound remains unexplored, necessitating a standardized evaluation framework.

Method: Created U2-BENCH with 7,241 cases spanning 15 anatomical regions and 8 clinically inspired tasks across 50 ultrasound scenarios. Evaluated 23 state-of-the-art LVLMs (open/closed source, general/medical) on classification, detection, regression, and text generation tasks.

Result: LVLMs show strong performance on image-level classification but struggle with spatial reasoning and clinical language generation. The benchmark provides a rigorous testbed for assessing LVLM capabilities in ultrasound imaging.

Conclusion: U2-BENCH establishes a comprehensive evaluation framework for LVLMs in ultrasound, revealing current limitations and providing a foundation for future research in medical multimodal AI.

Abstract: Ultrasound is a widely-used imaging modality critical to global healthcare, yet its interpretation remains challenging due to its varying image quality on operators, noises, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.

[598] Pose Splatter: A 3D Gaussian Splatting Model for Quantifying Animal Pose and Appearance

Jack Goffinet, Youngjo Min, Carlo Tomasi, David E. Carlson

Main category: cs.CV

TL;DR: Pose Splatter: A novel framework using shape carving and 3D Gaussian splatting to model complete animal pose and appearance without manual annotations or per-frame optimization, enabling large-scale behavioral analysis.

Details

Motivation: Current 3D pose estimation techniques have limitations including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization, which hinder the study of subtle movements and make large-scale analyses impractical.

Method: Uses shape carving and 3D Gaussian splatting to model complete animal pose and appearance without prior knowledge of animal geometry, per-frame optimization, or manual annotations. Also proposes a rotation-invariant visual embedding technique for encoding pose and appearance as a plug-in replacement for 3D keypoint data.

Result: Experiments on mice, rats, and zebra finches datasets show Pose Splatter learns accurate 3D animal geometries, represents subtle pose variations, provides better low-dimensional pose embeddings than state-of-the-art (evaluated by humans), and generalizes to unseen data.

Conclusion: By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and behavior at high resolutions.

Abstract: Accurate and scalable quantification of animal pose and appearance is crucial for studying behavior. Current 3D pose estimation techniques, such as keypoint- and mesh-based techniques, often face challenges including limited representational detail, labor-intensive annotation requirements, and expensive per-frame optimization. These limitations hinder the study of subtle movements and can make large-scale analyses impractical. We propose Pose Splatter, a novel framework leveraging shape carving and 3D Gaussian splatting to model the complete pose and appearance of laboratory animals without prior knowledge of animal geometry, per-frame optimization, or manual annotations. We also propose a rotation-invariant visual embedding technique for encoding pose and appearance, designed to be a plug-in replacement for 3D keypoint data in downstream behavioral analyses. Experiments on datasets of mice, rats, and zebra finches show Pose Splatter learns accurate 3D animal geometries. Notably, Pose Splatter represents subtle variations in pose, provides better low-dimensional pose embeddings over state-of-the-art as evaluated by humans, and generalizes to unseen data. By eliminating annotation and per-frame optimization bottlenecks, Pose Splatter enables analysis of large-scale, longitudinal behavior needed to map genotype, neural activity, and behavior at high resolutions.

[599] CoT-RVS: Zero-Shot Chain-of-Thought Reasoning Segmentation for Videos

Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang

Main category: cs.CV

TL;DR: CoT-RVS is a training-free framework that uses zero-shot Chain-of-Thought reasoning in MLLMs for complex reasoning video object segmentation, achieving state-of-the-art performance on both explicit and implicit text queries.

Details

Motivation: Existing MLLM-based approaches for reasoning video object segmentation fail with complex temporally-sensitive queries due to inadequate temporal and spatial integration. There's a need for better handling of complex scenarios without requiring task-specific training.

Method: Proposes CoT-RVS framework that leverages MLLMs’ zero-shot Chain-of-Thought capability for temporal-semantic reasoning. It analyzes visible objects matching language queries (semantic) and selects optimal keyframes for each object (temporal). The framework is training-free and compatible with closed-source MLLMs.

Result: CoT-RVS significantly outperforms previous works on video object segmentation with both explicit and implicit queries, showing superior qualitative and quantitative performance. The training-free approach also enables extension to online video stream processing.

Conclusion: Zero-shot Chain-of-Thought reasoning in MLLMs effectively addresses complex reasoning video segmentation challenges without task-specific training, demonstrating strong temporal-semantic integration capabilities for multimodal understanding.

Abstract: Reasoning Video Object Segmentation is a challenging task, aiming at generating a mask sequence from an input video given a complex and implicit text query. While existing works finetune Multimodal Large Language Models (MLLM) for the task, they still fail in video inputs given complex temporally-sensitive queries, indicating their lack of temporal and spatial integration in complex scenarios. In this paper, we propose CoT-RVS, a novel framework employing the zero-shot Chain-of-Thought (CoT) capability of MLLM to address these complex challenges by temporal-semantic reasoning: CoT-RVS analyzes the visible objects within a given frame that possibly match the language query (semantic), and chooses a corresponding keyframe for each object that can be observed effortlessly among all frames (temporal). Notably, the CoT-RVS framework is training-free and compatible with closed-source MLLMs, which can be applied to Reasoning Video Instance Segmentation. Our framework’s training-free feature further allows its extension to process online video streams, where the CoT is used at test time to update the object of interest when a better target starts to emerge and becomes visible. We conduct extensive experiments on video object segmentation with explicit and implicit queries. The results show that CoT-RVS significantly outperforms previous works in both cases, qualitatively and quantitatively.

[600] ConsiStyle: Style Diversity in Training-Free Consistent T2I Generation

Yohai Mazuz, Janna Bruner, Lior Wolf

Main category: cs.CV

TL;DR: Training-free method for consistent character generation in text-to-image models that decouples style from subject appearance using attention manipulation and cross-image components.

Details

Motivation: Existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts, as style and appearance are often entangled. Current approaches either require large-scale fine-tuning or per-subject optimization, failing to generalize or align well with textual descriptions.

Method: Manipulates attention matrices so Queries and Keys come from anchor images defining the subject, while Values come from a parallel copy not subject-anchored. Adds cross-image components to self-attention by expanding Key and Value matrices. Aligns statistics of Value matrices to prevent style shifting.

Result: Effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles, as demonstrated in comprehensive qualitative and quantitative experiments.

Conclusion: Introduces a training-free method that jointly achieves style preservation and subject consistency across varied styles for text-to-image generation.

Abstract: In text-to-image models, consistent character generation is the task of achieving text alignment while maintaining the subject’s appearance across different prompts. However, since style and appearance are often entangled, the existing methods struggle to preserve consistent subject characteristics while adhering to varying style prompts. Current approaches for consistent text-to-image generation typically rely on large-scale fine-tuning on curated image sets or per-subject optimization, which either fail to generalize across prompts or do not align well with textual descriptions. Meanwhile, training-free methods often fail to maintain subject consistency across different styles. In this work, we introduce a training-free method that, for the first time, jointly achieves style preservation and subject consistency across varied styles. The attention matrices are manipulated such that Queries and Keys are obtained from the anchor image(s) that are used to define the subject, while the Values are imported from a parallel copy that is not subject-anchored. Additionally, cross-image components are added to the self-attention mechanism by expanding the Key and Value matrices. To do without shifting from the target style, we align the statistics of the Value matrices. As is demonstrated in a comprehensive battery of qualitative and quantitative experiments, our method effectively decouples style from subject appearance and enables faithful generation of text-aligned images with consistent characters across diverse styles.

[601] Domain Adaptation of Attention Heads for Zero-shot Anomaly Detection

Kiyoon Jeong, Jaehyuk Heo, Junyeong Son, Pilsung Kang

Main category: cs.CV

TL;DR: HeadCLIP adapts CLIP for zero-shot anomaly detection by using learnable prompts in text encoder and learnable head weights in image encoder, achieving state-of-the-art performance across industrial and medical domains.

Details

Motivation: Existing zero-shot anomaly detection methods inadequately adapt vision-language models like CLIP for anomaly detection tasks, either neglecting adaptation or implementing only partial adaptation, limiting their effectiveness.

Method: Proposes HeadCLIP with two key adaptations: 1) learnable prompts in text encoder to generalize normality/abnormality concepts, and 2) learnable head weights in image encoder to dynamically adjust attention head features. Also introduces joint anomaly score combining pixel-level information for image-level detection.

Result: Outperforms existing ZSAD methods on 17 datasets across industrial and medical domains, achieving up to 4.9%p improvement in pixel-level mAD and 3.7%p in image-level mAD for industrial domain, with comparable gains (3.2%p, 3.2%p) in medical domain.

Conclusion: HeadCLIP effectively adapts both text and image encoders of CLIP for zero-shot anomaly detection, demonstrating superior performance through comprehensive adaptation strategies and joint scoring approach.

Abstract: Zero-shot anomaly detection (ZSAD) enables anomaly detection without normal samples from target categories, addressing scenarios where task-specific training data is unavailable. However, existing ZSAD methods either neglect adaptation of vision-language models to anomaly detection or implement only partial adaptation. This paper proposes Head-adaptive CLIP (HeadCLIP), which effectively adapts both text and image encoders. HeadCLIP employs learnable prompts in the text encoder to generalize normality and abnormality concepts, and introduces learnable head weights in the image encoder to dynamically adjust attention head features for task-specific adaptation. A joint anomaly score is further proposed to leverage adapted pixel-level information for enhanced image-level detection. Experiments on 17 datasets across industrial and medical domains demonstrate that HeadCLIP outperforms existing ZSAD methods at both pixel and image levels, achieving improvements of up to 4.9%p in pixel-level mean anomaly detection score (mAD) and 3.7%p in image-level mAD in the industrial domain, with comparable gains (3.2%p, 3.2%p) in the medical domain. Code and pretrained weights are available at https://github.com/kiyoonjeong0305/HeadCLIP.

[602] Reading Recognition in the Wild

Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

Main category: cs.CV

TL;DR: A new reading recognition task for smart glasses using multimodal data (RGB, eye gaze, head pose) with a large-scale dataset and transformer model.

Details

Motivation: To enable contextual AI in always-on smart glasses by recognizing when users are reading, which is crucial for maintaining records of user-world interactions.

Method: Introduces Reading in the Wild dataset (100 hours of reading/non-reading videos), uses three modalities (egocentric RGB, eye gaze, head pose), and develops a flexible transformer model that can use these modalities individually or combined.

Result: Shows that the three modalities are relevant and complementary for reading recognition, investigates efficient encoding methods, and demonstrates the dataset’s usefulness for classifying reading types in realistic settings.

Conclusion: The work enables reading recognition in smart glasses using multimodal data, extending reading understanding studies from constrained to realistic, diverse scenarios.

Abstract: To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user’s interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.

[603] PointT2I: LLM-based text-to-image generation via keypoints

Taekyung Lee, Donggyu Lee, Myungjoo Kang

Main category: cs.CV

TL;DR: PointT2I is a framework that uses LLMs to generate human pose keypoints from text prompts, then creates images using both text and keypoints, with an LLM feedback system for refinement.

Details

Motivation: Current T2I models struggle with accurately generating images containing complex human poses described in text prompts. There's a need for better alignment between textual descriptions of human poses and generated visual outputs.

Method: Three-component framework: 1) LLM generates keypoints from text prompts without external references, 2) Image generation uses both text and keypoints, 3) LLM-based feedback system assesses semantic consistency between generated content and prompts.

Result: The framework produces accurate pose-aligned images from textual prompts without fine-tuning, representing the first approach using LLMs for keypoints-guided image generation.

Conclusion: PointT2I effectively addresses the challenge of generating images with accurate human poses from text descriptions by leveraging LLMs for keypoint generation and feedback.

Abstract: Text-to-image (T2I) generation model has made significant advancements, resulting in high-quality images aligned with an input prompt. However, despite T2I generation’s ability to generate fine-grained images, it still faces challenges in accurately generating images when the input prompt contains complex concepts, especially human pose. In this paper, we propose PointT2I, a framework that effectively generates images that accurately correspond to the human pose described in the prompt by using a large language model (LLM). PointT2I consists of three components: Keypoint generation, Image generation, and Feedback system. The keypoint generation uses an LLM to directly generate keypoints corresponding to a human pose, solely based on the input prompt, without external references. Subsequently, the image generation produces images based on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompts. Our framework is the first approach to leveraging LLM for keypoints-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.

[604] HueManity: Probing Fine-Grained Visual Perception in MLLMs

Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

Main category: cs.CV

TL;DR: HueManity is a benchmark for evaluating fine-grained visual perception in MLLMs using Ishihara-style images, revealing significant performance gaps compared to humans and specialized models.

Details

Motivation: Existing MLLM benchmarks focus on high-level visual reasoning but overlook fine-grained perceptual details, which is critical for safety and reliability applications where perceptual acuity matters.

Method: Created HueManity benchmark with 83,850 Ishihara-style images embedding alphanumeric strings to evaluate pattern recognition. Tested nine state-of-the-art MLLMs on numeric and alphanumeric tasks.

Result: MLLMs performed poorly: strongest model achieved only 33.6% accuracy on numeric task and 3% on alphanumeric task, compared to humans (99.38%, 93.25%) and fine-tuned ResNet-50 (96.5%, 94.5%).

Conclusion: MLLMs have critical weaknesses in fine-grained visual perception that conventional benchmarks miss, highlighting the need for better perceptual grounding in multimodal models.

Abstract: Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety and reliability critical settings, perceptual acuity becomes essential. We present HueManity, a scalable automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieved only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans (99.38%, 93.25%) and a fine-tuned ResNet-50 (96.5%, 94.5%). These findings expose a critical weakness in MLLMs’ perceptual grounding, one that remains obscured by conventional benchmarks emphasizing high-level semantics.

[605] NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models

Jiaming Zhang, Xin Wang, Xingjun Ma, Lingyu Qiu, Yu-Gang Jiang, Jitao Sang

Main category: cs.CV

TL;DR: NAP-Tuning extends adversarial prompt tuning to multimodal VLMs with neural augmentors for feature purification, achieving significant robustness improvements over existing methods.

Details

Motivation: Vision-language models like CLIP are vulnerable to adversarial attacks in the image modality, posing security concerns. Previous AdvPT work improved robustness with text prompts, but needed extension to multimodal settings and better feature-level protection.

Method: Extends AdvPT from text-only to multimodal prompting across text and visual modalities, uses multi-layer prompt architectures, and introduces Neural Augmentor with feature purification. Token refiners reconstruct purified features through residual connections for modality/layer-specific correction.

Result: Significantly outperforms existing methods across datasets and attack types. Under AutoAttack benchmark, outperforms strongest baselines by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures while maintaining competitive clean accuracy.

Conclusion: NAP-Tuning provides an effective framework for enhancing adversarial robustness in multimodal VLMs through neural augmentors and feature purification, addressing security vulnerabilities while preserving model performance.

Abstract: Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning).Our key innovations include: (1) extending AdvPT from text-only to multi-modal prompting across both text and visual modalities, (2) expanding from single-layer to multi-layer prompt architectures, and (3) proposing a novel architecture-level redesign through our Neural Augmentor approach, which implements feature purification to directly address the distortions introduced by adversarial attacks in feature space. Our NAP-Tuning approach incorporates token refiners that learn to reconstruct purified features through residual connections, allowing for modality-specific and layer-specific feature correction.Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures while maintaining competitive clean accuracy.

[606] HalluRNN: Mitigating Hallucinations via Recurrent Cross-Layer Reasoning in Large Vision-Language Models

Le Yu, Kaishen Wang, Jianlong Xiong, Yue Cao, Lei Zhang, Zhang Yi Tao He

Main category: cs.CV

TL;DR: HalluRNN: An architecture-level solution using recurrent cross-layer reasoning with Dual-Gated Depth Propagation Unit to mitigate hallucinations in Large Vision-Language Models by enforcing consistency across layers.

Details

Motivation: Large Vision-Language Models (LVLMs) suffer from hallucinations - generating textually plausible but visually ungrounded outputs. Existing approaches require substantial resources or task-specific configurations through data-centric fine-tuning or decoding strategies.

Method: Proposes HalluRNN with a novel Dual-Gated Depth Propagation Unit (DG-DPU) module shared across layers that recurrently refines hidden states. This enables adaptive information propagation throughout the model and enforces consistency across layers to mitigate representational drift causing hallucinations.

Result: HalluRNN achieves strong and robust performance across multiple benchmarks by fine-tuning only the DG-DPU module, providing an efficient architecture-level solution.

Conclusion: HalluRNN offers a resource-efficient architecture-level approach to reduce hallucinations in LVLMs through recurrent cross-layer reasoning, addressing representational drift with minimal fine-tuning requirements.

Abstract: Though Large Vision-Language Models (LVLMs) have achieved remarkable performance across various tasks, they are still prone to hallucinations-generating outputs that are textually plausible but visually ungrounded. While prior approaches generally address this issue through data-centric fine-tuning or innovative decoding strategies, these methods often require substantial resources or task-specific configurations. In this work, we introduce an architecture-level solution, HalluRNN, which enhances model stability through recurrent cross-layer reasoning. Specifically, we propose a novel Dual-Gated Depth Propagation Unit (DG-DPU) module, which is shared across layers and recurrently refines hidden states. This allows for the adaptive propagation of information throughout the model, enforces consistency across layers, and mitigates hallucinations caused by representational drift. By fine-tuning only the DG-DPU module, HalluRNN achieves strong and robust performance across multiple benchmarks.

Gianluca Monaci, Philippe Weinzaepfel, Christian Wolf

Main category: cs.CV

TL;DR: End-to-end RL training for image goal navigation can develop relative pose estimation capabilities from navigation rewards alone, though simulator shortcuts affect results.

Details

Motivation: To investigate whether image goal navigation can be efficiently solved with end-to-end RL training without dedicated image-matching or pre-trained vision modules, which would enable training relative pose estimation from navigation rewards alone.

Method: Large experimental study examining architectural choices (late fusion, channel stacking, space-to-depth projections, cross-attention) and their role in emergence of relative pose estimators from navigation training, analyzing simulator settings and transfer to realistic settings.

Result: Recent methods’ success is influenced by simulator shortcuts, but capabilities can transfer to more realistic settings to some extent. Found correlations between navigation performance and emerging relative pose estimation performance.

Conclusion: End-to-end RL training can develop relative pose estimation from navigation rewards, though simulator artifacts affect results; architectural choices impact emergence of this important sub-skill.

Abstract: Image goal navigation requires two different skills: firstly, core navigation skills, including the detection of free space and obstacles, and taking decisions based on an internal representation; and secondly, computing directional information by comparing visual observations to the goal image. Current state-of-the-art methods either rely on dedicated image-matching, or pre-training of computer vision modules on relative pose estimation. In this paper, we study whether this task can be efficiently solved with end-to-end training of full agents with RL, as has been claimed by recent work. A positive answer would have impact beyond Embodied AI and allow training of relative pose estimation from reward for navigation alone. In this large experimental study we investigate the effect of architectural choices like late fusion, channel stacking, space-to-depth projections and cross-attention, and their role in the emergence of relative pose estimators from navigation training. We show that the success of recent methods is influenced up to a certain extent by simulator settings, leading to shortcuts in simulation. However, we also show that these capabilities can be transferred to more realistic setting, up to some extent. We also find evidence for correlations between navigation performance and probed (emerging) relative pose estimation performance, an important sub skill.

[608] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

Weiwei Duan, Luping Ji, Shengjia Chen, Sicheng Zhu, Jianghong Huang, Mao Ye

Main category: cs.CV

TL;DR: A weakly-supervised contrastive learning (WeCoL) scheme for moving infrared small target detection that reduces annotation requirements by using only target quantity prompts instead of manual target-wise annotations.

Details

Motivation: Moving infrared small target detection faces challenges due to tiny target size and weak contrast. Current fully-supervised methods require expensive manual annotations, especially for low-quality infrared images. Weakly-supervised approaches could reduce annotation requirements.

Method: Proposes WeCoL scheme using only target quantity prompts. Based on pretrained SAM, integrates target activation maps and multi-frame energy accumulation. Uses contrastive learning to improve pseudo-label reliability via similarity between positive/negative samples. Includes long-short term motion-aware learning to model local motion patterns and global trajectories.

Result: Extensive experiments on DAUB and ITSDT-15K datasets show the weakly-supervised scheme often outperforms early fully-supervised methods and reaches over 90% of state-of-the-art fully-supervised performance.

Conclusion: The proposed weakly-supervised approach effectively reduces annotation requirements while maintaining competitive performance with fully-supervised methods for infrared small target detection.

Abstract: Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background contrast.Currently, most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies ($e.g.$, weakly supervised) are believed to be potential in reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme, only requires simple target quantity prompts during model training.Specifically, in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy accumulation.Besides, contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature subspace.Moreover, we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small targets.The extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Even, its performance could reach over 90% of state-of-the-art (SOTA) fully-supervised ones.

[609] No time to train! Training-Free Reference-Based Instance Segmentation

Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley

Main category: cs.CV

TL;DR: A training-free method for few-shot object segmentation that uses foundation model semantic priors to find correspondences between reference and target images, enabling automatic instance-level mask generation without manual prompts.

Details

Motivation: While SAM reduces annotation costs through promptable segmentation, it still requires manual visual prompts or complex domain-dependent rules. The paper aims to further reduce this burden by enabling object segmentation using only a small set of reference images instead of manual prompts.

Method: Multi-stage training-free approach: (1) memory bank construction from reference images, (2) representation aggregation, and (3) semantic-aware feature matching using foundation model priors to find correspondences between reference and target images.

Result: State-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50), and outperforms existing training-free approaches on Cross-Domain FSOD benchmark (22.4% nAP).

Conclusion: The method effectively leverages foundation model semantic priors for few-shot segmentation, reducing dependency on manual prompts while achieving strong performance across benchmarks.

Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).

[610] Transferring Visual Explainability of Self-Explaining Models to Prediction-Only Models without Additional Training

Yuya Yoshikawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito

Main category: cs.CV

TL;DR: A method to transfer visual explanation capabilities from self-explaining models to existing prediction-only models using task arithmetic, enabling explanation generation without retraining from scratch.

Details

Motivation: Existing prediction-only models lack explanation capabilities, and training new self-explaining models from scratch is computationally expensive and requires extensive labeling. Users with trained prediction-only models need a way to add explanation capabilities without full retraining.

Method: Proposes a task arithmetic framework to transfer explanation capabilities from self-explaining models in a source domain to prediction-only models in a target domain. Extends Vision Transformer-based architectures to enable explanation transfer without additional training.

Result: Experiments on various image classification datasets show successful transfer of visual explanation capabilities between related domains, with improved explanation quality in target domains without significant classification accuracy degradation.

Conclusion: The proposed method effectively transfers explanation capabilities to existing prediction-only models, providing a cost-effective solution for adding interpretability without full model retraining.

Abstract: In image classification scenarios where both prediction and explanation efficiency are required, self-explaining models that perform both tasks in a single inference are effective. However, for users who already have prediction-only models, training a new self-explaining model from scratch imposes significant costs in terms of both labeling and computation. This study proposes a method to transfer the visual explanation capability of self-explaining models learned in a source domain to prediction-only models in a target domain based on a task arithmetic framework. Our self-explaining model comprises an architecture that extends Vision Transformer-based prediction-only models, enabling the proposed method to endow explanation capability to many trained prediction-only models without additional training. Experiments on various image classification datasets demonstrate that, except for transfers between less-related domains, the transfer of visual explanation capability from source to target domains is successful, and explanation quality in the target domain improves without substantially sacrificing classification accuracy.

[611] OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Shiting Xiao, Rishabh Kabra, Yuhang Li, Donghyun Lee, Joao Carreira, Priyadarshini Panda

Main category: cs.CV

TL;DR: OpenWorldSAM extends SAM2 for open-vocabulary segmentation using vision-language embeddings, supporting diverse language prompts with minimal training while maintaining strong zero-shot generalization.

Details

Motivation: Current segmentation models struggle with open-ended language prompts that require grounding textual semantics into precise spatial masks across diverse and unseen categories. There's a need for flexible, efficient models that can handle various segmentation tasks with strong generalization capabilities.

Method: Extends SAM2 with multi-modal embeddings from a lightweight VLM, freezing pre-trained components and training only 4.5M parameters. Introduces positional tie-breaker embeddings and cross-attention layers for instance awareness, supporting unified prompting with category-level and sentence-level language descriptions.

Result: Achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Demonstrates strong zero-shot generalization on unseen categories and open vocabulary concepts without additional training.

Conclusion: OpenWorldSAM provides an efficient, flexible framework for open-vocabulary segmentation that balances performance with computational efficiency, enabling practical applications requiring diverse language-based segmentation.

Abstract: The ability to segment objects based on open-ended language prompts remains a critical challenge, requiring models to ground textual semantics into precise spatial masks while handling diverse and unseen categories. We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios by integrating multi-modal embeddings extracted from a lightweight vision-language model (VLM). Our approach is guided by four key principles: i) Unified prompting: OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions, providing a flexible interface for various segmentation tasks. ii) Efficiency: By freezing the pre-trained components of SAM2 and the VLM, we train only 4.5 million parameters on the COCO-stuff dataset, achieving remarkable resource efficiency. iii) Instance Awareness: We enhance the model’s spatial understanding through novel positional tie-breaker embeddings and cross-attention layers, enabling effective segmentation of multiple instances. iv) Generalization: OpenWorldSAM exhibits strong zero-shot capabilities, generalizing well on unseen categories and an open vocabulary of concepts without additional training. Extensive experiments demonstrate that OpenWorldSAM achieves state-of-the-art performance in open-vocabulary semantic, instance, and panoptic segmentation across multiple benchmarks. Code is available at https://github.com/GinnyXiao/OpenWorldSAM.

[612] SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs

Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Kun Shao, Zheng Tian, Haifeng Zhang, Jun Wang

Main category: cs.CV

TL;DR: SpatialViz-Bench: A comprehensive multimodal benchmark for evaluating spatial visualization abilities in MLLMs with 1,180 programmatically generated problems across 12 tasks and 4 sub-abilities.

Details

Motivation: Current multimodal benchmarks focus on reasoning about visible visual information but insufficiently evaluate spatial visualization - the ability to infer unseen relationships through mental manipulation of images. Existing benchmarks risk data contamination from publicly sourced problems.

Method: Created SpatialViz-Bench with 1,180 programmatically generated problems across 12 tasks covering 4 spatial visualization sub-abilities. Used a scalable framework for expansion. Evaluated 27 MLLMs and analyzed error types statistically and qualitatively.

Result: Revealed wide performance variations among MLLMs, demonstrated strong discriminative power of the benchmark, and found counter-intuitive results: Chain-of-Thought prompting degrades accuracy on open-source models. State-of-the-art MLLMs show deficiencies in spatial visualization tasks.

Conclusion: SpatialViz-Bench addresses a significant gap in multimodal evaluation by providing a reliable, contamination-free benchmark for spatial visualization abilities. The benchmark reveals current MLLMs’ limitations in spatial reasoning and enables fair, continuous evaluation.

Abstract: Humans can imagine and manipulate visual images mentally, a capability known as spatial visualization. While many multi-modal benchmarks assess reasoning on visible visual information, the ability to infer unseen relationships through spatial visualization remains insufficiently evaluated as a spatial skill. This reliance on publicly sourced problems from IQ tests or math competitions risks data contamination and compromises assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 programmatically generated problems, a scalable framework that allows for expansion to ensure fair and continuously reliable evaluations. Our evaluation of 27 Multi-modal Large Language Models (MLLMs) reveals wide performance variations, demonstrates the benchmark’s strong discriminative power, and uncovers counter-intuitive findings: Chain-of-Thought (CoT) prompting paradoxically degrades accuracy on open-source models. Through statistical and qualitative analysis of error types, SpatialViz-Bench demonstrates that state-of-the-art MLLMs exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark data and evaluation code are publicly available.

[613] CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation

Zhanxin Gao, Beier Zhu, Liang Yao, Jian Yang, Ying Tai

Main category: cs.CV

TL;DR: CoDi is a two-stage diffusion framework for subject-consistent generation that maintains subject identity while enabling diverse poses and layouts through identity transport and refinement techniques.

Details

Motivation: Existing training-free subject-consistent generation methods often sacrifice layout and pose diversity to achieve identity consistency, limiting expressive visual storytelling capabilities.

Method: Two-stage diffusion strategy: 1) Identity Transport in early denoising steps uses optimal transport to transfer identity features to target images in pose-aware manner; 2) Identity Refinement in later steps selects most salient identity features to refine subject details.

Result: Extensive qualitative and quantitative results show CoDi achieves better visual perception and stronger performance across subject consistency, pose diversity, and prompt fidelity metrics compared to existing methods.

Conclusion: CoDi successfully addresses the trade-off between subject consistency and pose diversity in text-to-image generation, enabling more expressive visual storytelling while maintaining subject identity.

Abstract: Subject-consistent generation (SCG)-aiming to maintain a consistent subject identity across diverse scenes-remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address the limitation, we propose subject-Consistent and pose-Diverse T2I framework, dubbed as CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in https://github.com/NJU-PCALab/CoDi.

[614] ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering

Duong T. Tran, Trung-Kien Tran, Manfred Hauswirth, Danh Le Phuoc

Main category: cs.CV

TL;DR: ReasonVQA: A new large-scale VQA dataset with automatically integrated structured encyclopedic knowledge for complex multi-hop reasoning questions.

Details

Motivation: To address the need for more challenging VQA datasets that require complex reasoning with external knowledge, as existing datasets often lack the depth and scale needed for benchmarking advanced reasoning capabilities.

Method: Developed a low-cost framework that automatically integrates structured encyclopedic knowledge to generate complex, multi-hop questions. The dataset construction is scalable with respect to input images.

Result: ReasonVQA poses significant challenges to state-of-the-art VQA models, demonstrating its effectiveness for benchmarking. The dataset surpasses the largest existing knowledge-requiring VQA datasets by more than an order of magnitude in size.

Conclusion: ReasonVQA provides a valuable resource for advancing VQA research, particularly for models requiring complex reasoning with external knowledge, and offers scalability advantages over existing datasets.

Abstract: In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.

[615] A Survey of Token Compression for Efficient Multimodal Large Language Models

Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Main category: cs.CV

TL;DR: Survey paper on multimodal long context token compression techniques for MLLMs, categorizing approaches by modality (image, video, audio) and mechanism to address computational challenges from quadratic attention complexity.

Details

Motivation: Multimodal LLMs face computational bottlenecks due to quadratic complexity of self-attention with numerous input tokens from high-resolution images, long videos, and audio. Token compression is critical to efficiently reduce tokens during training/inference while maintaining model capabilities.

Method: Systematic survey and synthesis categorizing approaches by: (1) modality focus - image-centric (spatial redundancy), video-centric (spatio-temporal redundancy), audio-centric (temporal/spectral redundancy); (2) underlying mechanisms - transformation-based, similarity-based, attention-based, and query-based approaches.

Result: Comprehensive structured overview of current token compression methods for multimodal long contexts, identifying key challenges and providing researchers with modality-specific access to relevant techniques.

Conclusion: Token compression is crucial for scalable MLLMs. The survey consolidates progress, highlights modality-specific strategies, and aims to inspire future research in this rapidly evolving domain to address computational challenges while maintaining model performance.

Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain.

[616] DA-Occ: Direction-Aware 2D Convolution for Efficient and Geometry-Preserving 3D Occupancy Prediction in Autonomous Driving

Yuchen Zhou, Yan Luo, Xiaogang Wang, Xingjian Gu, Mingzhou Lu, Xiangbo Shu

Main category: cs.CV

TL;DR: DA-Occ: A pure 2D framework for efficient 3D occupancy prediction in autonomous driving that preserves geometry through dual depth-height lifting and direction-aware convolution.

Details

Motivation: Existing 3D occupancy prediction methods struggle to balance accuracy and efficiency - high-accuracy approaches are computationally heavy, while efficient BEV methods lose vertical spatial cues and geometric integrity.

Method: Builds on Lift-Splat-Shoot (LSS) paradigm with dual projection: depth-score lifting plus complementary height-score projection to capture vertical geometry. Uses direction-aware convolution to extract features along vertical and horizontal orientations.

Result: Achieves 39.3% mIoU on Occ3D-nuScenes with 27.7 FPS inference speed, and 14.8 FPS on edge devices, demonstrating effective accuracy-efficiency balance.

Conclusion: DA-Occ provides an efficient 2D framework for 3D occupancy prediction that preserves geometric integrity while enabling real-time deployment on resource-constrained devices.

Abstract: Efficient and high-accuracy 3D occupancy prediction is vital for the performance of autonomous driving systems. However, existing methods struggle to balance precision and efficiency: high-accuracy approaches are often hindered by heavy computational overhead, leading to slow inference speeds, while others leverage pure bird’s-eye-view (BEV) representations to gain speed at the cost of losing vertical spatial cues and compromising geometric integrity. To overcome these limitations, we build on the efficient Lift-Splat-Shoot (LSS) paradigm and propose a pure 2D framework, DA-Occ, for 3D occupancy prediction that preserves fine-grained geometry. Standard LSS-based methods lift 2D features into 3D space solely based on depth scores, making it difficult to fully capture vertical structure. To improve upon this, DA-Occ augments depth-based lifting with a complementary height-score projection that explicitly encodes vertical geometric information. We further employ direction-aware convolution to extract geometric features along both vertical and horizontal orientations, effectively balancing accuracy and computational efficiency. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method’s applicability for real-time deployment in resource-constrained environments.

[617] Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis, Christos Diou

Main category: cs.CV

TL;DR: Causal Delta Embeddings represent interventions in latent space to improve OOD robustness for causal representation learning from image pairs.

Details

Motivation: Most causal representation learning focuses on scene variables, but interventions themselves need better representation to improve model generalization and robustness to out-of-distribution data.

Method: Proposes Causal Delta Embeddings that are invariant to visual scenes and sparse in affected causal variables. Learns causal representations from image pairs without additional supervision.

Result: Experiments in Causal Triplet challenge show Causal Delta Embeddings significantly exceed baseline performance in both synthetic and real-world benchmarks for OOD settings.

Conclusion: Focusing on intervention representations through Causal Delta Embeddings is an effective strategy for improving OOD robustness in causal representation learning.

Abstract: Causal representation learning has attracted significant research interest during the past few years, as a means for improving model generalization and robustness. Causal representations of interventional image pairs (also called ``actionable counterfactuals’’ in the literature), have the property that only variables corresponding to scene elements affected by the intervention / action are changed between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out of distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

[618] VQAThinker: Exploring Generalizable and Explainable Video Quality Assessment via Reinforcement Learning

Linhan Cao, Wei Sun, Weixia Zhang, Xiangyang Zhu, Jun Jia, Kaiwei Zhang, Dandan Zhu, Guangtao Zhai, Xiongkuo Min

Main category: cs.CV

TL;DR: VQAThinker: A reasoning-based video quality assessment framework using large multimodal models with reinforcement learning to improve generalization and explainability.

Details

Motivation: Existing VQA models suffer from poor generalization to out-of-distribution videos and limited explainability, restricting real-world applicability.

Method: Uses large multimodal models with reinforcement learning (group relative policy optimization) and three VQA-specific rewards: bell-shaped regression, pairwise ranking, and temporal consistency rewards.

Result: Achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, with superior performance in distortion attribution and quality description tasks.

Conclusion: Reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.

Abstract: Video quality assessment (VQA) aims to objectively quantify perceptual quality degradation in alignment with human visual perception. Despite recent advances, existing VQA models still suffer from two critical limitations: \textit{poor generalization to out-of-distribution (OOD) videos} and \textit{limited explainability}, which restrict their applicability in real-world scenarios. To address these challenges, we propose \textbf{VQAThinker}, a reasoning-based VQA framework that leverages large multimodal models (LMMs) with reinforcement learning to jointly model video quality understanding and scoring, emulating human perceptual decision-making. Specifically, we adopt group relative policy optimization (GRPO), a rule-guided reinforcement learning algorithm that enables reasoning over video quality under score-level supervision, and introduce three VQA-specific rewards: (1) a \textbf{bell-shaped regression reward} that increases rapidly as the prediction error decreases and becomes progressively less sensitive near the ground truth; (2) a \textbf{pairwise ranking reward} that guides the model to correctly determine the relative quality between video pairs; and (3) a \textbf{temporal consistency reward} that encourages the model to prefer temporally coherent videos over their perturbed counterparts. Extensive experiments demonstrate that VQAThinker achieves state-of-the-art performance on both in-domain and OOD VQA benchmarks, showing strong generalization for video quality scoring. Furthermore, evaluations on video quality understanding tasks validate its superiority in distortion attribution and quality description compared to existing explainable VQA models and LMMs. These findings demonstrate that reinforcement learning offers an effective pathway toward building generalizable and explainable VQA models solely with score-level supervision.

[619] Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

Main category: cs.CV

TL;DR: ASCNet uses state space models for remote sensing salient object detection, addressing scale variations and low contrast through multi-scale feature extraction and adaptive local modeling.

Details

Motivation: Salient object detection in optical remote sensing images faces challenges like significant scale variations and low contrast. Existing ViT and CNN methods struggle to effectively integrate global and local features, limiting performance.

Method: Proposes Adaptive State Space Context Network (ASCNet) with visual state space encoder for multi-scale features, Multi-Level Context Module for cross-layer interaction, and APVSS decoder with Dynamic Adaptive Granularity Scan and Granularity-aware Propagation Module for adaptive local modeling.

Result: Extensive experiments show state-of-the-art performance, validating effectiveness and superiority of the proposed model.

Conclusion: ASCNet successfully addresses remote sensing SOD challenges by leveraging state space models to capture long-range dependencies while enhancing local feature representation through adaptive mechanisms.

Abstract: Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds upon the state space model mechanism to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a Multi-Level Context Module (MLCM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the APVSS block as the decoder of ASCNet. This module integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). It performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and enhancing state space model’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

[620] Virtual Community: An Open World for Humans, Robots, and Society

Qinhong Zhou, Hongxin Zhang, Xiangye Lin, Zheyuan Zhang, Yutian Chen, Wenjun Liu, Zunzhe Zhang, Sunli Chen, Lixing Fang, Qiushi Lyu, Xinyu Sun, Jincheng Yang, Zeyuan Wang, Bao Chi Dang, Zhehuan Chen, Daksha Ladia, Jiageng Liu, Chuang Gan

Main category: cs.CV

TL;DR: Virtual Community is an open-world platform for studying human-robot coexistence and embodied social intelligence through physics-based simulation of shared communities.

Details

Motivation: To explore the future of human-robot coexistence in shared communities and enable large-scale study of embodied social intelligence as AI and robotics advance.

Method: Developed an open-source multi-agent physics simulator supporting robots, humans, and their interactions, plus a large-scale community generation pipeline with real-world aligned 3D scenes and grounded agents with rich characteristics.

Result: Created two novel challenges: Community Planning Challenge for multi-agent reasoning/planning, and Community Robot Challenge for heterogeneous robot collaboration in open-world tasks. Evaluated various baselines showing challenges in both high-level planning and low-level cooperation.

Conclusion: Virtual Community provides a platform to unlock further study of human-robot coexistence in open-world environments, addressing both opportunities and challenges of future societal transformation.

Abstract: The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community-an open-world platform for humans, robots, and society-built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to enable the study of embodied social intelligence at scale. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.

[621] Prior-Guided Residual Diffusion: Calibrated and Efficient Medical Image Segmentation

Fuyou Mao, Beining Wu, Yanfeng Jiang, Han Xue, Yan Tang, Hao Zhang

Main category: cs.CV

TL;DR: PGRD is a diffusion-based framework for medical image segmentation that learns full conditional distributions using prior guidance and residual learning for improved calibration and sampling efficiency.

Details

Motivation: Medical image segmentation often involves ambiguity that requires capturing full conditional distributions rather than single point estimates. Existing methods lack proper calibration and sampling efficiency for practical clinical use.

Method: PGRD uses a diffusion-based framework that embeds discrete labels in continuous space, employs a coarse prior predictor for step-wise guidance, learns residuals to the prior to accelerate convergence, and implements deep diffusion supervision for training stability.

Result: PGRD achieves higher Dice scores and lower NLL/ECE values than Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines on MRI and CT datasets, while requiring fewer sampling steps.

Conclusion: PGRD provides an effective diffusion-based approach for probabilistic medical image segmentation with improved calibration and practical sampling efficiency, addressing ambiguity in segmentation tasks.

Abstract: Ambiguity in medical image segmentation calls for models that capture full conditional distributions rather than a single point estimate. We present Prior-Guided Residual Diffusion (PGRD), a diffusion-based framework that learns voxel-wise distributions while maintaining strong calibration and practical sampling efficiency. PGRD embeds discrete labels as one-hot targets in a continuous space to align segmentation with diffusion modeling. A coarse prior predictor provides step-wise guidance; the diffusion network then learns the residual to the prior, accelerating convergence and improving calibration. A deep diffusion supervision scheme further stabilizes training by supervising intermediate time steps. Evaluated on representative MRI and CT datasets, PGRD achieves higher Dice scores and lower NLL/ECE values than Bayesian, ensemble, Probabilistic U-Net, and vanilla diffusion baselines, while requiring fewer sampling steps to reach strong performance.

[622] Data-Driven Loss Functions for Inference-Time Optimization in Text-to-Image

Sapir Esther Yiflach, Yuval Atzmon, Gal Chechik

Main category: cs.CV

TL;DR: Learn-to-Steer improves spatial reasoning in text-to-image diffusion models by training a classifier on cross-attention maps to create learned loss functions for test-time optimization.

Details

Motivation: Current text-to-image diffusion models struggle with spatial reasoning tasks (e.g., placing objects in correct relative positions), and existing solutions use suboptimal handcrafted losses. The authors propose learning spatial objectives directly from the model's internal representations rather than imposing assumptions.

Method: Train a lightweight classifier to decode spatial relationships from diffusion model’s cross-attention maps, then use this classifier as a learned loss function during inference. To prevent linguistic shortcuts, augment training data with prompts containing incorrect relation words to force learning of true spatial patterns from attention maps.

Result: Dramatic improvements in spatial accuracy: from 20% to 61% on FLUX.1-dev and from 7% to 54% on SD2.1 across standard benchmarks. The method also generalizes to multiple relations with significantly improved accuracy.

Conclusion: Learning spatial objectives directly from diffusion model’s internal representations via attention-based classifiers is more effective than handcrafted losses, substantially improving spatial reasoning capabilities in text-to-image generation.

Abstract: Text-to-image diffusion models can generate stunning visuals, yet they often fail at tasks children find trivial–like placing a dog to the right of a teddy bear rather than to the left. When combinations get more unusual–a giraffe above an airplane–these failures become even more pronounced. Existing methods attempt to fix these spatial reasoning failures through model fine-tuning or test-time optimization with handcrafted losses that are suboptimal. Rather than imposing our assumptions about spatial encoding, we propose learning these objectives directly from the model’s internal representations. We introduce Learn-to-Steer, a novel framework that learns data-driven objectives for test-time optimization rather than handcrafting them. Our key insight is to train a lightweight classifier that decodes spatial relationships from the diffusion model’s cross-attention maps, then deploy this classifier as a learned loss function during inference. Training such classifiers poses a surprising challenge: they can take shortcuts by detecting linguistic traces in the cross-attention maps, rather than learning true spatial patterns. We solve this by augmenting our training data with samples generated using prompts with incorrect relation words, which encourages the classifier to avoid linguistic shortcuts and learn spatial patterns from the attention maps. Our method dramatically improves spatial accuracy: from 20% to 61% on FLUX.1-dev and from 7% to 54% on SD2.1 across standard benchmarks. It also generalizes to multiple relations with significantly improved accuracy.

[623] UrbanTwin: Synthetic Roadside LiDAR Datasets

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: UrbanTwin datasets are synthetic replicas of three real roadside lidar datasets, created using realistic digital twins with precise modeling of geometry, road alignment, and traffic patterns for enhanced 3D perception tasks.

Details

Motivation: To create high-fidelity synthetic lidar datasets that can augment or replace real-world datasets for 3D perception tasks, addressing data scarcity and enabling custom scenario testing.

Method: Synthesized datasets using emulated lidar sensors within realistic digital twins, modeling surrounding geometry, road alignment at lane level, lane topology, and vehicle movement patterns from actual locations.

Result: Created 10K annotated frames per dataset with 3D bounding boxes, instance segmentation, tracking IDs, and semantic segmentation. Achieved high similarity scores with real data and improved detection performance when training models solely on synthetic data.

Conclusion: UrbanTwin datasets effectively enhance existing benchmarks by increasing sample size and scene diversity, and can replace in-domain real-world datasets for lidar perception tasks while enabling custom scenario testing.

Abstract: This article presents UrbanTwin datasets, high-fidelity, realistic replicas of three public roadside lidar datasets: LUMPI, V2X-Real-IC, and TUMTraf-I. Each UrbanTwin dataset contains 10K annotated frames corresponding to one of the public datasets. Annotations include 3D bounding boxes, instance segmentation labels, and tracking IDs for six object classes, along with semantic segmentation labels for nine classes. These datasets are synthesized using emulated lidar sensors within realistic digital twins, modeled based on surrounding geometry, road alignment at lane level, and the lane topology and vehicle movement patterns at intersections of the actual locations corresponding to each real dataset. Due to the precise digital twin modeling, the synthetic datasets are well aligned with their real counterparts, offering strong standalone and augmentative value for training deep learning models on tasks such as 3D object detection, tracking, and semantic and instance segmentation. We evaluate the alignment of the synthetic replicas through statistical and structural similarity analysis with real data, and further demonstrate their utility by training 3D object detection models solely on synthetic data and testing them on real, unseen data. The high similarity scores and improved detection performance, compared to models trained on real data, indicate that the UrbanTwin datasets effectively enhance existing benchmark datasets by increasing sample size and scene diversity. In addition, the digital twins can be adapted to test custom scenarios by modifying the design and dynamics of the simulations. To our knowledge, these are the first digitally synthesized datasets that can replace in-domain real-world datasets for lidar perception tasks. UrbanTwin datasets are publicly available at https://dataverse.harvard.edu/dataverse/ucf-ut.

[624] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Peifeng Ma, Xue Yang, Hongsheng Li

Main category: cs.CV

TL;DR: GLEAM introduces a foundational cross-view geo-localization model that unifies multiple views/modalities aligned with satellite imagery, plus GLEAM-X for explainable reasoning using MLLMs.

Details

Motivation: Existing cross-view geo-localization approaches are limited to single views/modalities and lack interpretability - they only determine if images match without explaining why.

Method: GLEAM-C: foundational CVGL model aligning multiple views/modalities exclusively with satellite imagery using optimized implementation and two-phase training. GLEAM-X: novel task combining cross-view correspondence prediction with explainable reasoning using multimodal LLMs, with bilingual benchmark generated by commercial MLLMs and human-refined test set.

Result: GLEAM-C achieves accuracy comparable to prior modality-specific CVGL models with improved training efficiency. GLEAM-X provides systematic evaluation of explainable cross-view reasoning through the constructed benchmark.

Conclusion: The GLEAM framework integrates multi-modal, multi-view alignment with interpretable correspondence analysis, advancing geo-localization by enabling models to both match and explain cross-view correspondences.

Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities by aligning them exclusively with satellite imagery. Our framework improves training efficiency through optimized implementation and achieves accuracy comparable to prior modality-specific CVGL models via a novel two-phase training strategy. To address interpretability, we further propose GLEAM-X, a novel task that combines cross-view correspondence prediction with explainable reasoning enabled by multimodal large language models (MLLMs). We construct a bilingual benchmark using commercial MLLMs to generate training and testing data, and refine the test set through rigorous human revision for systematic evaluation of explainable cross-view reasoning. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.

[625] GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

Main category: cs.CV

TL;DR: GenExam is the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts and fine-grained scoring to evaluate semantic correctness and visual plausibility in image generation.

Details

Motivation: Existing benchmarks focus on understanding/reasoning or basic world knowledge illustration, but neglect rigorous evaluation of drawing exams that require integrated understanding, reasoning, and generation capabilities.

Method: Created GenExam benchmark with 1,000 samples across 10 subjects using exam-style prompts organized under a four-level taxonomy. Each problem has ground-truth images and fine-grained scoring points. Evaluated 17 text-to-image and unified models.

Result: Experiments show GenExam is challenging with a huge performance gap between open-source and leading closed-source models, highlighting the difficulty of integrated understanding, reasoning, and generation in image creation.

Conclusion: GenExam provides a rigorous assessment of models’ ability to integrate understanding, reasoning, and generation, offering insights for developing intelligent generative models through an exam-based evaluation framework.

Abstract: Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models’ ability to integrate understanding, reasoning, and generation, providing insights for on the path to intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.

[626] Hybrid Lie semi-group and cascade structures for the generalized Gaussian derivative model for visual receptive fields

Tony Lindeberg

Main category: cs.CV

TL;DR: The paper presents theoretical relationships between spatial and spatio-temporal receptive field responses across different shape parameters, using infinitesimal and macroscopic cascade smoothing properties.

Details

Motivation: To handle variability in real-world image structures under natural transformations by understanding relationships between receptive field responses across different parameter values in covariant receptive field families.

Method: Derives both infinitesimal relationships (combining semi-group and Lie group concepts) and macroscopic cascade smoothing properties (structurally related to Lie algebras) for spatial and spatio-temporal receptive fields.

Result: Provides theoretical understanding of relationships between receptive field responses across parameter values, enabling more efficient computation schemes and idealized models of biological vision computations.

Conclusion: The derived relationships offer deeper understanding of receptive field computations, useful for designing efficient multi-parameter receptive field systems and modeling biological vision.

Abstract: Because of the variabilities of real-world image structures under the natural image transformations that arise when observing similar objects or spatio-temporal events under different viewing conditions, the receptive field responses computed in the earliest layers of the visual hierarchy may be strongly influenced by such geometric image transformations. One way of handling this variability is by basing the vision system on covariant receptive field families, which expand the receptive field shapes over the degrees of freedom in the image transformations. This paper addresses the problem of deriving relationships between spatial and spatio-temporal receptive field responses obtained for different values of the shape parameters in the resulting multi-parameter families of receptive fields. For this purpose, we derive both (i) infinitesimal relationships, roughly corresponding to a combination of notions from semi-groups and Lie groups, as well as (ii) macroscopic cascade smoothing properties, which describe how receptive field responses at coarser spatial and temporal scales can be computed by applying smaller support incremental filters to the output from corresponding receptive fields at finer spatial and temporal scales, structurally related to the notion of Lie algebras, although with directional preferences. The presented results provide (i) a deeper understanding of the relationships between spatial and spatio-temporal receptive field responses for different values of the filter parameters, which can be used for both (ii) designing more efficient schemes for computing receptive field responses over populations of multi-parameter families of receptive fields, as well as (iii)~formulating idealized theoretical models of the computations of simple cells in biological vision.

[627] Enhanced Detection of Tiny Objects in Aerial Images

Kihyun Kim, Michalis Lazarou, Tania Stathaki

Main category: cs.CV

TL;DR: MoonNet enhances YOLOv8 for tiny object detection in aerial imagery using image resolution adjustment, data augmentation, attention mechanisms (SE and CBAM), and a novel gating function, achieving state-of-the-art performance on tiny-object benchmarks.

Details

Motivation: One-stage detectors like YOLOv8 trade off small object detection performance for speed, which is particularly problematic for aerial imagery where tiny objects have low resolution and cluttered backgrounds.

Method: Four enhancement strategies: 1) input image resolution adjustment, 2) data augmentation, 3) attention mechanisms (SE and CBAM), 4) alternative gating function for attention modules. MoonNet pipeline integrates multiple attention-module-augmented CNNs into YOLOv8 backbone.

Result: MoonNet backbone outperforms original YOLOv8 backbone and single-type attention-module-augmented backbones. Achieves state-of-the-art performance on tiny-object benchmark when integrated with YOLC model.

Conclusion: MoonNet demonstrates adaptability and potential for tiny object detection in aerial imagery through strategic enhancements to YOLOv8 architecture.

Abstract: While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce four enhancement strategies-input image resolution adjustment, data augmentation, attention mechanisms, and an alternative gating function for attention modules-that can be easily implemented on YOLOv8. We demonstrate that image size enlargement and the proper use of augmentation can lead to enhancement. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of multiple attention-module-augmented CNNs. Two well-known attention modules, Squeeze-and-Excitation (SE) Block and Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 to form the MoonNet design, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8 backbone and single-type attention-module-augmented backbones. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our code is available at: https://github.com/Kihyun11/MoonNet

[628] MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev

Main category: cs.CV

TL;DR: MoCrop is a motion-aware adaptive cropping module for efficient video action recognition that uses motion vectors from compressed videos to localize motion-dense regions, reducing computational costs while improving accuracy.

Details

Motivation: Standard video action recognition models process full frames, suffering from spatial redundancy and high computational costs. There's a need for efficient methods that can reduce computation while maintaining or improving accuracy.

Method: MoCrop leverages Motion Vectors (MVs) from H.264 compressed videos to localize motion-dense regions. It uses three components: Merge & Denoise for outlier filtering, Monte Carlo Sampling for efficient importance sampling, and Motion Grid Search for optimal region localization. The module works without training or parameter updates.

Result: On UCF101, MoCrop boosts ResNet-50 accuracy by +3.5% at equivalent FLOPs, or achieves +2.4% accuracy gain with 26.5% fewer FLOPs. Applied to CoViAR, it improves accuracy to 89.2% or reduces computation by ~27%. Consistent gains across MobileNet-V3, EfficientNet-B1, and Swin-B demonstrate strong generality.

Conclusion: MoCrop serves as both an accelerator and enhancer for video action recognition, offering a versatile plug-and-play module suitable for real-time deployment across diverse backbones.

Abstract: Standard video action recognition models often process typically resized full frames, suffering from spatial redundancy and high computational costs. To address this, we introduce MoCrop, a motion-aware adaptive cropping module designed for efficient video action recognition in the compressed domain. Leveraging Motion Vectors (MVs) naturally available in H.264 video, MoCrop localizes motion-dense regions to produce adaptive crops at inference without requiring any training or parameter updates. Our lightweight pipeline synergizes three key components: Merge & Denoise (MD) for outlier filtering, Monte Carlo Sampling (MCS) for efficient importance sampling, and Motion Grid Search (MGS) for optimal region localization. This design allows MoCrop to serve as a versatile “plug-and-play” module for diverse backbones. Extensive experiments on UCF101 demonstrate that MoCrop serves as both an accelerator and an enhancer. With ResNet-50, it achieves a +3.5% boost in Top-1 accuracy at equivalent FLOPs (Attention Setting), or a +2.4% accuracy gain with 26.5% fewer FLOPs (Efficiency Setting). When applied to CoViAR, it improves accuracy to 89.2% or reduces computation by roughly 27% (from 11.6 to 8.5 GFLOPs). Consistent gains across MobileNet-V3, EfficientNet-B1, and Swin-B confirm its strong generality and suitability for real-time deployment. Our code and models are available at https://github.com/microa/MoCrop.

[629] Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation

Xinhao Zhong, Shuoyang Sun, Xulin Gu, Chenyang Zhu, Bin Chen, Yaowei Wang

Main category: cs.CV

TL;DR: RD³ proposes a standardized benchmark for decoupled dataset distillation, revealing that much reported performance variation stems from inconsistent evaluation protocols rather than methodological differences.

Details

Motivation: Existing decoupled dataset distillation methods suffer from inconsistent post-evaluation protocols, making it difficult to determine whether performance differences reflect true methodological advances or evaluation discrepancies.

Method: Proposes Rectified Decoupled Dataset Distillation (RD³) which systematically investigates how different post-evaluation settings affect test accuracy, establishes standardized evaluation protocols, and identifies general strategies to improve distilled dataset effectiveness.

Result: Analysis reveals that much performance variation in existing methods can be attributed to inconsistent evaluation rather than differences in synthetic data quality. Provides foundation for fair comparisons.

Conclusion: Standardized evaluation protocols are crucial for meaningful progress in dataset distillation research. RD³ establishes a benchmark for fair and reproducible comparisons.

Abstract: Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.

[630] FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu

Main category: cs.CV

TL;DR: FAST is a foreground-aware diffusion framework for industrial anomaly segmentation that uses two novel modules (AIAS and FARM) to efficiently synthesize segmentation-oriented anomalies with only 10 sampling steps.

Details

Motivation: Industrial anomaly segmentation requires pixel-level annotations but real-world anomalies are scarce and costly to label. Existing segmentation-oriented industrial anomaly synthesis (SIAS) methods struggle with sampling efficiency vs. generation quality balance and treat all spatial regions uniformly, ignoring statistical differences between anomaly and background areas.

Method: Proposes FAST with two modules: 1) Anomaly-Informed Accelerated Sampling (AIAS) - training-free sampling algorithm using coarse-to-fine aggregation for efficient synthesis; 2) Foreground-Aware Reconstruction Module (FARM) - adaptively adjusts anomaly-aware noise in masked foreground regions during denoising to preserve localized anomaly signals.

Result: Extensive experiments on multiple industrial benchmarks show FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks, achieving state-of-the-art segmentation-oriented anomalies in as few as 10 steps.

Conclusion: FAST provides an effective framework for industrial anomaly synthesis that addresses efficiency and quality trade-offs while enabling controllable, structure-specific anomalies for segmentation tasks.

Abstract: Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.

[631] Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models

Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang

Main category: cs.CV

TL;DR: Geo-CoT framework enables verifiable multi-step reasoning for remote sensing analysis using structured chain-of-thought rationales and two-stage alignment.

Details

Motivation: Current Vision-Language Models in remote sensing fail at complex analytical tasks due to end-to-end training that bypasses reasoning steps, producing unverifiable outputs.

Method: Proposes Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT) framework with two-stage alignment: supervised fine-tuning on Geo-CoT380k dataset followed by Group Reward Policy Optimization to refine reasoning policy.

Result: RSThinker model achieves dominant performance, significantly outperforming state-of-the-art models across comprehensive remote sensing tasks while providing verifiable analytical traces.

Conclusion: Geo-CoT provides a pathway from opaque perception to structured, verifiable reasoning for Earth Observation, with public release of dataset and model.

Abstract: Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model’s reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.

[632] Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models

Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Xuan Wang, Ke Xu

Main category: cs.CV

TL;DR: Proposes VARE and S-VARE frameworks for concept erasure in visual autoregressive (VAR) models, addressing safety concerns in text-to-image generation by minimizing unsafe token adjustments while preserving generation quality.

Details

Motivation: Existing concept erasure techniques designed for diffusion models fail to work with VAR models due to their different next-scale token prediction paradigm, creating a safety gap in autoregressive text-to-image generation that needs to be addressed.

Method: Introduces VARE framework using auxiliary visual tokens to reduce fine-tuning intensity, and S-VARE method with filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, plus preservation loss to maintain semantic fidelity and prevent language drift.

Result: Extensive experiments show the approach achieves surgical concept erasure while preserving generation quality, effectively closing the safety gap in autoregressive text-to-image generation that earlier methods couldn’t address.

Conclusion: The proposed VARE and S-VARE frameworks successfully enable stable concept erasure in VAR models, providing an effective safety solution for autoregressive text-to-image generation while maintaining model performance.

Abstract: The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework VARE that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing the issues such as language drift and reduced diversity introduce by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation by earlier methods.

[633] Learning Unified Representation of 3D Gaussian Splatting

Yuelin Xin, Yuheng Liu, Xiaohui Xie, Xinke Li

Main category: cs.CV

TL;DR: A novel embedding representation for 3D Gaussian Splatting that transforms raw Gaussian parameters into continuous submanifold fields to enable better learning in neural networks.

Details

Motivation: 3D Gaussian Splatting enables efficient 3D reconstruction but its parameter-based representation is hard to learn as features in neural networks due to non-unique and heterogeneous parameterization, leading to data-dependent models.

Method: Proposes an embedding representation based on continuous submanifold fields that encapsulate intrinsic information of Gaussian primitives, preserving color and geometric structure while enforcing unique mapping and channel homogeneity.

Result: Implementation available, creating a more principled approach to represent 3D Gaussian Splatting in neural networks that benefits learning of 3DGS.

Conclusion: The proposed embedding representation addresses fundamental challenges in learning from 3D Gaussian Splatting parameters, enabling better integration with neural network frameworks.

Abstract: A well-designed vectorized representation is crucial for the learning systems natively based on 3D Gaussian Splatting. While 3DGS enables efficient and explicit 3D reconstruction, its parameter-based representation remains hard to learn as features, especially for neural-network-based models. Directly feeding raw Gaussian parameters into learning frameworks fails to address the non-unique and heterogeneous nature of the Gaussian parameterization, yielding highly data-dependent models. This challenge motivates us to explore a more principled approach to represent 3D Gaussian Splatting in neural networks that preserves the underlying color and geometric structure while enforcing unique mapping and channel homogeneity. In this paper, we propose an embedding representation of 3DGS based on continuous submanifold fields that encapsulate the intrinsic information of Gaussian primitives, thereby benefiting the learning of 3DGS. Implementation available at https://github.com/cilix-ai/gs-embedding

[634] Uncertainty Estimation for Pretrained Medical Image Registration Models via Transformation Equivariance

Lin Tian, Xiaoling Hu, Juan Eugenio Iglesias

Main category: cs.CV

TL;DR: Proposes a model-agnostic uncertainty estimation framework for medical image registration that works with any pretrained network by leveraging transformation equivariance properties.

Details

Motivation: Medical image registration networks lack uncertainty estimation, which is crucial for clinical safety. Existing methods require architectural changes or retraining, limiting their applicability to pretrained models.

Method: Uses transformation equivariance property of image registration - applying spatial perturbations to inputs and measuring consistency of outputs to estimate uncertainty without modifying pretrained networks.

Result: Experiments across 3 pretrained models and 4 anatomical structures show uncertainty maps correlate with registration error and highlight unreliable regions.

Conclusion: Provides a practical uncertainty estimation framework that makes pretrained registration networks risk-aware for clinical and research deployment.

Abstract: Accurate image registration is essential in many medical imaging applications, yet most deep registration networks provide little indication of when or where their predictions are unreliable. Existing uncertainty estimation approaches, such as Bayesian methods, ensembles, or MC-dropout, typically require architectural modifications or retraining, precluding their applicability to pretrained registration models. We propose an inference-time, model-agnostic uncertainty estimation framework that applies directly to any pretrained registration network. Our approach is grounded in the transformation equivariance property of image registration, which states that the underlying anatomical mapping should remain consistent under spatial perturbations of the input. Experiments across three pretrained registration models and four anatomical structures show that the resulting uncertainty maps consistently correlate with registration error and highlight unreliably aligned regions. This framework turns pretrained registration networks into risk-aware tools at test time, moving medical image registration closer to safe clinical and large-scale research deployment.

[635] DiffInk: Glyph- and Style-Aware Latent Diffusion Transformer for Text to Online Handwriting Generation

Wei Pan, Huiguo He, Hiuyi Cheng, Yilin Shi, Lianwen Jin

Main category: cs.CV

TL;DR: DiffInk: A latent diffusion Transformer framework for full-line handwriting generation that outperforms existing methods in accuracy and style fidelity.

Details

Motivation: Existing text-to-online handwriting generation methods focus on character- or word-level generation, leading to inefficiency and lack of holistic structural modeling for full text lines.

Method: Proposes DiffInk with two components: 1) InkVAE - sequential variational autoencoder with OCR-based loss for glyph accuracy and style-classification loss for style preservation, 2) InkDiT - latent diffusion Transformer that integrates target text and reference styles to generate pen trajectories.

Result: Outperforms state-of-the-art methods in both glyph accuracy and style fidelity while significantly improving generation efficiency.

Conclusion: DiffInk successfully addresses limitations of existing methods by enabling full-line handwriting generation with better accuracy, style preservation, and efficiency through a novel latent diffusion Transformer framework.

Abstract: Deep generative models have advanced text-to-online handwriting generation (TOHG), which aims to synthesize realistic pen trajectories conditioned on textual input and style references. However, most existing methods still primarily focus on character- or word-level generation, resulting in inefficiency and a lack of holistic structural modeling when applied to full text lines. To address these issues, we propose DiffInk, the first latent diffusion Transformer framework for full-line handwriting generation. We first introduce InkVAE, a novel sequential variational autoencoder enhanced with two complementary latent-space regularization losses: (1) an OCR-based loss enforcing glyph-level accuracy, and (2) a style-classification loss preserving writing style. This dual regularization yields a semantically structured latent space where character content and writer styles are effectively disentangled. We then introduce InkDiT, a novel latent diffusion Transformer that integrates target text and reference styles to generate coherent pen trajectories. Experimental results demonstrate that DiffInk outperforms existing state-of-the-art (SOTA) methods in both glyph accuracy and style fidelity, while significantly improving generation efficiency.

[636] EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling

Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng liu

Main category: cs.CV

TL;DR: EditScore: A specialized reward model series (7B-72B) for instruction-guided image editing that enables effective reinforcement learning by providing high-fidelity reward signals, surpassing proprietary VLMs including GPT-5 on the EditReward-Bench benchmark.

Details

Motivation: Current instruction-guided image editing models struggle with complex instructions and require multiple samples. Reinforcement learning could help but lacks high-fidelity, efficient reward signals. The paper aims to overcome this barrier by developing specialized reward models for image editing.

Method: 1) Introduces EditReward-Bench benchmark for evaluating reward models on editing quality. 2) Develops EditScore series of reward models (7B-72B) through meticulous data curation and filtering. 3) Implements self-ensemble strategy tailored for generative nature of EditScore. 4) Uses the reward model to enable online RL for image editing by providing effective learning signals.

Result: EditScore matches performance of proprietary VLMs, with largest variant surpassing GPT-5 on EditReward-Bench. Enables efficient RL policy optimization where open-source VLMs fail. Applied to OmniGen2 base model results in substantial performance uplift.

Conclusion: High-fidelity, domain-specialized reward models are key to unlocking RL potential in image editing. Provides first systematic path from benchmarking to reward modeling to RL training in this domain.

Abstract: Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of learning proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.

[637] HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong

Main category: cs.CV

TL;DR: HunyuanImage 3.0 is a large multimodal model that unifies image understanding and generation in an autoregressive framework, featuring an 80B parameter MoE architecture with 13B activated per token, achieving state-of-the-art text-image alignment and visual quality.

Details

Motivation: To create a unified multimodal model that combines both understanding and generation capabilities within a single autoregressive framework, addressing the need for more powerful open-source foundation models in the multimodal space.

Method: Uses meticulous data curation, advanced architecture design, native Chain-of-Thoughts schema, progressive pre-training, aggressive post-training, and efficient infrastructure. Implements a Mixture-of-Experts model with over 80B total parameters (13B activated per token).

Result: Achieves state-of-the-art performance in text-image alignment and visual quality, rivaling previous best models. The model is the largest and most powerful open-source image generative model available.

Conclusion: HunyuanImage 3.0 successfully demonstrates unified multimodal understanding and generation capabilities, providing a powerful open-source foundation model to foster innovation in the multimodal ecosystem.

Abstract: We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

[638] Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation

Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen

Main category: cs.CV

TL;DR: Introduces ViPET-ReportGen: a Vietnamese multimodal medical dataset with 2,757 PET/CT volumes and clinical reports, plus a training framework to enhance VLMs for medical imaging in low-resource languages.

Details

Motivation: Existing medical VLMs have limited generalizability due to lack of diverse imaging modalities (especially PET/CT) and underrepresentation of low-resource languages like Vietnamese in medical vision-language research.

Method: Created a novel Vietnamese multimodal medical dataset with PET/CT volumes and clinical reports, introduced a training framework with data augmentation and expert-validated test sets, and benchmarked state-of-the-art VLMs.

Result: Incorporating the dataset significantly improves VLM performance on downstream tasks, demonstrating its value for advancing medical imaging VLMs for low-resource languages.

Conclusion: The dataset and benchmark represent a pivotal step toward more robust VLMs for medical imaging, particularly for Vietnamese healthcare and other low-resource language contexts.

Abstract: Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs’ learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.

[639] Stable Signer: Hierarchical Sign Language Generative Model

Sen Fang, Yalin Feng, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: Stable Signer is a new end-to-end sign language production model that simplifies traditional multi-stage pipelines by focusing only on text understanding and pose-to-video generation, achieving 48.6% improvement over SOTA methods.

Details

Motivation: Traditional sign language production systems suffer from error accumulation across multiple stages (Text2Gloss, Gloss2Pose, Pose2Vid), leading to inaccurate text conversion, pose generation, and video rendering. The field has seen slow progress due to these cascading errors.

Method: Proposes Stable Signer, which redefines SLP as a hierarchical end-to-end task with only two components: text understanding (via Sign Language Understanding Linker - SLUL) and Pose2Vid generation. SLUL uses Semantic-Aware Gloss Masking Loss for training, and the system includes an SLP-MoE hand gesture rendering expert block for generating high-quality, multi-style sign language videos.

Result: The model achieves 48.6% performance improvement compared to current state-of-the-art generation methods in sign language production.

Conclusion: By streamlining the traditional redundant structure and optimizing task objectives, Stable Signer demonstrates significant improvements in sign language video generation quality and accuracy through its end-to-end hierarchical approach.

Abstract: Sign Language Production (SLP) is the process of converting the complex input text into a real video. Most previous works focused on the Text2Gloss, Gloss2Pose, Pose2Vid stages, and some concentrated on Prompt2Gloss and Text2Avatar stages. However, this field has made slow progress due to the inaccuracy of text conversion, pose generation, and the rendering of poses into real human videos in these stages, resulting in gradually accumulating errors. Therefore, in this paper, we streamline the traditional redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It redefines the SLP task as a hierarchical generation end-to-end task that only includes text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid, and executes text understanding through our proposed new Sign Language Understanding Linker called SLUL, and generates hand gestures through the named SLP-MoE hand gesture rendering expert block to end-to-end generate high-quality and multi-style sign language videos. SLUL is trained using the newly developed Semantic-Aware Gloss Masking Loss (SAGM Loss). Its performance has improved by 48.6% compared to the current SOTA generation methods.

[640] GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh, Arshia Soltani Moakhar, Basim Azam, Soheil Feizi, Naveed Akhtar

Main category: cs.CV

TL;DR: GHOST is an automated method to generate images that induce hallucinations in multimodal LLMs by optimizing in image embedding space to create subtle misleading cues while keeping target objects absent.

Details

Motivation: Current static benchmarks for studying object hallucination in MLLMs are limited because they use fixed visual scenarios, preventing discovery of model-specific or unanticipated hallucination vulnerabilities. There's a need for active stress-testing methods.

Method: GHOST operates by optimizing in the image embedding space to mislead MLLMs while keeping target objects absent, then guiding a diffusion model conditioned on the embedding to generate natural-looking images with subtle misleading cues.

Result: Achieves 28% hallucination success rate (vs 1% in prior methods), generates high-quality object-free images confirmed by metrics/human evaluation, uncovers transferable vulnerabilities (66.5% transfer rate from Qwen2.5-VL to GPT-4o), and fine-tuning on GHOST images mitigates hallucination.

Conclusion: GHOST serves as both a diagnostic tool for uncovering hallucination vulnerabilities and a corrective tool for building more reliable multimodal systems through adversarial testing and fine-tuning.

Abstract: Object hallucination in Multimodal Large Language Models (MLLMs) is a persistent failure mode that causes the model to perceive objects absent in the image. This weakness of MLLMs is currently studied using static benchmarks with fixed visual scenarios, which preempts the possibility of uncovering model-specific or unanticipated hallucination vulnerabilities. We introduce GHOST (Generating Hallucinations via Optimizing Stealth Tokens), a method designed to stress-test MLLMs by actively generating images that induce hallucination. GHOST is fully automatic and requires no human supervision or prior knowledge. It operates by optimizing in the image embedding space to mislead the model while keeping the target object absent, and then guiding a diffusion model conditioned on the embedding to generate natural-looking images. The resulting images remain visually natural and close to the original input, yet introduce subtle misleading cues that cause the model to hallucinate. We evaluate our method across a range of models, including reasoning models like GLM-4.1V-Thinking, and achieve a hallucination success rate exceeding 28%, compared to around 1% in prior data-driven discovery methods. We confirm that the generated images are both high-quality and object-free through quantitative metrics and human evaluation. Also, GHOST uncovers transferable vulnerabilities: images optimized for Qwen2.5-VL induce hallucinations in GPT-4o at a 66.5% rate. Finally, we show that fine-tuning on our images mitigates hallucination, positioning GHOST as both a diagnostic and corrective tool for building more reliable multimodal systems.

Teng Zhang, Ziqian Fan, Mingxin Liu, Xin Zhang, Xudong Lu, Wentong Li, Yue Zhou, Yi Yu, Xiang Li, Junchi Yan, Xue Yang

Main category: cs.CV

TL;DR: Point2RBox-v3 improves weakly-supervised oriented object detection using progressive label assignment and prior-guided dynamic mask loss to address inefficient pseudo label utilization and quality issues.

Details

Motivation: Existing point-supervised methods for oriented object detection suffer from inefficient utilization and poor quality of pseudo labels, which limits their effectiveness in real-world scenarios with varying object sizes and densities.

Method: Two key innovations: 1) Progressive Label Assignment (PLA) that dynamically estimates instance sizes at different training stages to enable label assignment methods, and 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss) that combines Voronoi Watershed Loss with SAM model advantages to handle both sparse and dense scenes effectively.

Result: Achieves competitive performance across multiple datasets: 66.09% on DOTA-v1.0, 56.86% on DOTA-v1.5, 41.28% on DOTA-v2.0, 46.40% on DIOR, 19.60% on STAR, and 45.96% on RSAR, especially excelling in scenarios with large object size variations or sparse occurrences.

Conclusion: Point2RBox-v3 successfully addresses limitations of existing point-supervised methods through dynamic pseudo label assignment and complementary watershed-SAM integration, achieving state-of-the-art performance in weakly-supervised oriented object detection.

Abstract: Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM’s poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.

[642] Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu, Kam Woh Ng, Wonbong Jang, Jiadong Guo, Junlin Han, Haozhe Liu, Yiannis Douratsos, Juan C. Pérez, Zijian Zhou, Chi Phung, Tao Xiang, Juan-Manuel Pérez-Rúa

Main category: cs.CV

TL;DR: Kaleido is a generative model for photorealistic neural rendering that treats 3D as a specialized sub-domain of video, using sequence-to-sequence image synthesis without explicit 3D representations.

Details

Motivation: To create a unified framework for object- and scene-level neural rendering that can leverage large-scale video data for pre-training, reducing reliance on scarce camera-labeled 3D datasets while improving spatial consistency.

Method: Uses a masked autoregressive framework with decoder-only rectified flow transformer that treats 3D rendering as sequence-to-sequence image synthesis, enabling generation of any number of 6-DoF target views from any number of reference views.

Result: Sets new state-of-the-art on view synthesis benchmarks; zero-shot performance substantially outperforms other generative methods in few-view settings and matches per-scene optimization methods in many-view settings.

Conclusion: Kaleido demonstrates that treating 3D as a specialized video domain enables unified 3D and video modeling, leveraging video data for better 3D understanding without architectural modifications.

Abstract: We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido operates on the principle that 3D can be regarded as a specialised sub-domain of video, expressed purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets – all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings.

[643] Object-Centric Representation Learning for Enhanced 3D Scene Graph Prediction

KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: A 3D semantic scene graph prediction method that improves object feature representation through contrastive pretraining and better integration of geometric and semantic features, achieving state-of-the-art performance on 3DSSG dataset.

Details

Motivation: Previous 3D semantic scene graph prediction methods have limitations in optimizing object and relationship feature representations, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. The quality of object features is critical for overall scene graph accuracy.

Method: Designs a highly discriminative object feature encoder with contrastive pretraining strategy that decouples object representation learning from scene graph prediction. Effectively combines both geometric and semantic features for relationship prediction, and integrates the pretrained encoder into existing frameworks.

Result: Significantly outperforms previous state-of-the-art methods on the 3DSSG dataset across all evaluation metrics. The pretrained encoder provides substantial performance improvements when plugged into existing frameworks.

Conclusion: The approach demonstrates that improving object feature quality through contrastive pretraining and better integration of geometric and semantic features leads to superior 3D semantic scene graph prediction performance.

Abstract: 3D Semantic Scene Graph Prediction aims to detect objects and their semantic relationships in 3D scenes, and has emerged as a crucial technology for robotics and AR/VR applications. While previous research has addressed dataset limitations and explored various approaches including Open-Vocabulary settings, they frequently fail to optimize the representational capacity of object and relationship features, showing excessive reliance on Graph Neural Networks despite insufficient discriminative capability. In this work, we demonstrate through extensive analysis that the quality of object features plays a critical role in determining overall scene graph accuracy. To address this challenge, we design a highly discriminative object feature encoder and employ a contrastive pretraining strategy that decouples object representation learning from the scene graph prediction. This design not only enhances object classification accuracy but also yields direct improvements in relationship prediction. Notably, when plugging in our pretrained encoder into existing frameworks, we observe substantial performance improvements across all evaluation metrics. Additionally, whereas existing approaches have not fully exploited the integration of relationship information, we effectively combine both geometric and semantic features to achieve superior relationship prediction. Comprehensive experiments on the 3DSSG dataset demonstrate that our approach significantly outperforms previous state-of-the-art methods. Our code is publicly available at https://github.com/VisualScienceLab-KHU/OCRL-3DSSG-Codes.

[644] HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Junwen Chen, Peilin Xiong, Keiji Yanai

Main category: cs.CV

TL;DR: HOI-R1 uses reinforcement learning to train multimodal LLMs for human-object interaction detection without additional detection modules, achieving significant accuracy improvements on HICO-DET benchmark.

Details

Motivation: Current HOID methods rely heavily on complex vision-language model integration and object detectors, making frameworks cumbersome. The inherent reasoning abilities of MLLMs for HOID are under-explored, and RL training methods for MLLMs show promise for this task.

Method: Proposes HOI-R1 framework that uses reinforcement learning to train MLLMs for HOID without detection modules. Introduces HOI reasoning process and HOID reward functions to solve the task purely through text reasoning.

Result: Experiments on HICO-DET show consistent improvements across multiple MLLMs (Qwen-VL family, Rex-Omni). HOI-R1 boosts Qwen2.5-VL-3B accuracy by 2x with strong generalization ability.

Conclusion: Demonstrates that MLLMs can effectively perform human-object interaction detection through RL training without complex detection modules, offering a simpler yet powerful approach to HOID.

Abstract: Recent human-object interaction detection (HOID) methods highly require prior knowledge from vision-language models (VLMs) to enhance the interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of multimodal large language models (MLLMs) on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and first explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task by pure text. Experiments on HICO-DET across multiple open-source MLLMs, including the Qwen-VL family (Qwen2.5-VL and Qwen3-VL) and Rex-Omni, show consistent improvements. Especially, HOI-R1 boosts Qwen2.5-VL-3B 2$\times$ accuracy with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.

[645] iPEAR: Iterative Pyramid Estimation with Attention and Residuals for Deformable Medical Image Registration

Heming Wu, Di Wang, Tai Ma, Peng Zhao, Yubin Xiao, Zhongke Wu, Xing-Ce Wang, Xuan Wu, You Zhou

Main category: cs.CV

TL;DR: iPEAR is a medical image registration network that uses a Fused Attention-Residual Module to prevent anatomical misalignment accumulation and a Threshold-Controlled Iterative strategy to adaptively determine optimization iterations.

Details

Motivation: Existing pyramid registration networks suffer from accumulated anatomical misalignments and lack adaptive mechanisms to determine optimal iteration counts for varying deformation requirements across different images.

Method: Proposes iPEAR with: 1) Fused Attention-Residual Module (FARM) combining attention and residual pathways to alleviate misalignment accumulation; 2) dual-stage Threshold-Controlled Iterative (TCI) strategy that adaptively determines optimization iterations by evaluating registration stability and convergence.

Result: Outperforms state-of-the-art registration networks on three brain MRI datasets and one abdomen CT dataset in terms of accuracy, while maintaining comparable inference speed and model parameter size.

Conclusion: iPEAR effectively addresses anatomical misalignment accumulation and adaptive iteration control in medical image registration, validated through generalization and ablation studies.

Abstract: Existing pyramid registration networks may accumulate anatomical misalignments and lack an effective mechanism to dynamically determine the number of optimization iterations under varying deformation requirements across images, leading to degraded performance. To solve these limitations, we propose iPEAR. Specifically, iPEAR adopts our proposed Fused Attention-Residual Module (FARM) for decoding, which comprises an attention pathway and a residual pathway to alleviate the accumulation of anatomical misalignment. We further propose a dual-stage Threshold-Controlled Iterative (TCI) strategy that adaptively determines the number of optimization iterations for varying images by evaluating registration stability and convergence. Extensive experiments on three public brain MRI datasets and one public abdomen CT dataset show that iPEAR outperforms state-of-the-art (SOTA) registration networks in terms of accuracy, while achieving on-par inference speed and model parameter size. Generalization and ablation studies further validate the effectiveness of the proposed FARM and TCI.

[646] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang

Main category: cs.CV

TL;DR: Mask-GRPO introduces RL-based optimization for masked generative models in text-to-image generation, improving performance over existing approaches.

Details

Motivation: Most RL approaches for text-to-image generation focus on diffusion or autoregressive models, overlooking masked generative models which represent an important alternative paradigm.

Method: Proposes Mask-GRPO which incorporates Group Relative Policy Optimization (GRPO)-based RL into masked generative models by redefining transition probability and formulating unmasking as multi-step decision-making. Includes strategies like removing KL constraint, applying reduction strategy, and filtering low-quality samples.

Result: Improves base model Show-o with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches.

Conclusion: Demonstrates the effectiveness of RL for masked generative models in text-to-image generation, providing a new direction beyond diffusion and autoregressive models.

Abstract: Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on https://github.com/xingzhejun/Mask-GRPO

[647] UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

Tianshuo Xu, Kai Wang, Zhifei Chen, Leyi Wu, Tianshui Wen, Fei Chao, Ying-Cong Chen

Main category: cs.CV

TL;DR: UniCalli is a unified diffusion framework for column-level Chinese calligraphy recognition and generation that jointly trains both tasks to improve character structure preservation and layout aesthetics.

Details

Motivation: Existing methods for computational replication of Chinese calligraphy either create high-quality isolated characters but ignore page-level aesthetics (ligatures, spacing), or attempt page synthesis at the expense of calligraphic correctness.

Method: UniCalli uses a unified diffusion framework that jointly trains recognition and generation tasks. It employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data from a curated dataset of over 8,000 digitized pieces.

Result: The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition performance. It successfully extends to other ancient scripts including Oracle bone inscriptions and Egyptian hieroglyphs.

Conclusion: Joint training of recognition and generation tasks creates a synergistic effect where recognition constrains character structure preservation while generation provides style and layout priors, leading to improved performance in both tasks, especially in limited-data regimes.

Abstract: Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce \textbf{UniCalli}, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed in \href{https://github.com/EnVision-Research/UniCalli}{this URL}.

[648] DOS: Directional Object Separation in Text Embeddings for Multi-Object Image Generation

Dongnam Byun, Jungwon Park, Jungmin Ko, Changin Choi, Wonjong Rhee

Main category: cs.CV

TL;DR: DOS improves multi-object image generation by modifying CLIP text embeddings to address object neglect and mixing issues in text-to-image models.

Details

Motivation: Text-to-image models struggle with prompts containing multiple objects, often resulting in object neglect or object mixing. The paper identifies four problematic scenarios where inter-object relationships cause these failures: Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects.

Method: DOS (Directional Object Separation) modifies three types of CLIP text embeddings before passing them into text-to-image models. The method is motivated by two key observations about CLIP embeddings and aims to better separate object representations in the embedding space.

Result: DOS consistently improves success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks.

Conclusion: DOS is a practical and effective solution for improving multi-object image generation in text-to-image models by addressing fundamental issues with object separation in CLIP embeddings.

Abstract: Recent progress in text-to-image (T2I) generative models has led to significant improvements in generating high-quality images aligned with text prompts. However, these models still struggle with prompts involving multiple objects, often resulting in object neglect or object mixing. Through extensive studies, we identify four problematic scenarios, Similar Shapes, Similar Textures, Dissimilar Background Biases, and Many Objects, where inter-object relationships frequently lead to such failures. Motivated by two key observations about CLIP embeddings, we propose DOS (Directional Object Separation), a method that modifies three types of CLIP text embeddings before passing them into text-to-image models. Experimental results show that DOS consistently improves the success rate of multi-object image generation and reduces object mixing. In human evaluations, DOS significantly outperforms four competing methods, receiving 26.24%-43.04% more votes across four benchmarks. These results highlight DOS as a practical and effective solution for improving multi-object image generation.

[649] UrbanIng-V2X: A Large-Scale Multi-Vehicle, Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception

Karthikeyan Chandra Sekaran, Markus Geisler, Dominik Rößle, Adithya Mohan, Daniel Cremers, Wolfgang Utschick, Michael Botsch, Werner Huber, Torsten Schön

Main category: cs.CV

TL;DR: UrbanIng-V2X is a large-scale multimodal dataset for cooperative perception with vehicle-to-vehicle and vehicle-to-infrastructure interactions across three urban intersections, featuring synchronized camera and LiDAR data from multiple vehicles and infrastructure sensors.

Details

Motivation: Existing cooperative perception datasets are limited to single intersections or single vehicles, causing overfitting and misleading performance. There's a need for comprehensive datasets with multiple connected vehicles and infrastructure sensors across diverse intersections to enable robust benchmarking.

Method: Collected data from three urban intersections in Ingolstadt, Germany, with 34 temporally aligned and spatially calibrated sensor sequences (20 seconds each). Involves 2 vehicles and up to 3 infrastructure sensor poles per sequence, using 12 vehicle RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs.

Result: Created UrbanIng-V2X dataset with approximately 712k annotated instances at 10 Hz frequency across 13 object classes. Provides comprehensive evaluations using state-of-the-art cooperative perception methods and releases codebase, dataset, HD map, and digital twin of the data collection environment.

Conclusion: UrbanIng-V2X addresses the gap in comprehensive cooperative perception datasets and enables benchmarking in diverse traffic environments, helping prevent overfitting and providing realistic evaluation scenarios for V2X perception algorithms.

Abstract: Recent cooperative perception datasets have played a crucial role in advancing smart mobility applications by enabling information exchange between intelligent agents, helping to overcome challenges such as occlusions and improving overall scene understanding. While some existing real-world datasets incorporate both vehicle-to-vehicle and vehicle-to-infrastructure interactions, they are typically limited to a single intersection or a single vehicle. A comprehensive perception dataset featuring multiple connected vehicles and infrastructure sensors across several intersections remains unavailable, limiting the benchmarking of algorithms in diverse traffic environments. Consequently, overfitting can occur, and models may demonstrate misleadingly high performance due to similar intersection layouts and traffic participant behavior. To address this gap, we introduce UrbanIng-V2X, the first large-scale, multi-modal dataset supporting cooperative perception involving vehicles and infrastructure sensors deployed across three urban intersections in Ingolstadt, Germany. UrbanIng-V2X consists of 34 temporally aligned and spatially calibrated sensor sequences, each lasting 20 seconds. All sequences contain recordings from one of three intersections, involving two vehicles and up to three infrastructure-mounted sensor poles operating in coordinated scenarios. In total, UrbanIng-V2X provides data from 12 vehicle-mounted RGB cameras, 2 vehicle LiDARs, 17 infrastructure thermal cameras, and 12 infrastructure LiDARs. All sequences are annotated at a frequency of 10 Hz with 3D bounding boxes spanning 13 object classes, resulting in approximately 712k annotated instances across the dataset. We provide comprehensive evaluations using state-of-the-art cooperative perception methods and publicly release the codebase, dataset, HD map, and a digital twin of the complete data collection environment.

[650] GenTrack2: An Improved Hybrid Approach for Visual Multi-Object Tracking

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

Main category: cs.CV

TL;DR: A visual multi-object tracking method combining stochastic particle filtering with deterministic association for consistent identity tracking under nonlinear dynamics and varying target counts.

Details

Motivation: To address challenges in visual multi-object tracking including identifier consistency for unknown and time-varying target numbers, nonlinear dynamics, non-Gaussian noise, and handling weak tracks during interactions and occlusions.

Method: Combines stochastic particle filter with PSO optimization for nonlinear dynamics, plus deterministic association using cost matrix with spatial consistency, detection confidences, and track penalties. Includes velocity regression for trend-seed velocities and smooth state updating scheme.

Result: Superior performance compared to state-of-the-art trackers, with flexible operation for both pre-recorded videos and camera live streams.

Conclusion: The proposed hybrid stochastic-deterministic approach effectively maintains identifier consistency and handles complex tracking scenarios with varying target numbers and nonlinear dynamics.

Abstract: This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

[651] A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Zheng Wang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

Main category: cs.CV

TL;DR: A comprehensive survey paper on Efficient Vision-Language-Action models (Efficient VLAs) that reviews techniques across the model-training-data pipeline to address computational and data efficiency challenges in embodied AI.

Details

Motivation: Vision-Language-Action models face prohibitive computational and data demands despite their potential in embodied intelligence. The field lacks a unified framework to consolidate recent advancements in efficiency improvements.

Method: Introduces a unified taxonomy organizing techniques into three pillars: Efficient Model Design (architectures and compression), Efficient Training (reducing computational burdens), and Efficient Data Collection (robotic data acquisition).

Result: Provides the first comprehensive review of Efficient VLAs, establishing a foundational reference with taxonomy, summarizing state-of-the-art methods, applications, challenges, and future research directions.

Conclusion: The survey bridges the gap in VLA efficiency research by offering a systematic framework and roadmap for future work, with a maintained project page for ongoing updates.

Abstract: Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Despite their remarkable performance, foundational VLAs are hindered by the prohibitive computational and data demands inherent to their large-scale architectures. While a surge of recent research has focused on enhancing VLA efficiency, the field lacks a unified framework to consolidate these disparate advancements. To bridge this gap, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire model-training-data pipeline. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/.

[652] T-MLA: A targeted multiscale log-exponential attack framework for neural image compression

Nikolay I. Kalmykov, Razan Dibo, Kaiyu Shen, Xu Zhonghan, Anh-Huy Phan, Yipeng Liu, Ivan Oseledets

Main category: cs.CV

TL;DR: T-MLA is a targeted multiscale log-exponential attack framework for neural image compression that introduces adversarial perturbations in the wavelet domain to degrade reconstruction quality while maintaining visual imperceptibility.

Details

Motivation: While neural image compression (NIC) has achieved state-of-the-art rate-distortion performance, its security vulnerabilities remain poorly understood compared to classifiers. Existing attacks on NICs are often naive adaptations of pixel-space methods that overlook the structured nature of compression pipelines.

Method: T-MLA introduces adversarial perturbations in the wavelet domain, concentrating on less perceptually salient coefficients to improve attack stealth. It uses a multiscale log-exponential attack framework that specifically targets the unique characteristics of neural compression pipelines.

Result: Extensive evaluation across multiple state-of-the-art NIC architectures shows T-MLA achieves targeted degradation of reconstruction quality while improving perturbation imperceptibility (higher PSNR/VIF of perturbed inputs) compared to PGD-style baselines at comparable attack success rates.

Conclusion: The work reveals a critical security flaw at the core of generative and content delivery pipelines, demonstrating that NIC systems are vulnerable to sophisticated attacks that exploit their unique compression structure.

Abstract: Neural image compression (NIC) has become the state-of-the-art for rate-distortion performance, yet its security vulnerabilities remain significantly less understood than those of classifiers. Existing adversarial attacks on NICs are often naive adaptations of pixel-space methods, overlooking the unique, structured nature of the compression pipeline. In this work, we propose a more advanced class of vulnerabilities by introducing T-MLA, the first targeted multiscale log-exponential attack framework. We introduce adversarial perturbations in the wavelet domain that concentrate on less perceptually salient coefficients, improving the stealth of the attack. Extensive evaluation across multiple state-of-the-art NIC architectures on standard image compression benchmarks reveals a large drop in reconstruction quality while the perturbations remain visually imperceptible. On standard NIC benchmarks, T-MLA achieves targeted degradation of reconstruction quality while improving perturbation imperceptibility (higher PSNR/VIF of the perturbed inputs) compared to PGD-style baselines at comparable attack success, as summarized in our main results. Our findings reveal a critical security flaw at the core of generative and content delivery pipelines.

[653] SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art

Sagi Eppel, Alona Strugatski

Main category: cs.CV

TL;DR: SciTextures is a large-scale dataset of scientific textures and patterns with corresponding generative models and code, enabling evaluation of VLMs’ ability to connect visual patterns with their underlying generative mechanisms.

Details

Motivation: To enable deeper visual understanding by connecting visual patterns with the processes that form them, and to systematically evaluate vision language models' ability to link patterns to their underlying mechanisms.

Method: Created SciTextures dataset with 1,270+ models and 100,000+ images across scientific domains using an agentic AI pipeline that autonomously collects, implements, and standardizes models, and can invent novel pattern generation methods.

Result: Enables systematic VLM evaluation on linking patterns to generative code, identifying patterns from same processes, and inferring/recreating mechanisms from real-world images to generate simulated counterparts.

Conclusion: The dataset reveals VLMs can understand and simulate physical systems beyond visual patterns at multiple abstraction levels, providing a benchmark for connecting visual understanding with generative mechanisms.

Abstract: The ability to connect visual patterns with the processes that form them represents one of the deepest forms of visual understanding. Textures of clouds and waves, the growth of cities and forests, or the formation of materials and landscapes are all examples of patterns emerging from underlying mechanisms. We present the SciTextures dataset, a large-scale collection of textures and visual patterns from all domains of science, tech, and art, along with the models and code that generate these images. Covering over 1,270 different models and 100,000 images of patterns and textures from physics, chemistry, biology, sociology, technology, mathematics, and art, this dataset offers a way to explore the deep connection between the visual patterns that shape our world and the mechanisms that produce them. Built through an agentic AI pipeline that autonomously collects, implements, and standardizes scientific and generative models. This AI pipeline is also used to autonomously invent and implement novel methods for generating visual patterns and textures. SciTextures enables systematic evaluation of vision language models (VLM’s) ability to link visual patterns to the models and code that generate them, and to identify different patterns that emerge from the same underlying process. We also test VLMs ability to infer and recreate the mechanisms behind visual patterns by providing a natural image of a real-world phenomenon and asking the AI to identify and code a model of the process that formed it, then run this code to generate a simulated image that is compared to the reference image. These benchmarks reveal that VLM’s can understand and simulate physical systems beyond visual patterns at multiple levels of abstraction. The dataset and code are available at: https://zenodo.org/records/17485502

[654] Beyond Cosine Similarity: Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao, Dongxu Wu, Zhenshan Shi, Sijie Mai, Hanwei Zhu, Lingyu Zhu, Yuncheng Jiang, Baoliang Chen

Main category: cs.CV

TL;DR: Adaptive fusion framework for NR-IQA using CLIP that combines cosine similarity with magnitude-aware quality cues via Box-Cox transformation and confidence-guided fusion.

Details

Motivation: Existing CLIP-based NR-IQA methods rely only on semantic similarity (cosine similarity between image embedding and textual prompts like "good/bad photo"), overlooking the magnitude of CLIP image features which empirically shows strong correlation with perceptual quality.

Method: 1) Extract absolute CLIP image features and apply Box-Cox transformation to normalize feature distribution and reduce semantic sensitivity; 2) Use resulting scalar as semantically-normalized auxiliary cue; 3) Design confidence-guided fusion scheme that adaptively weights both cosine similarity and magnitude cues based on their relative strength.

Result: Extensive experiments on multiple benchmark IQA datasets show the method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.

Conclusion: The magnitude of CLIP image features provides valuable quality cues that complement semantic similarity, and adaptive fusion of both signals leads to superior NR-IQA performance without requiring specialized training.

Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.

[655] Semantic Leakage from Image Embeddings

Yiyi Chen, Qiongkai Xu, Desmond Elliott, Qiongxiu Li, Johannes Bjerva

Main category: cs.CV

TL;DR: SLImE framework demonstrates that compressed image embeddings can leak semantic information through preserved neighborhood structures, challenging assumptions about their privacy safety.

Details

Motivation: To challenge the common assumption that image embeddings pose limited privacy risk by showing that semantic information can be recovered from compressed embeddings through preserved local semantic neighborhoods.

Method: Proposes SLImE (Semantic Leakage from Image Embeddings), a lightweight inference framework that uses locally trained semantic retrievers with off-the-shelf models to recover semantic information from compressed embeddings without task-specific decoders.

Result: Demonstrates consistent recovery of semantic information across diverse embedding models (GEMINI, COHERE, NOMIC, CLIP) and inference tasks, validating semantic leakage through aligned embeddings to retrieved tags and coherent descriptions.

Conclusion: Reveals fundamental vulnerability in image embeddings where preservation of semantic neighborhoods enables semantic leakage, highlighting significant challenges for privacy preservation in multimodal systems.

Abstract: Image embeddings are generally assumed to pose limited privacy risk. We challenge this assumption by formalizing semantic leakage as the ability to recover semantic structures from compressed image embeddings. Surprisingly, we show that semantic leakage does not require exact reconstruction of the original image. Preserving local semantic neighborhoods under embedding alignment is sufficient to expose the intrinsic vulnerability of image embeddings. Crucially, this preserved neighborhood structure allows semantic information to propagate through a sequence of lossy mappings. Based on this conjecture, we propose Semantic Leakage from Image Embeddings (SLImE), a lightweight inference framework that reveals semantic information from standalone compressed image embeddings, incorporating a locally trained semantic retriever with off-the-shelf models, without training task-specific decoders. We thoroughly validate each step of the framework empirically, from aligned embeddings to retrieved tags, symbolic representations, and grammatical and coherent descriptions. We evaluate SLImE across a range of open and closed embedding models, including GEMINI, COHERE, NOMIC, and CLIP, and demonstrate consistent recovery of semantic information across diverse inference tasks. Our results reveal a fundamental vulnerability in image embeddings, whereby the preservation of semantic neighborhoods under alignment enables semantic leakage, highlighting challenges for privacy preservation.1

[656] Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

Main category: cs.CV

TL;DR: A Learnable Total Variation (LTV) framework that combines unrolled TV solver with LambdaNet to predict per-pixel regularization maps for adaptive CT denoising, outperforming classical TV and CNN methods.

Details

Motivation: Traditional Total Variation (TV) regularization has limitations due to its scalar regularization parameter, which lacks spatial adaptivity. The authors aim to create a more adaptive denoising approach that maintains TV's interpretability while overcoming its limitations.

Method: Proposes a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a LambdaNet neural network. LambdaNet predicts a per-pixel regularization map, allowing spatially adaptive smoothing. The framework is trained end-to-end to jointly optimize reconstruction and regularization parameters.

Result: Experiments on DeepLesion dataset using realistic LoDoPaB-CT simulation show consistent improvements over classical TV and FBP+U-Net, achieving up to +3.7 dB PSNR and 8% relative SSIM improvement.

Conclusion: LTV provides an interpretable alternative to black-box CNNs for low-dose CT denoising, combining the benefits of traditional TV regularization with learned spatial adaptivity through neural network prediction of regularization parameters.

Abstract: While Total Variation (TV) excels in noise reduction and edge preservation, its reliance on a scalar regularization parameter limits adaptivity. In this study, we present a Learnable Total Variation (LTV) framework coupling an unrolled TV solver with a LambdaNet that predicts a per-pixel regularization map. The proposed framework is trained end-to-end to optimize reconstruction and regularization jointly, yielding spatially adaptive smoothing. Experiments on the DeepLesion dataset, using realistic LoDoPaB-CT simulation, show consistent gains over classical TV and FBP+U-Net, achieving up to +3.7 dB PSNR and 8% relative SSIM improvement. LTV provides an interpretable alternative to black-box CNNs for low-dose CT denoising.

[657] NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion

Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang

Main category: cs.CV

TL;DR: NP-LoRA is a projection-based framework that addresses interference in LoRA fusion by separating content and style subspaces through null-space projection, enabling better control over subject fidelity and style preservation without retraining.

Details

Motivation: Existing LoRA fusion methods suffer from interference between independently trained content and style LoRAs due to overlapping, non-orthogonal low-rank subspaces, degrading generation fidelity.

Method: Reformulates LoRA fusion as a null-space projection problem, extracts principal style directions via SVD, projects subject LoRA into orthogonal complement of style subspace, and introduces soft projection for continuous control.

Result: NP-LoRA consistently outperforms strong baselines and generalizes well across pretrained LoRA pairs without retraining, achieving better subject fidelity and style preservation.

Conclusion: The geometric interference problem in LoRA fusion can be effectively addressed through subspace separation via null-space projection, enabling more controllable and faithful composition of learned representations.

Abstract: Low-Rank Adaptation (LoRA) fusion enables the composition of learned subject and style representations for controllable generation without retraining. However, existing methods rely on weight-based merging within a shared adaptation space, where independently trained LoRAs interfere and degrade fidelity. We show that this interference is fundamentally geometric: content and style LoRAs occupy overlapping, non-orthogonal low-rank subspaces, making weight-based fusion inherently flawed. Analyzing LoRA internal structure, we find that generative behavior is dominated by a few principal directions that must be preserved during fusion. Based on this insight, we reformulate LoRA fusion as a null-space projection problem and propose Null Space Projection LoRA (NP-LoRA), a projection-based framework that enforces subspace separation by construction. NP-LoRA extracts principal style directions via singular value decomposition (SVD) and projects the subject LoRA into the orthogonal complement of the style subspace, preventing interference. We further introduce a soft projection mechanism that provides continuous control over the trade-off between subject fidelity and style preservation. Experiments show that NP-LoRA consistently outperforms strong baselines and generalizes well across pretrained LoRA pairs without retraining.

[658] A Space-Time Transformer for Precipitation Nowcasting

Levi Harris, Tianlong Chen

Main category: cs.CV

TL;DR: SaTformer: A video transformer using full space-time attention for extreme precipitation forecasting from satellite radiances, winning NeurIPS Weather4Cast 2025 challenge.

Details

Motivation: Traditional numerical weather prediction models are computationally expensive and degrade at nowcasting timescales (0-4 hours), while AI-weather prediction alternatives using video-understanding architectures remain underexplored for precipitation forecasting.

Method: Proposes SaTformer, a video transformer built on full space-time attention that forecasts extreme precipitation from satellite radiances. Reformulates precipitation regression as classification problem and employs class-weighted loss to address long-tailed dataset imbalances.

Result: Achieved first place on the NeurIPS Weather4Cast 2025 “Cumulative Rainfall” challenge, demonstrating superior performance in extreme precipitation forecasting.

Conclusion: SaTformer successfully applies video-understanding architectures to weather forecasting, addressing computational limitations of traditional models and improving nowcasting capabilities for extreme precipitation events.

Abstract: Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose \textbf{SaTformer}: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored \textbf{first place} on the NeurIPS Weather4Cast 2025 ``Cumulative Rainfall’’ challenge. Code and model weights are available: \texttt{\href{github.com/leharris3/w4c-25}{github.com/leharris3/satformer}}

[659] Robust Low-Rank Sparse Framework for Video-Based Affective Computing

Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Xinyu Li, Xin Yan, Ziyu Jia, Xiaokang Zhou

Main category: cs.CV

TL;DR: LSEF is a hierarchical low-rank sparse framework for video-based affective computing that disentangles emotional bases (long-term tone) from transient fluctuations (short-term changes) to improve model stability and representation.

Details

Motivation: Video-based Affective Computing suffers from model instability and representational degradation due to complex emotional dynamics. Current approaches lack hierarchical mechanisms to disentangle distinct affective components - emotional bases (long-term tone) and transient fluctuations (short-term changes).

Method: Proposes LSEF (Low-Rank Sparse Emotion Understanding Framework) based on Low-Rank Sparse Principle. Uses three modules: Stability Encoding Module (SEM) captures low-rank emotional bases; Dynamic Decoupling Module (DDM) isolates sparse transient signals; Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. Optimized with Rank Aware Optimization (RAO) strategy.

Result: Extensive experiments across multiple datasets confirm LSEF significantly enhances robustness and dynamic discrimination, validating effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

Conclusion: The hierarchical low-rank sparse framework effectively addresses instability and representational degradation in video-based affective computing by disentangling emotional components at different temporal scales.

Abstract: Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

[660] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation

Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang

Main category: cs.CV

TL;DR: ImAgent is a training-free multimodal agent that integrates reasoning, generation, and self-evaluation in a single framework to improve text-to-image generation consistency and reduce randomness, particularly for vague prompts.

Details

Motivation: Current text-to-image models suffer from randomness and inconsistency with vague prompts. Existing solutions like prompt rewriting and self-refinement require additional modules and increase computational overhead, hindering test-time scaling efficiency.

Method: ImAgent is a training-free unified multimodal agent with a policy controller that enables multiple generation actions to dynamically interact and self-organize. It integrates reasoning, generation, and self-evaluation within a single framework without external models.

Result: Extensive experiments on image generation and editing tasks show ImAgent consistently improves over backbone models and surpasses other baselines where backbone models fail, demonstrating better image fidelity and semantic alignment.

Conclusion: ImAgent highlights the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling, offering a training-free solution that integrates multiple capabilities in one framework.

Abstract: Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.

[661] Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Fahad Rahman, Mohd Ariful Haque, Sunzida Siddique

Main category: cs.CV

TL;DR: PCMDE: A physics-constrained multimodal evaluation metric combining LLMs, knowledge mapping, and VLMs to assess semantic/structural accuracy beyond traditional metrics like BLEU/CLIPScore.

Details

Motivation: Current multimodal evaluation metrics (BLEU, CIDEr, VQA score, SigLIP-2, CLIPScore) fail to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios, necessitating a more sophisticated evaluation approach.

Method: Three-stage architecture: (1) multimodal feature extraction using object detection and VLMs for spatial/semantic information; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; (3) physics-guided reasoning with LLMs to enforce structural/relational constraints (alignment, position, consistency).

Result: Proposes PCMDE metric that integrates reasoning, knowledge-based mapping, and vision-language models to overcome limitations of traditional multimodal evaluation metrics.

Conclusion: PCMDE provides a more comprehensive evaluation framework for multimodal systems by incorporating physics-constrained reasoning and structural validation beyond surface-level similarity metrics.

Abstract: Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.

[662] GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI

Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe-Moreno, Alexandre Lacoste

Main category: cs.CV

TL;DR: GEO-Bench-2 introduces a comprehensive evaluation framework for Geospatial Foundation Models with standardized protocols across 19 datasets and multiple task types, revealing no single model dominates all tasks.

Details

Motivation: Current evaluation of Geospatial Foundation Models lacks standardized protocols, making it difficult to compare models and understand their capabilities across different Earth Observation tasks.

Method: Created GEO-Bench-2 with 19 permissively-licensed datasets spanning classification, segmentation, regression, object detection, and instance segmentation. Introduced “capability” groups to rank models based on shared characteristics like resolution, bands, and temporality.

Result: No single model dominates across all tasks. Natural image pretrained models (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, while EO-specific models (TerraMind, Prithvi, Clay) outperform on multispectral applications like agriculture and disaster response.

Conclusion: Optimal model choice depends on task requirements, data modalities, and constraints. The goal of a single GeoFM that performs well across all tasks remains open. GEO-Bench-2 enables reproducible, informed evaluation tailored to specific use cases.

Abstract: Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce ‘‘capability’’ groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.

[663] Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation

Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine

Main category: cs.CV

TL;DR: CC-DiceCE loss improves small lesion detection in medical image segmentation compared to traditional Dice loss and blob loss, with better recall but dataset-dependent precision trade-offs.

Details

Motivation: Traditional segmentation losses like Dice under-segment small lesions because their small relative volume contributes negligibly to overall loss, leading to poor detection of small lesions in medical imaging.

Method: Introduces CC-DiceCE loss based on CC-Metrics framework, benchmarked against blob loss and DiceCE baseline within nnU-Net framework for standardized evaluation across multiple datasets.

Result: CC-DiceCE loss increases lesion detection (recall) with minimal degradation in segmentation performance, though with dataset-dependent precision trade-offs. It generally outperforms blob loss in multi-dataset evaluation.

Conclusion: CC-DiceCE is an effective loss function for improving small lesion detection in medical image segmentation, offering better performance than existing instance-wise approaches like blob loss.

Abstract: Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, though with dataset-dependent trade-offs in precision. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.

[664] Zero-Shot Video Deraining with Video Diffusion Models

Tuomas Varanka, Juan Luis Gonzalez, Hyeongwoo Kim, Pablo Garrido, Xu Yao

Main category: cs.CV

TL;DR: Zero-shot video deraining method using pretrained text-to-video diffusion models with negative prompting and attention switching, no synthetic data or fine-tuning needed

Details

Motivation: Existing video deraining methods have limitations: synthetic training data doesn't generalize to real rain, static camera datasets don't handle dynamic scenes, and diffusion model fine-tuning weakens generative priors and generalization

Method: Inverts input video into diffusion model’s latent space, uses negative prompting to push reconstruction away from rain concept, and employs attention switching mechanism to maintain dynamic backgrounds and structural consistency

Result: Extensive experiments on real-world rain datasets show substantial improvements over prior methods and robust generalization without supervised training

Conclusion: First zero-shot video deraining method for complex dynamic scenes that leverages pretrained diffusion models’ generalization capabilities without synthetic data or fine-tuning

Abstract: Existing video deraining methods are often trained on paired datasets, either synthetic, which limits their ability to generalize to real-world rain, or captured by static cameras, which restricts their effectiveness in dynamic scenes with background and camera motion. Furthermore, recent works in fine-tuning diffusion models have shown promising results, but the fine-tuning tends to weaken the generative prior, limiting generalization to unseen cases. In this paper, we introduce the first zero-shot video deraining method for complex dynamic scenes that does not require synthetic data nor model fine-tuning, by leveraging a pretrained text-to-video diffusion model that demonstrates strong generalization capabilities. By inverting an input video into the latent space of diffusion models, its reconstruction process can be intervened and pushed away from the model’s concept of rain using negative prompting. At the core of our approach is an attention switching mechanism that we found is crucial for maintaining dynamic backgrounds as well as structural consistency between the input and the derained video, mitigating artifacts introduced by naive negative prompting. Our approach is validated through extensive experiments on real-world rain datasets, demonstrating substantial improvements over prior methods and showcasing robust generalization without the need for supervised training.

Hamza Tahboub, Weiyan Shi, Gang Hua, Huaizu Jiang

Main category: cs.CV

TL;DR: SocialFusion addresses negative transfer in VLMs for social perception tasks by connecting frozen visual encoder to language model, achieving positive transfer across five social tasks.

Details

Motivation: Current vision-language models struggle with multiple social perception tasks due to "social degradation" - general pre-training impairs visual encoder's ability to represent nuanced social information, causing negative transfer.

Method: Proposes SocialFusion framework that learns minimal connection between frozen visual encoder and language model. Investigates social degradation through linear representation probing (decodability) and gradient conflict analysis (compatibility).

Result: SocialFusion exhibits positive transfer across all five social tasks, leveraging synergies between tasks to enhance overall performance. Achieves comparable performance to task-specific state-of-the-art models on various benchmarks.

Conclusion: Current VLM pre-training strategies may be detrimental to acquiring general social competence, highlighting need for more socially-aware training paradigms. SocialFusion demonstrates effective approach to unified social perception.

Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term “social degradation,” whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder’s ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.

[666] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Kehan Qi, Saumya Gupta, Xiaoling Hu, Qingqiao Hu, Weimin Lyu, Chao Chen

Main category: cs.CV

TL;DR: FLAT improves MRI reconstruction by aligning unrolled networks with Flow Matching theory for stable cascade trajectories and better convergence.

Details

Motivation: Unrolled networks for MRI reconstruction suffer from unstable output quality across cascades, leading to sub-optimal final results. The authors aim to address this inherent limitation by connecting unrolled networks to the Flow Matching paradigm.

Method: First prove theoretically that unrolled networks are discretizations of conditional probability flows, showing their analogy to Flow Matching in MRI reconstruction. Then propose FLow-Aligned Training (FLAT) which: (1) derives important cascade parameters from Flow Matching discretization, and (2) aligns intermediate reconstructions with the ideal Flow Matching trajectory to improve cascade iteration stability and convergence.

Result: Experiments on three MRI datasets show that FLAT results in a stable trajectory across sub-networks, improving the quality of the final reconstruction.

Conclusion: The connection between unrolled networks and Flow Matching provides a theoretical foundation for improving MRI reconstruction stability, and FLAT effectively addresses the cascade instability problem in unrolled networks.

Abstract: Unrolled networks have been widely used for Magnetic Resonance Imaging (MRI) reconstruction due to their efficiency. However, they typically exhibit unstable output quality across cascades, resulting in sub-optimal final reconstruction results. In this work, we address this inherent limitation of unrolled networks, drawing inspiration from recent Flow Matching paradigm. We first theoretically prove that unrolled networks are discretizations of conditional probability flows. This connection shows that unrolled networks and Flow Matching are analogous in MRI reconstruction. Building upon this insight, we propose FLow-Aligned Training (FLAT), which (1) derives important cascade parameters from the Flow Matching discretization; and (2) aligns intermediate reconstructions with the ideal Flow Matching trajectory to improve cascade iteration stability and convergence. Experiments on three MRI datasets show that FLAT results in a stable trajectory across sub-networks, improving the quality of the final reconstruction.

Tasmiah Haque, Srinjoy Das

Main category: cs.CV

TL;DR: Proposes GRU-SNF, an inference-time refinement technique combining GRU-Normalizing Flows with MCMC sampling to improve diversity in video motion transfer predictions without sacrificing accuracy.

Details

Motivation: Real-time video motion transfer applications (gaming, anomaly detection) require accurate yet diverse future predictions for realistic synthesis and robust decision-making under uncertainty. Current deterministic transformation structures in GRU-NF limit expressivity and diversity.

Method: Introduces GRU-Stochastic Normalizing Flows (GRU-SNF) by adding Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, inspired by Stochastic Normalizing Flows. This enables exploration of richer output space without retraining, using stochastic sampling to capture multimodal distributions.

Result: GRU-SNF outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. Better captures multimodal behavior in keypoint-based video motion transfer pipeline.

Conclusion: Integrating stochastic dynamics with flow-based sequence models improves generative time series forecasting for video motion transfer, enabling more realistic and diverse predictions.

Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit- Stochastic Normalizing Flows (GRU-SNF) outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting. The code is available at: https://github.com/Tasmiah1408028/Inference-Time-Stochastic-Refinement-Of-GRU-NF-For-Real-Time-Video-Motion-Transfer

[668] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Changjin Kim, HyeokJun Lee, YoungJoon Yoo

Main category: cs.CV

TL;DR: GuidNoise: A diffusion-based method for generalized noise synthesis using only a single noisy/clean image pair as guidance, enabling realistic noise generation without camera metadata or extensive paired data.

Details

Motivation: Existing generative models for real noise synthesis require camera metadata and extensive noisy-clean image pairs, which are costly to acquire and show limited generalization between different settings.

Method: Proposes GuidNoise with guidance-aware affine feature modification (GAFM) and noise-aware refine loss to leverage diffusion models’ potential. Uses a single noisy/clean pair as guidance to generate synthetic noisy images without additional metadata.

Result: GuidNoise synthesizes high-quality noisy images under diverse noise environments and enables efficient generation of noisy-clean pairs for data augmentation, significantly improving denoising performance especially with lightweight models and limited training data.

Conclusion: GuidNoise provides an effective solution for generalized noise synthesis with minimal prerequisites, making synthetic noise readily applicable for training data augmentation and improving practical denoising performance.

Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate the prerequisites, we propose a Single-Pair Guided Diffusion for generalized noise synthesis GuidNoise, which uses a single noisy/clean pair as the guidance, often easily obtained by itself within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model’s backward process, making the model more adept at generating realistic noise distributions. The GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at https://github.com/chjinny/GuidNoise.

[669] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang

Main category: cs.CV

TL;DR: SCAIL is a framework for studio-grade character animation using in-context learning with novel 3D pose representation and full-context pose injection in diffusion-transformer architecture.

Details

Motivation: Existing character animation approaches often fail to preserve structural fidelity and temporal consistency in complex motion scenarios and cross-identity animations, making it challenging to achieve studio-grade production standards.

Method: Proposes SCAIL framework with two key innovations: 1) novel 3D pose representation for robust motion signals, and 2) full-context pose injection mechanism within diffusion-transformer architecture for effective spatio-temporal reasoning over full motion sequences.

Result: SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism, as demonstrated through comprehensive benchmarking.

Conclusion: SCAIL addresses key challenges in character animation and moves the field closer to production-ready studio-grade quality through innovative pose representation and architectural design.

Abstract: Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present \textbf{SCAIL} (a framework toward \textbf{S}tudio-grade \textbf{C}haracter \textbf{A}nimation via \textbf{I}n-context \textbf{L}earning), a framework designed to address these challenges from two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that \textbf{SCAIL} achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

[670] Towards Sustainable Universal Deepfake Detection with Frequency-Domain Masking

Chandler Timm C. Doloriel, Habib Ullah, Kristian Hovde Liland, Fadi Al Machot, Ngai-Man Cheung

Main category: cs.CV

TL;DR: Frequency-domain masking improves universal deepfake detection by enhancing generalization to unseen generative models while maintaining efficiency through model pruning.

Details

Motivation: Need for universal deepfake detection that generalizes to unseen generative models while minimizing computational overhead for large-scale screening in the Green AI era.

Method: Introduces frequency-domain masking as a training strategy, using random masking and geometric transformations with focus on frequency masking for superior generalization properties.

Result: Achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets, maintains performance under significant model pruning, and offers scalable resource-conscious solution.

Conclusion: Frequency-based masking is a practical step toward sustainable and generalizable deepfake detection with strong generalization capabilities and computational efficiency.

Abstract: Universal deepfake detection aims to identify AI-generated images across a broad range of generative models, including unseen ones. This requires robust generalization to new and unseen deepfakes, which emerge frequently, while minimizing computational overhead to enable large-scale deepfake screening, a critical objective in the era of Green AI. In this work, we explore frequency-domain masking as a training strategy for deepfake detectors. Unlike traditional methods that rely heavily on spatial features or large-scale pretrained models, our approach introduces random masking and geometric transformations, with a focus on frequency masking due to its superior generalization properties. We demonstrate that frequency masking not only enhances detection accuracy across diverse generators but also maintains performance under significant model pruning, offering a scalable and resource-conscious solution. Our method achieves state-of-the-art generalization on GAN- and diffusion-generated image datasets and exhibits consistent robustness under structured pruning. These results highlight the potential of frequency-based masking as a practical step toward sustainable and generalizable deepfake detection. Code and models are available at https://github.com/chandlerbing65nm/FakeImageDetection.

[671] VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Yejin Bang, Allen Bolourchi, Yann LeCun, Pascale Fung

Main category: cs.CV

TL;DR: VL-JEPA is a vision-language model using joint embedding predictive architecture that predicts continuous text embeddings instead of autoregressive token generation, achieving better performance with fewer parameters and supporting multiple tasks.

Details

Motivation: Traditional vision-language models use autoregressive token generation which can be computationally expensive and focus on surface-level linguistic patterns rather than semantic understanding. The authors aim to create a more efficient model that learns in abstract representation space.

Method: VL-JEPA uses a Joint Embedding Predictive Architecture to predict continuous embeddings of target texts rather than generating tokens autoregressively. It learns in an abstract representation space and uses a lightweight text decoder only when needed for text generation.

Result: VL-JEPA achieves stronger performance than standard token-space VLM training with 50% fewer trainable parameters. It supports selective decoding (2.85x reduction in decoding operations), open-vocabulary classification, text-to-video retrieval, and discriminative VQA. Outperforms CLIP, SigLIP2, and Perception Encoder on video tasks, and matches classical VLMs on VQA datasets with only 1.6B parameters.

Conclusion: VL-JEPA demonstrates that predicting continuous embeddings in joint representation space is more efficient and effective than autoregressive token generation for vision-language tasks, supporting multiple modalities and applications with reduced computational cost.

Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, the VL-JEPA’s embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves comparable performance as classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.

[672] Robust MLLM Unlearning via Visual Knowledge Distillation

Yuhang Wang, Zhenxing Niu, Haoxuan Ji, Guangyu He, Haichang Gao, Gang Hua

Main category: cs.CV

TL;DR: A novel machine unlearning approach for Multimodal Large Language Models (MLLMs) that selectively erases target visual knowledge while preserving textual knowledge through Visual Knowledge Distillation.

Details

Motivation: Most existing machine unlearning methods are designed for LLMs, while MLLM-oriented unlearning remains underdeveloped. The paper aims to address the need for selectively removing sensitive visual information from MLLMs while maintaining their textual capabilities.

Method: Proposes disentangling visual and textual knowledge in MLLMs and introduces Visual Knowledge Distraction (VKD) scheme that uses intermediate visual representations as supervision signals instead of output-level supervision. Only fine-tunes visual components for efficiency.

Result: Extensive experiments show the approach outperforms state-of-the-art unlearning methods in both effectiveness and efficiency. The paper also pioneers evaluation of MLLM unlearning robustness against relearning attacks.

Conclusion: The proposed method provides an effective and efficient solution for MLLM unlearning by leveraging internal visual representations, enabling selective erasure of visual knowledge while preserving textual capabilities.

Abstract: Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.

[673] SCR2-ST: Combine Single Cell with Spatial Transcriptomics for Efficient Active Sampling via Reinforcement Learning

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Siqi Lu, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yalin Zheng, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

Main category: cs.CV

TL;DR: SCR2-ST is a framework that uses single-cell prior knowledge to guide efficient spatial transcriptomics data acquisition and accurate expression prediction through reinforcement learning-based active sampling and hybrid regression-retrieval networks.

Details

Motivation: Spatial transcriptomics (ST) data is expensive to acquire, and traditional fixed-grid sampling leads to redundant measurements of similar regions, resulting in scarce data. Single-cell sequencing data could provide rich auxiliary biological information to mitigate these limitations.

Method: SCR2-ST integrates: 1) Single-cell guided reinforcement learning (SCRL) for active sampling that combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals for selective acquisition; 2) SCR2Net, a hybrid regression-retrieval prediction network with majority cell-type filtering to suppress noisy matches and use retrieved expression profiles as soft labels for auxiliary supervision.

Result: Evaluated on three public ST datasets, SCR2-ST demonstrates state-of-the-art performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios.

Conclusion: SCR2-ST effectively leverages single-cell prior knowledge to address the challenges of expensive ST data acquisition and scarce measurements, providing an efficient framework for spatial transcriptomics analysis.

Abstract: Spatial transcriptomics (ST) is an emerging technology that enables researchers to investigate the molecular relationships underlying tissue morphology. However, acquiring ST data remains prohibitively expensive, and traditional fixed-grid sampling strategies lead to redundant measurements of morphologically similar or biologically uninformative regions, thus resulting in scarce data that constrain current methods. The well-established single-cell sequencing field, however, could provide rich biological data as an effective auxiliary source to mitigate this limitation. To bridge these gaps, we introduce SCR2-ST, a unified framework that leverages single-cell prior knowledge to guide efficient data acquisition and accurate expression prediction. SCR2-ST integrates a single-cell guided reinforcement learning-based (SCRL) active sampling and a hybrid regression-retrieval prediction network SCR2Net. SCRL combines single-cell foundation model embeddings with spatial density information to construct biologically grounded reward signals, enabling selective acquisition of informative tissue regions under constrained sequencing budgets. SCR2Net then leverages the actively sampled data through a hybrid architecture combining regression-based modeling with retrieval-augmented inference, where a majority cell-type filtering mechanism suppresses noisy matches and retrieved expression profiles serve as soft labels for auxiliary supervision. We evaluated SCR2-ST on three public ST datasets, demonstrating SOTA performance in both sampling efficiency and prediction accuracy, particularly under low-budget scenarios. Code is publicly available at: https://github.com/hrlblab/SCR2ST

Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc, Huynh Trung Kiet

Main category: cs.CV

TL;DR: A multimodal pipeline that enhances image captions by retrieving similar images, extracting contextual information from related articles, and integrating this context with base captions using a fine-tuned Qwen3 model.

Details

Motivation: Real-world image captions often lack contextual depth, omitting crucial details like event background, temporal cues, outcomes, and named entities that aren't visually discernible, limiting effectiveness in domains like journalism, education, and digital archives.

Method: Multimodal pipeline that retrieves semantically similar images using BEIT-3 and SigLIP, reranks them using ORB and SIFT for geometric alignment, extracts contextual information from related articles via semantic search, and integrates this context with base captions using a fine-tuned Qwen3 model with QLoRA.

Result: Evaluated on OpenEvents v1 dataset, the approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.

Conclusion: The proposed multimodal pipeline successfully addresses the gap in contextual depth in image captions by augmenting visual input with external textual knowledge, producing event-enriched, context-aware descriptions for real-world applications.

Abstract: Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding

Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang

Main category: cs.CV

TL;DR: A benchmark for evaluating social reasoning in video generation, revealing current models’ limitations in understanding social cognition despite good visual quality.

Details

Motivation: Current text-to-video models generate visually realistic videos but lack social reasoning capabilities - they can't infer intentions, beliefs, emotions, or social norms like humans do. There's a need to systematically evaluate this gap in social cognition.

Method: Created a benchmark based on 30 classic social cognition paradigms organized into 7 dimensions. Developed a training-free agent-based pipeline that distills reasoning mechanisms, synthesizes video scenarios, enforces conceptual neutrality, and evaluates videos using a VLM judge across 5 social reasoning dimensions.

Result: Large-scale study across 7 state-of-the-art video generation systems shows substantial performance gaps: models excel in surface-level plausibility but systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

Conclusion: Current video generation models lack fundamental social reasoning capabilities despite visual realism, highlighting a critical gap that needs to be addressed for more socially coherent video generation.

Abstract: Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

[676] Resolving compositional and conformational heterogeneity in cryo-EM with deformable 3D Gaussian representations

Bintao He, Yiran Cheng, Hongjia Li, Xiang Gao, Xin Gao, Fa Zhang, Renmin Han

Main category: cs.CV

TL;DR: GaussianEM: A Gaussian-based framework for analyzing cryo-EM data that simultaneously resolves compositional and conformational heterogeneity using a dual-encoder-single-decoder architecture to model protein dynamics.

Details

Motivation: Understanding protein flexibility and dynamic interactions is crucial for studying protein function. While cryo-EM enables observation of macromolecular dynamics, computational analysis of datasets mixing continuous and discrete structural states remains challenging.

Method: GaussianEM uses a Gaussian-based pseudo-atomic framework with dual-encoder-single-decoder architecture to decompose cryo-EM images into learnable Gaussian components. It encodes variability through modulated parameters, modeling displacements in Gaussian space to capture atomic-scale conformational landscapes.

Result: The method successfully reconstructs complex compositional and conformational variability, resolves previously unobserved details in public datasets, and captures broader conformational diversity without sacrificing structural fidelity.

Conclusion: GaussianEM provides a continuous, intuitive representation of conformational dynamics that preserves local structural integrity, bridging density maps and all-atom models for cryo-EM analysis.

Abstract: Understanding protein flexibility and its dynamic interactions with other molecules is essential for studying protein function. Although cryogenic electron microscopy(cryo-EM) provides an opportunity to observe macromolecular dynamics directly, computational analysis of datasets mixing continuous and discrete structural states remains a formidable challenge. Here we introduce GaussianEM, a Gaussian-based pseudo-atomic framework that simultaneously resolves compositional and conformational heterogeneity from cryo-EM images. GaussianEM employs a dual-encoder-single-decoder architecture to decompose images into learnable Gaussian components, with variability encoded through modulated parameters. This explicit parameterization yields a continuous, intuitive representation of conformational dynamics that inherently preserves local structural integrity. By modeling displacements in Gaussian space, we capture atomic-scale conformational landscapes, bridging density maps and all-atom models. In comprehensive experiments, GaussianEM successfully reconstructs complex compositional and conformational variability,and resolves previously unobserved details in public datasets. Quantitative evaluations further confirm its ability to capture broader conformational diversity without sacrificing structural fidelity.

[677] DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

Bohong Chen, Haiyang Liu

Main category: cs.CV

TL;DR: DyStream: A flow matching-based autoregressive model for real-time dyadic talking head video generation from speaker and listener audio with ultra-low latency (<100ms).

Details

Motivation: Existing chunk-based methods for talking head video generation require full non-causal context windows, introducing significant delays that prevent immediate non-verbal feedback needed for realistic listener responses in dyadic conversations.

Method: Two key designs: (1) stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) causal encoder enhanced by a lookahead module to incorporate short future context (~60ms) to improve quality while maintaining low latency.

Result: Generates video within 34ms per frame, keeping entire system latency under 100ms. Achieves state-of-the-art lip-sync quality with offline/online LipSync Confidence scores of 8.13 and 7.61 on HDTF dataset.

Conclusion: DyStream enables real-time dyadic talking head video generation with ultra-low latency, significantly outperforming alternative causal strategies and providing immediate non-verbal feedback for realistic conversational interactions.

Abstract: Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that could generate video in real-time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) We propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple-and-effective method significantly surpass alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights and codes are available.

[678] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou

Main category: cs.CV

TL;DR: PhyGDPO: A physics-aware video generation framework using vision-language models for data collection and groupwise preference optimization to improve physical consistency in text-to-video generation.

Details

Motivation: Current text-to-video generation methods produce visually appealing results but often fail to follow physical laws, struggling with complex physics interactions due to limited training data and implicit physical reasoning approaches.

Method: 1) PhyAugPipe: Uses vision-language models with chain-of-thought reasoning to collect PhyVidGen-135K dataset; 2) PhyGDPO: Physics-aware Groupwise Direct Preference Optimization with Physics-Guided Rewarding scheme using VLM-based physics rewards; 3) LoRA-Switch Reference for efficient training.

Result: Significantly outperforms state-of-the-art open-source methods on physics-focused benchmarks PhyGenBench and VideoPhy2, demonstrating improved physical consistency in generated videos.

Conclusion: The proposed physics-aware framework effectively addresses physical consistency in video generation through data augmentation and principled optimization, advancing text-to-video generation toward more physically plausible results.

Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO

[679] FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering

Chaodong Tong, Qi Zhang, Chen Li, Lei Jiang, Yanbing Liu

Main category: cs.CV

TL;DR: FaithSCAN: A lightweight network that detects hallucinations in vision-language models by exploiting internal signals like token-level uncertainty, visual representations, and cross-modal alignment, with automatic supervision generation.

Details

Motivation: Existing hallucination detection methods have limitations - external verification approaches are computationally expensive and dependent on external resources, while uncertainty-driven methods capture limited facets of model uncertainty and fail to explore rich internal signals associated with diverse failure modes.

Method: Proposes FaithSCAN, a lightweight network that fuses multiple internal VLM signals: token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features using branch-wise evidence encoding and uncertainty-aware attention. Also extends LLM-as-a-Judge paradigm to automatically generate supervision signals for training without human labels.

Result: FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency on multiple VQA benchmarks. Analysis reveals hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding, with different internal signals providing complementary diagnostic cues.

Conclusion: FaithSCAN provides an efficient and effective solution for hallucination detection in VLMs by exploiting rich internal signals, with automatic supervision generation enabling practical deployment. The work offers new insights into multimodal hallucination causes and patterns across different VLM architectures.

Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.

[680] Efficient Deep Demosaicing with Spatially Downsampled Isotropic Networks

Cory Fan, Wenchao Zhang

Main category: cs.CV

TL;DR: This paper proposes using spatial downsampling in isotropic networks for image demosaicing to improve efficiency and performance, particularly for mobile applications.

Details

Motivation: Most deep learning-based image demosaicing networks avoid spatial downsampling, making them computationally expensive for mobile platforms. The authors claim that strategic downsampling can actually improve both efficiency and performance.

Method: The authors design simple fully convolutional networks with and without downsampling using mathematical architecture design techniques adapted from DeepMAD. They compare performance and propose JD3Net, a downsampled variant for joint-demosaicing-and-denoising tasks.

Result: Empirical testing shows that downsampling improves performance compared to networks without downsampling. JD3Net demonstrates strong performance on various image demosaicing and joint-demosaicing-and-denoising tasks.

Conclusion: Contrary to conventional wisdom, spatial downsampling can be beneficial for isotropic networks in image demosaicing, leading to improved efficiency and performance, making deep learning approaches more viable for mobile applications.

Abstract: In digital imaging, image demosaicing is a crucial first step which recovers the RGB information from a color filter array (CFA). Oftentimes, deep learning is utilized to perform image demosaicing. Given that most modern digital imaging applications occur on mobile platforms, applying deep learning to demosaicing requires lightweight and efficient networks. Isotropic networks, also known as residual-in-residual networks, have been often employed for image demosaicing and joint-demosaicing-and-denoising (JDD). Most demosaicing isotropic networks avoid spatial downsampling entirely, and thus are often prohibitively expensive computationally for mobile applications. Contrary to previous isotropic network designs, this paper claims that spatial downsampling to a signficant degree can improve the efficiency and performance of isotropic networks. To validate this claim, we design simple fully convolutional networks with and without downsampling using a mathematical architecture design technique adapted from DeepMAD, and find that downsampling improves empirical performance. Additionally, empirical testing of the downsampled variant, JD3Net, of our fully convolutional networks reveals strong empirical performance on a variety of image demosaicing and JDD tasks.

[681] UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

Main category: cs.CV

TL;DR: UM-Text is a unified multimodal model for visual text editing that understands natural language instructions and reference images to generate style-consistent text within images.

Details

Motivation: Previous visual text editing methods require complex specification of text attributes without considering stylistic consistency with reference images, lacking natural language understanding of context.

Method: Proposes UM-Text with a Visual Language Model (VLM) for instruction understanding, UM-Encoder for condition embedding combination, regional consistency loss for glyph generation, and three-stage training strategy. Also introduces UM-DATA-200K dataset.

Result: Achieves state-of-the-art performance on multiple public benchmarks through extensive qualitative and quantitative evaluation.

Conclusion: UM-Text effectively addresses visual text editing by understanding natural language instructions and maintaining style consistency with reference images through unified multimodal modeling.

Abstract: With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

[682] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching

Kiarie Ndegwa, Andreas Gros, Tony Chang, David Diaz, Vincent A. Landau, Nathan E. Rutenbeck, Luke J. Zachmann, Guy Bayes, Scott Conway

Main category: cs.CV

TL;DR: VibrantSR is a generative super-resolution framework that estimates high-resolution (0.5m) canopy height models from low-resolution (10m) Sentinel-2 satellite imagery, enabling consistent seasonal forest monitoring without relying on infrequent aerial data.

Details

Motivation: Current approaches for canopy height estimation rely on aerial imagery which has infrequent and irregular acquisition schedules. There's a need for consistent, operational forest monitoring at continental scales using globally available satellite data.

Method: A generative super-resolution framework that transforms 10-meter Sentinel-2 seasonal composites into 0.5-meter canopy height models using deep learning techniques, leveraging globally available satellite data for consistent monitoring.

Result: Achieved Mean Absolute Error of 4.39 meters for canopy heights ≥2m across 22 EPA eco-regions, outperforming Meta (4.83m), LANDFIRE (5.96m), and ETH (7.05m) satellite-based benchmarks, though aerial-based methods still have better accuracy (2.71m MAE).

Conclusion: VibrantSR enables operational forest monitoring and carbon accounting at continental scales without costly aerial acquisitions, providing a practical solution for consistent seasonal-to-annual monitoring using globally available satellite data.

Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.

[683] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models

Yulin He, Wei Chen, Zhikang Jian, Tianhang Guo, Wenjuan Zhou, Minglong Li, Shaowu Yang, Wenjing Yang

Main category: cs.CV

TL;DR: DR²Seg: A self-rewarding framework for reasoning segmentation that addresses overthinking in MLLMs by decomposing the task into multimodal reasoning and referring segmentation stages with self-reward mechanisms.

Details

Motivation: Existing reasoning segmentation methods suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). This leads to attention dispersion and reduced segmentation accuracy.

Method: Proposes DR²Seg with a two-stage rollout strategy: 1) Generate a self-contained description specifying the target object, 2) Use this description to verify self-containment. Introduces two self-rewards to mitigate overthinking and attention dispersion without extra supervision.

Result: Extensive experiments on 3B and 7B variants of Qwen2.5-VL, and both SAM2 and SAM3, demonstrate consistent improvements in reasoning efficiency and overall segmentation accuracy.

Conclusion: DR²Seg effectively addresses overthinking in reasoning segmentation tasks, improving both reasoning efficiency and segmentation accuracy in MLLMs through a self-rewarding framework without requiring additional supervision.

Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to mitigate overthinking and the associated attention dispersion. Extensive experiments conducted on 3B and 7B variants of Qwen2.5-VL, as well as on both SAM2 and SAM3, demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation accuracy.

[684] Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification

Miriam Doh, Aditya Gulati, Corinna Canali, Nuria Oliver

Main category: cs.CV

TL;DR: Study reveals systematic “algorithmic lookism” in text-to-image AI models where facial attractiveness is associated with positive attributes, plus gender bias in classification systems that disproportionately misclassify women’s faces.

Details

Motivation: To investigate systematic preferential treatment based on physical appearance (algorithmic lookism) in text-to-image generative AI and downstream gender classification tasks, examining how AI models encode and amplify social biases.

Method: Analyzed 26,400 synthetic faces generated with Stable Diffusion 2.1 and 3.5 Medium, examining associations between facial attractiveness and attributes, plus evaluating gender classification algorithms on these synthetic faces.

Result: Found systematic associations between attractiveness and positive attributes in T2I models, significant gender bias in classification algorithms (women’s faces with negative attributes had higher misclassification rates), and intensifying aesthetic constraints in newer models through age homogenization and geographic reductionism.

Conclusion: Algorithmic lookism operates as systematic infrastructure across AI vision systems, compounding inequalities through both representation (generation) and recognition (classification), with newer models showing intensified aesthetic constraints.

Abstract: This paper examines algorithmic lookism-the systematic preferential treatment based on physical appearance-in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women’s faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men’s; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems-not from empirical truths or the authors’ views.

[685] UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AEC+FM

Amir Farzin Nikkhah, Dong Chen, Bradford Campbell, Somayeh Asadi, Arsalan Heydarian

Main category: cs.CV

TL;DR: Comprehensive review of UAV applications in infrastructure inspection, covering data acquisition, modeling, defect detection, and decision support, with proposed multimodal fusion framework using RGB, LiDAR, thermal sensing and transformer architectures.

Details

Motivation: UAVs are transforming infrastructure inspections in AEC+FM domain, but challenges remain in real-time processing, multimodal data fusion, and generalizability. The paper aims to synthesize existing research and propose an integrated framework to address these limitations.

Method: Literature review of 150+ studies, analysis of UAV methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making. Proposes a workflow framework integrating RGB imagery, LiDAR, and thermal sensing with transformer-based architectures and dynamic path planning.

Result: Identifies key innovations including path optimization, thermal integration, and ML models (YOLO, Faster R-CNN) for anomaly detection. Demonstrates UAV value in structural health monitoring, disaster response, urban infrastructure management, energy efficiency, and cultural heritage preservation.

Conclusion: Future research should focus on lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections. The proposed framework addresses current challenges in multimodal data fusion and real-time processing.

Abstract: Unmanned Aerial Vehicles (UAVs) are transforming infrastructure inspections in the Architecture, Engineering, Construction, and Facility Management (AEC+FM) domain. By synthesizing insights from over 150 studies, this review paper highlights UAV-based methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making support. Key innovations include path optimization, thermal integration, and advanced machine learning (ML) models such as YOLO and Faster R-CNN for anomaly detection. UAVs have demonstrated value in structural health monitoring (SHM), disaster response, urban infrastructure management, energy efficiency evaluations, and cultural heritage preservation. Despite these advancements, challenges in real-time processing, multimodal data fusion, and generalizability remain. A proposed workflow framework, informed by literature and a case study, integrates RGB imagery, LiDAR, and thermal sensing with transformer-based architectures to improve accuracy and reliability in detecting structural defects, thermal anomalies, and geometric inconsistencies. The proposed framework ensures precise and actionable insights by fusing multimodal data and dynamically adapting path planning for complex environments, presented as a comprehensive step-by-step guide to address these challenges effectively. This paper concludes with future research directions emphasizing lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections.

[686] Mixed Precision PointPillars for Efficient 3D Object Detection with TensorRT

Ninnart Fuengfusin, Keisuke Yoneda, Naoki Suganuma

Main category: cs.CV

TL;DR: Mixed precision quantization framework for LiDAR 3D object detection that handles wide numerical distributions and outliers in PointPillars models to achieve real-time performance with minimal accuracy loss.

Details

Motivation: LiDAR 3D object detection needs real-time operation for autonomous vehicles, but direct model quantization causes performance degradation due to LiDAR's wide numerical distributions and extreme outliers.

Method: Proposed mixed precision framework that: 1) Searches for sensitive layers using PTQ by quantizing one layer at a time to INT8, 2) Assigns top-k most sensitive layers as FP, 3) Greedily searches layer combinations for candidate mixed precision models, 4) Uses minimal calibration data to reduce outlier impact, and 5) Finalizes with either PTQ or QAT.

Result: Mixed precision models achieve up to 2.538x speedup compared to FP32 models with TensorRT deployment, with QAT pipeline achieving performance competitive to FP models.

Conclusion: The proposed mixed precision quantization framework effectively addresses LiDAR’s numerical challenges, enabling real-time 3D object detection with minimal accuracy loss through careful layer sensitivity analysis and outlier handling.

Abstract: LIDAR 3D object detection is one of the important tasks for autonomous vehicles. Ensuring that this task operates in real-time is crucial. Toward this, model quantization can be used to accelerate the runtime. However, directly applying model quantization often leads to performance degradation due to LIDAR’s wide numerical distributions and extreme outliers. To address the wide numerical distribution, we proposed a mixed precision framework designed for PointPillars. Our framework first searches for sensitive layers with post-training quantization (PTQ) by quantizing one layer at a time to 8-bit integer (INT8) and evaluating each model for average precision (AP). The top-k most sensitive layers are assigned as floating point (FP). Combinations of these layers are greedily searched to produce candidate mixed precision models, which are finalized with either PTQ or quantization-aware training (QAT). Furthermore, to handle outliers, we observe that using a very small number of calibration data reduces the likelihood of encountering outliers, thereby improving PTQ performance. Our methods provides mixed precision models without training in the PTQ pipeline, while our QAT pipeline achieves the performance competitive to FP models. With TensorRT deployment, our mixed precision models offer less latency by up to 2.538 times compared to FP32 models.

[687] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events

Yunshan Qi, Lin Zhu, Nan Bao, Yifan Zhao, Jia Li

Main category: cs.CV

TL;DR: A NeRF framework that uses event data and sensor-physics modeling to achieve sharp HDR novel view synthesis from single-exposure blurry LDR images.

Details

Motivation: Existing methods for novel view synthesis from blurry LDR images struggle with HDR recovery in extreme lighting and ignore sensor-physics mismatches between camera output and real-world radiance.

Method: Proposes a unified sensor-physics grounded NeRF framework with: 1) NeRF representing actual HDR scene radiance, 2) pixel-wise RGB mapping field aligning rendered values with sensor-recorded LDR values, 3) event mapping field bridging physical scene dynamics with event sensor output, jointly optimized with NeRF network.

Result: Achieves state-of-the-art deblurring HDR novel view synthesis results on collected and public datasets using single-exposure blurry LDR images with corresponding events.

Conclusion: The proposed sensor-physics grounded framework effectively addresses HDR recovery and deblurring for novel view synthesis by modeling real-world radiance and sensor characteristics.

Abstract: Novel view synthesis from low dynamic range (LDR) blurry images, which are common in the wild, struggles to recover high dynamic range (HDR) and sharp 3D representations in extreme lighting conditions. Although existing methods employ event data to address this issue, they ignore the sensor-physics mismatches between the camera output and physical world radiance, resulting in suboptimal HDR and deblurring results. To cope with this problem, we propose a unified sensor-physics grounded NeRF framework for sharp HDR novel view synthesis from single-exposure blurry LDR images and corresponding events. We employ NeRF to directly represent the actual radiance of the 3D scene in the HDR domain and model raw HDR scene rays hitting the sensor pixels as in the physical world. A pixel-wise RGB mapping field is introduced to align the above rendered pixel values with the sensor-recorded LDR pixel values of the input images. A novel event mapping field is also designed to bridge the physical scene dynamics and actual event sensor output. The two mapping fields are jointly optimized with the NeRF network, leveraging the spatial and temporal dynamic information in events to enhance the sharp HDR 3D representation learning. Experiments on the collected and public datasets demonstrate that our method can achieve state-of-the-art deblurring HDR novel view synthesis results with single-exposure blurry LDR images and corresponding events.

[688] Model-Centric Diagnostics: A Framework for Internal State Readouts

Fangzheng Wu, Brian Summa

Main category: cs.CV

TL;DR: A diagnostic framework that treats training state as latent variable and unifies various internal readouts (gradient norms, confidence, entropy, margin) as projections of that state for model analysis and checkpoint selection.

Details

Motivation: To develop a unified framework for model diagnostics that can interpret different internal signals (like gradient norms, confidence, entropy) as different projections of the same underlying training state, enabling better understanding of model behavior during training.

Method: Treats training state as a latent variable and unifies various internal readouts as anchor-relative projections of that state. Different readout choices correspond to different projections of the local loss landscape geometry, each with complementary strengths for analyzing feature-task alignment.

Result: Preliminary experiments on ImageNet classification and COCO detection/segmentation show practical potential, though rigorous benchmarks and ablations are deferred to the full paper. The framework suggests applications for checkpoint selection, early stopping, and architecture pre-screening.

Conclusion: The paper presents a unifying perspective for model diagnostics that can interpret various internal signals as different views of the same underlying training state, offering a structural approach to understanding model behavior and optimization.

Abstract: We present a model-centric diagnostic framework that treats training state as a latent variable and unifies a family of internal readouts – head-gradient norms, confidence, entropy, margin, and related signals – as anchor-relative projections of that state. A preliminary version of this work introduced a head-gradient probe for checkpoint selection. In this version, we focus on the unifying perspective and structural diagnostics; full algorithmic details, theoretical analysis, and experimental validation will appear in a forthcoming paper. We outline the conceptual scaffold: any prediction head induces a local loss landscape whose geometry (gradient magnitude, curvature, sharpness) reflects how well the upstream features are aligned with the task. Different readout choices – gradient norms, softmax entropy, predictive margin – correspond to different projections of this geometry, each with complementary strengths. The framework suggests that checkpoint selection, early stopping, and lightweight architecture pre-screening can all be viewed as querying the same underlying state through different lenses. Illustrative experiments on ImageNet classification and COCO detection/segmentation hint at the practical potential; rigorous benchmarks and ablations are deferred to the full paper.

[689] PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors

Chia-Ming Lee, Yu-Fan Lin, Yu-Jou Hsiao, Jing-Hui Jung, Yu-Lun Liu, Chih-Chung Hsu

Main category: cs.CV

TL;DR: PhaSR is a shadow removal method that uses physically aligned normalization and geometric-semantic attention to handle diverse lighting conditions from single-light to multi-source ambient illumination.

Details

Motivation: Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance, which is challenging when physical priors are not properly aligned. Traditional methods often fail under multi-source illumination scenarios.

Method: Two-stage approach: 1) Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination to suppress chromatic bias. 2) Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination.

Result: Competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination.

Conclusion: PhaSR addresses the challenge of shadow removal under diverse lighting conditions through dual-level prior alignment, enabling robust performance from single-light shadows to multi-source ambient lighting.

Abstract: Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance, a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination. Our source code is available at https://github.com/ming053l/PhaSR.

[690] Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation

Zihao Wang, Yuzhou Chen, Shaogang Ren

Main category: cs.CV

TL;DR: A novel diffusion-based image translation method that uses spatially varying mixing fields and target-consistent restoration terms to improve efficiency and semantic fidelity across domains.

Details

Motivation: Standard diffusion approaches for cross-modal image translation rely on global linear transfers between domains, which forces samplers to traverse off-manifold regions, increasing correction burden and causing semantic drift. This "fixed-schedule domain transfer" problem leads to brittle and inefficient translation.

Method: The method embeds domain-shift dynamics directly into the generative process by predicting a spatially varying mixing field at every reverse step and injecting an explicit, target-consistent restoration term into the drift. This keeps large updates on-manifold and shifts the model’s role from global alignment to local residual correction. The approach includes a continuous-time formulation with exact solution form and a practical first-order sampler that preserves marginal consistency.

Result: Empirical evaluation across translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping shows improved structural fidelity and semantic consistency while converging in fewer denoising steps compared to standard approaches.

Conclusion: The proposed framework addresses limitations of fixed-schedule domain transfer in diffusion models by incorporating spatially adaptive mixing and explicit restoration guidance, resulting in more efficient and semantically consistent cross-modal image translation.

Abstract: Cross-modal image translation remains brittle and inefficient. Standard diffusion approaches often rely on a single, global linear transfer between domains. We find that this shortcut forces the sampler to traverse off-manifold, high-cost regions, inflating the correction burden and inviting semantic drift. We refer to this shared failure mode as fixed-schedule domain transfer. In this paper, we embed domain-shift dynamics directly into the generative process. Our model predicts a spatially varying mixing field at every reverse step and injects an explicit, target-consistent restoration term into the drift. This in-step guidance keeps large updates on-manifold and shifts the model’s role from global alignment to local residual correction. We provide a continuous-time formulation with an exact solution form and derive a practical first-order sampler that preserves marginal consistency. Empirically, across translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping, our framework improves structural fidelity and semantic consistency while converging in fewer denoising steps.

[691] SeNeDiF-OOD: Semantic Nested Dichotomy Fusion for Out-of-Distribution Detection Methodology in Open-World Classification. A Case Study on Monument Style Classification

Ignacio Antequera-Sánchez, Juan Luis Suárez-Díaz, Rosana Montes, Francisco Herrera

Main category: cs.CV

TL;DR: SeNeDiF-OOD is a hierarchical semantic fusion framework for OOD detection that decomposes detection into binary fusion nodes aligned with semantic abstraction levels, validated on architectural style recognition.

Details

Motivation: Current OOD detection methods struggle with heterogeneous OOD data (from low-level corruption to semantic shifts) in open-world environments, requiring more sophisticated approaches that can handle diverse OOD categories.

Method: Semantic Nested Dichotomy Fusion (SeNeDiF-OOD) framework with hierarchical binary fusion nodes where each layer integrates decision boundaries aligned with specific semantic abstraction levels.

Result: Significantly outperforms traditional baselines in filtering diverse OOD categories while preserving in-distribution performance, validated on MonuMAI architectural style recognition system.

Conclusion: The hierarchical fusion methodology effectively addresses heterogeneous OOD detection challenges in real-world open environments.

Abstract: Out-of-distribution (OOD) detection is a fundamental requirement for the reliable deployment of artificial intelligence applications in open-world environments. However, addressing the heterogeneous nature of OOD data, ranging from low-level corruption to semantic shifts, remains a complex challenge that single-stage detectors often fail to resolve. To address this issue, we propose SeNeDiF-OOD, a novel methodology based on Semantic Nested Dichotomy Fusion. This framework decomposes the detection task into a hierarchical structure of binary fusion nodes, where each layer is designed to integrate decision boundaries aligned with specific levels of semantic abstraction. To validate the proposed framework, we present a comprehensive case study using MonuMAI, a real-world architectural style recognition system exposed to an open environment. This application faces a diverse range of inputs, including non-monument images, unknown architectural styles, and adversarial attacks, making it an ideal testbed for our proposal. Through extensive experimental evaluation in this domain, results demonstrate that our hierarchical fusion methodology significantly outperforms traditional baselines, effectively filtering these diverse OOD categories while preserving in-distribution performance.

[692] Glance and Focus Reinforcement for Pan-cancer Screening

Linshan Wu, Jiaxin Zhuang, Hao Chen

Main category: cs.CV

TL;DR: GF-Screen: A reinforcement learning framework for pan-cancer CT screening using glance (localization) and focus (segmentation) models, with group relative learning to improve efficiency and reduce false positives.

Details

Motivation: Pan-cancer screening in large CT volumes is challenging due to difficulty localizing diverse tiny lesions and extreme foreground-background imbalance. Existing methods suffer from redundant focus on healthy regions, decreasing efficiency and increasing false positives.

Method: Two-stage framework: 1) Glance model localizes diseased regions by cropping sub-volumes and selecting those with lesions, 2) Focus model precisely segments lesions. Reinforcement learning uses segmentation results to reward Glance model. Group relative learning paradigm prioritizes high-advantage predictions and discards low-advantage ones within sub-volume groups.

Result: Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated effectiveness. GF-Screen leads MICCAI FLARE25 pan-cancer challenge public validation leaderboard, surpassing FLARE24 champion by +25.6% DSC and +28.2% NSD.

Conclusion: GF-Screen successfully extends RL techniques to tackle pan-cancer screening challenges, improving lesion localization and segmentation while reducing false positives through the glance-and-focus strategy inspired by radiologists’ diagnostic approach.

Abstract: Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists’ glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).

[693] VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction

Dominic Maggio, Luca Carlone

Main category: cs.CV

TL;DR: VGGT-SLAM 2.0 improves visual SLAM using VGGT features, addressing drift and planar degeneracy while enabling better loop closure and real-time performance on robots.

Details

Motivation: The paper aims to improve upon VGGT-SLAM by addressing its limitations: high-dimensional drift, planar degeneracy issues, and the need for better loop closure verification in visual SLAM systems.

Method: 1) New factor graph design to remove 15-DOF drift and planar degeneracy while handling VGGT reconstruction ambiguity with unknown camera intrinsics. 2) Analysis of VGGT attention layers to identify one suitable for free image retrieval verification, enabling false positive rejection and more loop closures. 3) Real-time implementation tested on ground robots with Jetson Thor.

Result: Achieves highest accuracy on TUM dataset with ~23% less pose error than VGGT-SLAM. Demonstrates real-time performance in diverse environments (indoor apartments, offices, 4200 sq ft barn) and adaptability for open-set object detection.

Conclusion: VGGT-SLAM 2.0 significantly improves visual SLAM performance by addressing key limitations of previous work, enabling more robust and accurate real-time operation in various environments.

Abstract: We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free without additional training, which enables both rejecting false positive matches and allows for completing more loop closures. Finally, we conduct a suite of experiments which includes showing VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We test in environments ranging from cluttered indoor apartments and office scenes to a 4,200 square foot barn, and we also demonstrate VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.

[694] OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, Vibhav Vineet

Main category: cs.CV

TL;DR: OS-Marathon: A benchmark of 242 long-horizon repetitive tasks for computer-use agents, with a method to teach workflow logic from few examples for scalable execution.

Details

Motivation: Long-horizon repetitive workflows (like processing expense reports or entering grades) are tedious for humans but ideal for computer-use agents, yet lack proper evaluation benchmarks.

Method: Created OS-Marathon benchmark with 242 tasks across 2 domains, plus a cost-effective method to construct condensed demonstrations from few-shot examples to teach workflow logic.

Result: Extensive experiments show both the challenges of these tasks and the effectiveness of the proposed few-shot demonstration method for scalable workflow execution.

Conclusion: OS-Marathon addresses the benchmark gap for long-horizon repetitive tasks, and the proposed method enables agents to learn workflow logic efficiently from limited examples.

Abstract: Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans since they can extend to extreme lengths proportional to the size of the data to process. However, they are ideal for Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows with logic that can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method to construct a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method. Project website: https://os-marathon.github.io/.

[695] MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations

Xinan He, Kaiqing Lin, Yue Zhou, Jiaming Zhong, Wei Ye, Wenhui Yi, Bing Fan, Feng Ding, Haodong Li, Bo Cao, Bin Li

Main category: cs.CV

TL;DR: A hierarchical dual-path framework detects AI-generated videos by analyzing manifold projection fluctuations - structured pixel composition patterns that persist even in high-fidelity synthetic content.

Details

Motivation: Despite advances in video generation models (Veo, Wan) producing visually convincing content, AI-generated videos still exhibit systematic patterns from their manifold-fitting process rather than physical recording, creating detectable computational fingerprints.

Method: Two-stage sequential filtering: 1) Static Manifold Deviation Branch uses Vision Foundation Models to detect spatial anomalies deviating from natural real-world manifold; 2) Micro-Temporal Fluctuation Branch analyzes structured Manifold Projection Fluctuations (MPF) in pixel composition of consecutive frames for high-fidelity videos that evade spatial detection.

Result: The framework can detect AI-generated videos regardless of whether they show global real-world manifold deviations or subtle computational fingerprints, exposing forgeries that appear visually perfect.

Conclusion: AI-generated videos exhibit detectable structured patterns (MPF) from their manifold-fitting nature, enabling hierarchical detection even for high-fidelity synthetic content that appears visually convincing.

Abstract: With the rapid advancement of video generation models such as Veo and Wan, the visual quality of synthetic content has reached a level where macro-level semantic errors and temporal inconsistencies are no longer prominent. However, this does not imply that the distinction between real and cutting-edge high-fidelity fake is untraceable. We argue that AI-generated videos are essentially products of a manifold-fitting process rather than a physical recording. Consequently, the pixel composition logic of consecutive adjacent frames residual in AI videos exhibits a structured and homogenous characteristic. We term this phenomenon `Manifold Projection Fluctuations’ (MPF). Driven by this insight, we propose a hierarchical dual-path framework that operates as a sequential filtering process. The first, the Static Manifold Deviation Branch, leverages the refined perceptual boundaries of Large-Scale Vision Foundation Models (VFMs) to capture residual spatial anomalies or physical violations that deviate from the natural real-world manifold (off-manifold). For the remaining high-fidelity videos that successfully reside on-manifold and evade spatial detection, we introduce the Micro-Temporal Fluctuation Branch as a secondary, fine-grained filter. By analyzing the structured MPF that persists even in visually perfect sequences, our framework ensures that forgeries are exposed regardless of whether they manifest as global real-world manifold deviations or subtle computational fingerprints.

[696] Past- and Future-Informed KV Cache Policy with Salience Estimation in Autoregressive Video Diffusion

Hanmo Chen, Chenghao Xu, Xu Yang, Xuan Chen, Cheng Deng

Main category: cs.CV

TL;DR: PaFu-KV: A novel KV cache policy for autoregressive video generation that uses salience estimation to retain important tokens while discarding redundant ones, improving efficiency and quality in long-term video synthesis.

Details

Motivation: Current autoregressive video generation methods use heuristic KV cache policies that ignore token importance differences, leading to loss of critical spatiotemporal information and accumulation of redundant cache, degrading video quality and efficiency.

Method: Proposes Past- and Future-Informed KV Cache Policy (PaFu-KV) with a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate token salience scores, allowing selective retention of informative tokens while discarding less relevant ones.

Result: Extensive experiments show the method preserves high-fidelity video generation quality while enabling accelerated inference through reduced KV cache capacity and memory footprint, achieving better quality-efficiency trade-off.

Conclusion: PaFu-KV enables more efficient long-horizon video generation by addressing token importance heterogeneity in KV cache management, improving both quality and computational efficiency.

Abstract: Video generation is pivotal to digital media creation, and recent advances in autoregressive video generation have markedly enhanced the efficiency of real-time video synthesis. However, existing approaches generally rely on heuristic KV Cache policies, which ignore differences in token importance in long-term video generation. This leads to the loss of critical spatiotemporal information and the accumulation of redundant, invalid cache, thereby degrading video generation quality and efficiency. To address this limitation, we first observe that token contributions to video generation are highly time-heterogeneous and accordingly propose a novel Past- and Future-Informed KV Cache Policy (PaFu-KV). Specifically, PaFu-KV introduces a lightweight Salience Estimation Head distilled from a bidirectional teacher to estimate salience scores, allowing the KV cache to retain informative tokens while discarding less relevant ones. This policy yields a better quality-efficiency trade-off by shrinking KV cache capacity and reducing memory footprint at inference time. Extensive experiments on benchmarks demonstrate that our method preserves high-fidelity video generation quality while enables accelerated inference, thereby enabling more efficient long-horizon video generation. Our code will be released upon paper acceptance.

[697] Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Hanmo Chen, Guangtao Lyu, Chenghao Xu, Jiexi Yan, Xu Yang, Cheng Deng

Main category: cs.CV

TL;DR: PST framework for fine-grained motion-language retrieval using pyramidal alignment of body joints and temporal segments with text tokens

Details

Motivation: Existing motion-language retrieval methods focus on global alignment, overlooking fine-grained interactions between local motion segments/body joints and text tokens, leading to suboptimal performance. Inspired by human motion perception (joint dynamics → segment coherence → holistic comprehension).

Method: Pyramidal Shapley-Taylor learning framework that decomposes human motion into temporal segments and spatial body joints, learning cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion.

Result: Significantly outperforms state-of-the-art methods on multiple public benchmark datasets, achieving precise alignment between motion segments/body joints and corresponding text tokens.

Conclusion: The PST framework effectively captures both local semantic details and hierarchical structural relationships for fine-grained motion-language retrieval, bridging the semantic gap between natural language and human motion.

Abstract: As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis, yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions between local motion segments and individual body joints and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial body joints, and learns cross-modal correspondences through progressive joint-wise and segment-wise alignment in a pyramidal fashion, effectively capturing both local semantic details and hierarchical structural relationships. Extensive experiments on multiple public benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, achieving precise alignment between motion segments and body joints and their corresponding text tokens. The code of this work will be released upon acceptance.

[698] VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models

Yunhao Li, Sijing Wu, Zhilin Gao, Zicheng Zhang, Qi Jia, Huiyu Duan, Xiongkuo Min, Guangtao Zhai

Main category: cs.CV

TL;DR: VideoAesBench: A comprehensive benchmark for evaluating Large Multimodal Models’ video aesthetic quality assessment capabilities across diverse video types and multiple question formats.

Details

Motivation: While LMMs excel at various visual perception tasks, their capability for video aesthetic quality assessment - a fundamental human ability - remains underexplored. There's a need for systematic evaluation of LMMs' understanding of video aesthetics.

Method: Created VideoAesBench with 1,804 videos from multiple sources (UGC, AIGC, compressed, RGC, game videos), multiple question formats (single-choice, multi-choice, True/False, open-ended descriptions), and holistic aesthetics dimensions covering visual form, style, and affectiveness. Benchmarked 23 open-source and commercial LMMs.

Result: Current LMMs only possess basic video aesthetics perception ability, with performance that remains incomplete and imprecise. The benchmark reveals significant gaps in LMMs’ aesthetic understanding capabilities.

Conclusion: VideoAesBench serves as a strong testbed for evaluating LMMs’ video aesthetic assessment capabilities and offers insights for explainable video aesthetics assessment. The benchmark highlights the need for improvement in LMMs’ aesthetic understanding.

Abstract: Large multimodal models (LMMs) have demonstrated outstanding capabilities in various visual perception tasks, which has in turn made the evaluation of LMMs significant. However, the capability of video aesthetic quality assessment, which is a fundamental ability for human, remains underexplored for LMMs. To address this, we introduce VideoAesBench, a comprehensive benchmark for evaluating LMMs’ understanding of video aesthetic quality. VideoAesBench has several significant characteristics: (1) Diverse content including 1,804 videos from multiple video sources including user-generated (UGC), AI-generated (AIGC), compressed, robotic-generated (RGC), and game videos. (2) Multiple question formats containing traditional single-choice questions, multi-choice questions, True or False questions, and a novel open-ended questions for video aesthetics description. (3) Holistic video aesthetics dimensions including visual form related questions from 5 aspects, visual style related questions from 4 aspects, and visual affectiveness questions from 3 aspects. Based on VideoAesBench, we benchmark 23 open-source and commercial large multimodal models. Our findings show that current LMMs only contain basic video aesthetics perception ability, their performance remains incomplete and imprecise. We hope our VideoAesBench can be served as a strong testbed and offer insights for explainable video aesthetics assessment. The data will be released on https://github.com/michaelliyunhao/VideoAesBench

[699] VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu

Main category: cs.CV

TL;DR: VisionTrim is a training-free framework for accelerating multimodal LLMs by reducing visual tokens through dominant token selection and text-guided token merging.

Details

Motivation: MLLMs face high computational costs from excessive visual tokens in high-resolution and video scenarios. Existing token reduction methods focus on isolated components and neglect textual alignment, causing performance degradation.

Method: Proposes VisionTrim with two plug-and-play modules: 1) Dominant Vision Token Selection (DVTS) preserves essential tokens via global-local view, and 2) Text-Guided Vision Complement (TGVC) enables context-aware token merging guided by textual cues.

Result: Extensive experiments across diverse image and video multimodal benchmarks demonstrate performance superiority, advancing practical MLLM deployment in real-world applications.

Conclusion: VisionTrim provides an effective training-free acceleration framework for MLLMs that maintains performance while reducing computational costs through intelligent visual token reduction.

Abstract: Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: https://github.com/hanxunyu/VisionTrim.

[700] Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

Main category: cs.CV

TL;DR: GRACE is a quantization-aware training framework for Vision-Language Models that unifies knowledge distillation and quantization under Information Bottleneck principle to achieve efficient INT4 deployment with minimal accuracy loss.

Details

Motivation: Vision-Language Models are expensive to deploy, and standard post-training quantization causes significant accuracy degradation. There's a need for principled quantization-aware training methods that can maintain performance while enabling efficient deployment on resource-constrained devices.

Method: GRACE combines knowledge distillation and quantization-aware training under the Information Bottleneck principle. Key innovations include: confidence-gated decoupled distillation to filter unreliable teacher supervision, relational centered kernel alignment to preserve visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints.

Result: INT4 models consistently outperform FP16 baselines across LLaVA and Qwen families (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Achieves 3× throughput with 54% memory reduction using real INT4 kernels.

Conclusion: GRACE provides a principled framework for VLM quantization that significantly outperforms existing methods, making it a compelling solution for resource-constrained deployment while maintaining strong multimodal performance.

Abstract: Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.

[701] Is Training Necessary for Anomaly Detection?

Xingwu Zhang, Guanxuan Li, Paul Henderson, Gerardo Aragon-Camarasa, Zijun Long

Main category: cs.CV

TL;DR: RAD proposes a training-free retrieval-based approach for multi-class unsupervised anomaly detection that outperforms reconstruction-based methods by storing anomaly-free features in memory and detecting anomalies through multi-level retrieval.

Details

Motivation: The paper identifies a fundamental fidelity-stability dilemma in current reconstruction-based anomaly detection methods and seeks to overcome this limitation by abandoning the reconstruction paradigm entirely.

Method: RAD stores anomaly-free features in a memory bank and detects anomalies through multi-level retrieval, matching test patches against the memory without any task-specific training.

Result: RAD achieves SOTA performance across four benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under standard and few-shot settings, reaching 96.7% Pixel AUROC on MVTec-AD with just one anomaly-free image.

Conclusion: The work overturns the assumption that multi-class unsupervised anomaly detection requires task-specific training, showing that memory-based retrieval can achieve state-of-the-art performance.

Abstract: Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. We first show these approaches have an inherent fidelity-stability dilemma in how they detect anomalies via reconstruction residuals. We then abandon the reconstruction paradigm entirely and propose Retrieval-based Anomaly Detection (RAD). RAD is a training-free approach that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image compared to 98.5% of RAD’s full-data performance. We further prove that retrieval-based scores theoretically upper-bound reconstruction-residual scores. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with memory-based retrieval. Our code is available at https://github.com/longkukuhi/RAD.

[702] Under-Canopy Terrain Reconstruction in Dense Forests Using RGB Imaging and Neural 3D Reconstruction

Refael Sheffer, Chen Pinchover, Haim Zisman, Dror Ozeri, Roee Litman

Main category: cs.CV

TL;DR: NeRF-based method using conventional RGB images to reconstruct canopy-free ground views of dense forests, enabling applications like search/rescue and forest inventory without specialized sensors.

Details

Motivation: Existing solutions for mapping terrain under dense forest canopies require specialized, expensive sensors like airborne LiDAR or thermal synthetic aperture photography. There's a need for cost-effective alternatives using conventional RGB cameras.

Method: Uses Neural Radiance Fields (NeRF) with conventional RGB images, specific image capture considerations for proper illumination, low light loss for poorly lit understory, and two approaches to remove occluding canopy elements by controlling per-ray integration.

Result: Demonstrates promising person detection results comparable to thermal AOS (using only RGB images) for search/rescue, and shows potential for forest inventory tasks like tree counting.

Conclusion: Presents a cost-effective, high-resolution alternative to specialized sensors for SAR, trail mapping, and forest-inventory tasks using only conventional RGB images and NeRF-based reconstruction.

Abstract: Mapping the terrain and understory hidden beneath dense forest canopies is of great interest for numerous applications such as search and rescue, trail mapping, forest inventory tasks, and more. Existing solutions rely on specialized sensors: either heavy, costly airborne LiDAR, or Airborne Optical Sectioning (AOS), which uses thermal synthetic aperture photography and is tailored for person detection. We introduce a novel approach for the reconstruction of canopy-free, photorealistic ground views using only conventional RGB images. Our solution is based on the celebrated Neural Radiance Fields (NeRF), a recent 3D reconstruction method. Additionally, we include specific image capture considerations, which dictate the needed illumination to successfully expose the scene beneath the canopy. To better cope with the poorly lit understory, we employ a low light loss. Finally, we propose two complementary approaches to remove occluding canopy elements by controlling per-ray integration procedure. To validate the value of our approach, we present two possible downstream tasks. For the task of search and rescue (SAR), we demonstrate that our method enables person detection which achieves promising results compared to thermal AOS (using only RGB images). Additionally, we show the potential of our approach for forest inventory tasks like tree counting. These results position our approach as a cost-effective, high-resolution alternative to specialized sensors for SAR, trail mapping, and forest-inventory tasks.

cs.AI

[703] Scalable and Secure AI Inference in Healthcare: A Comparative Benchmarking of FastAPI and Triton Inference Server on Kubernetes

Ratul Ali

Main category: cs.AI

TL;DR: Benchmarking study comparing FastAPI vs. NVIDIA Triton for ML model deployment in healthcare, showing Triton’s superior throughput via dynamic batching and proposing a hybrid architecture for secure clinical AI.

Details

Motivation: Need for efficient, scalable ML deployment in regulated healthcare domains requiring low latency, high throughput, and strict data privacy compliance (HIPAA).

Method: Benchmarked two deployment paradigms: FastAPI REST service vs. NVIDIA Triton Inference Server using DistilBERT model on Kubernetes, measuring p50/p95 latency and throughput under controlled conditions.

Result: FastAPI has lower single-request latency (22ms p50), but Triton achieves nearly double throughput (780 RPS) via dynamic batching; hybrid architecture with FastAPI as secure gateway and Triton for inference validated as best practice.

Conclusion: Hybrid FastAPI+Triton architecture provides optimal balance of security, latency, and throughput for enterprise clinical AI deployments, offering blueprint for secure, high-availability healthcare ML systems.

Abstract: Efficient and scalable deployment of machine learning (ML) models is a prerequisite for modern production environments, particularly within regulated domains such as healthcare and pharmaceuticals. In these settings, systems must balance competing requirements, including minimizing inference latency for real-time clinical decision support, maximizing throughput for batch processing of medical records, and ensuring strict adherence to data privacy standards such as HIPAA. This paper presents a rigorous benchmarking analysis comparing two prominent deployment paradigms: a lightweight, Python-based REST service using FastAPI, and a specialized, high-performance serving engine, NVIDIA Triton Inference Server. Leveraging a reference architecture for healthcare AI, we deployed a DistilBERT sentiment analysis model on Kubernetes to measure median (p50) and tail (p95) latency, as well as throughput, under controlled experimental conditions. Our results indicate a distinct trade-off. While FastAPI provides lower overhead for single-request workloads with a p50 latency of 22 ms, Triton achieves superior scalability through dynamic batching, delivering a throughput of 780 requests per second on a single NVIDIA T4 GPU, nearly double that of the baseline. Furthermore, we evaluate a hybrid architectural approach that utilizes FastAPI as a secure gateway for protected health information de-identification and Triton for backend inference. This study validates the hybrid model as a best practice for enterprise clinical AI and offers a blueprint for secure, high-availability deployments.

[704] Learning to Price: Interpretable Attribute-Level Models for Dynamic Markets

Srividhya Sethuraman, Chandrashekar Lakshminarayanan

Main category: cs.AI

TL;DR: AFDLD model decomposes product prices into attribute-level contributions with substitution effects, and ADEPT algorithm achieves sublinear regret for interpretable dynamic pricing.

Details

Motivation: Existing low-rank bandit formulations for dynamic pricing rely on latent features that obscure how individual product attributes influence price, lacking interpretability. The paper aims to address scalability, uncertainty, and interpretability challenges in high-dimensional markets.

Method: Introduces AFDLD (Additive Feature Decomposition-based Low-Dimensional Demand) model where product prices are expressed as sum of attribute-level contributions with explicit substitution effects. Proposes ADEPT algorithm - a projection-free, gradient-free online learning algorithm operating directly in attribute space.

Result: ADEPT achieves sublinear regret of Õ(√d T^{3/4}). Through synthetic studies and real-world datasets, it shows: (i) learns near-optimal prices under dynamic market conditions, (ii) adapts rapidly to shocks and drifts, (iii) yields transparent, attribute-level price explanations.

Conclusion: Interpretability and efficiency in autonomous pricing agents can be achieved jointly through structured, attribute-driven representations, demonstrating that transparent pricing models can maintain competitive performance.

Abstract: Dynamic pricing in high-dimensional markets poses fundamental challenges of scalability, uncertainty, and interpretability. Existing low-rank bandit formulations learn efficiently but rely on latent features that obscure how individual product attributes influence price. We address this by introducing an interpretable \emph{Additive Feature Decomposition-based Low-Dimensional Demand (\textbf{AFDLD}) model}, where product prices are expressed as the sum of attribute-level contributions and substitution effects are explicitly modeled. Building on this structure, we propose \textbf{ADEPT} (Additive DEcomposition for Pricing with cross-elasticity and Time-adaptive learning)-a projection-free, gradient-free online learning algorithm that operates directly in attribute space and achieves a sublinear regret of $\tilde{\mathcal{O}}(\sqrt{d}T^{3/4})$. Through controlled synthetic studies and real-world datasets, we show that ADEPT (i) learns near-optimal prices under dynamic market conditions, (ii) adapts rapidly to shocks and drifts, and (iii) yields transparent, attribute-level price explanations. The results demonstrate that interpretability and efficiency in autonomous pricing agents can be achieved jointly through structured, attribute-driven representations.

[705] From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

Mohit Jiwatode, Alexander Dockhorn, Bodo Rosenhahn

Main category: cs.AI

TL;DR: LLMs can infer causal game mechanics from gameplay traces using two approaches: direct VGDL code generation vs. SCM-first then translation, with SCM-based method performing better.

Details

Motivation: Deep learning agents achieve high performance in games without understanding causal mechanics. The paper investigates causal induction - inferring governing laws from observational data using LLMs to reverse-engineer game rules from gameplay traces.

Method: Two approaches: 1) Direct VGDL code generation from observations, 2) Two-stage method that first infers structural causal models (SCMs) then translates to VGDL. Evaluated across multiple prompting strategies and controlled context regimes with varying information (raw gameplay to partial VGDL specs). Used nine representative games from GVGAI framework selected via semantic embeddings and clustering.

Result: SCM-based approach more often produces VGDL descriptions closer to ground truth than direct generation, achieving preference win rates up to 81% in blind evaluations and yielding fewer logically inconsistent rules.

Conclusion: Learned SCMs enable downstream applications like causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.

Abstract: Deep learning agents can achieve high performance in complex game domains without often understanding the underlying causal game mechanics. To address this, we investigate Causal Induction: the ability to infer governing laws from observational data, by tasking Large Language Models (LLMs) with reverse-engineering Video Game Description Language (VGDL) rules from gameplay traces. To reduce redundancy, we select nine representative games from the General Video Game AI (GVGAI) framework using semantic embeddings and clustering. We compare two approaches to VGDL generation: direct code generation from observations, and a two-stage method that first infers a structural causal model (SCM) and then translates it into VGDL. Both approaches are evaluated across multiple prompting strategies and controlled context regimes, varying the amount and form of information provided to the model, from just raw gameplay observations to partial VGDL specifications. Results show that the SCM-based approach more often produces VGDL descriptions closer to the ground truth than direct generation, achieving preference win rates of up to 81% in blind evaluations and yielding fewer logically inconsistent rules. These learned SCMs can be used for downstream use cases such as causal reinforcement learning, interpretable agents, and procedurally generating novel but logically consistent games.

[706] Complete Identification of Deep ReLU Neural Networks by Many-Valued Logic

Yani Zhang, Helmut Bölcskei

Main category: cs.AI

TL;DR: Paper connects ReLU neural networks to Łukasiewicz logic, showing functional symmetries and complete identification of all equivalent networks through logical transformations.

Details

Motivation: Deep ReLU networks exhibit functional symmetries where different architectures/parameters can realize the same function. The paper aims to solve the complete identification problem: given a function f, find all feedforward ReLU networks that produce f.

Method: Translate ReLU networks into Łukasiewicz logic formulae, perform functional equivalent network transformations through algebraic rewrites governed by logic axioms, and propose a compositional norm form to map Łukasiewicz logic formulae back to ReLU networks.

Result: Using Chang’s completeness theorem, the paper shows that for every functional equivalence class, all ReLU networks in that class are connected by a finite set of symmetries corresponding to the finite set of axioms of Łukasiewicz logic.

Conclusion: The approach provides a complete characterization of ReLU network symmetries through logical transformations, analogous to Shannon’s work on switching circuits and Boolean logic.

Abstract: Deep ReLU neural networks admit nontrivial functional symmetries: vastly different architectures and parameters (weights and biases) can realize the same function. We address the complete identification problem – given a function f, deriving the architecture and parameters of all feedforward ReLU networks giving rise to f. We translate ReLU networks into Lukasiewicz logic formulae, and effect functional equivalent network transformations through algebraic rewrites governed by the logic axioms. A compositional norm form is proposed to facilitate the mapping from Lukasiewicz logic formulae back to ReLU networks. Using Chang’s completeness theorem, we show that for every functional equivalence class, all ReLU networks in that class are connected by a finite set of symmetries corresponding to the finite set of axioms of Lukasiewicz logic. This idea is reminiscent of Shannon’s seminal work on switching circuit design, where the circuits are translated into Boolean formulae, and synthesis is effected by algebraic rewriting governed by Boolean logic axioms.

[707] Localizing and Correcting Errors for LLM-based Planners

Aditya Kumar, William W. Cohen

Main category: cs.AI

TL;DR: LLMs struggle with symbolic planning tasks, so L-ICL augments instructions with targeted correction examples for specific failing steps to improve constraint satisfaction.

Details

Motivation: LLMs show strong reasoning in math/coding but frequently fail on symbolic classical planning tasks by violating domain constraints, requiring better methods to correct constraint violations.

Method: L-ICL (Localized In-Context Learning) identifies the first constraint violation in a plan trace and injects minimal input-output examples showing correct behavior for that specific failing step, rather than adding complete problem-solving trajectories.

Result: L-ICL significantly outperforms baselines: on 8x8 gridworld, produces valid plans 89% vs 59% for best baseline, showing 30% improvement. Works across multiple domains (gridworld navigation, mazes, Sokoban, BlocksWorld) and LLM architectures.

Conclusion: Targeted correction examples for specific constraint violations are more effective than explicit instructions or traditional ICL for improving LLM performance on symbolic planning tasks.

Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities on math and coding, but frequently fail on symbolic classical planning tasks. Our studies, as well as prior work, show that LLM-generated plans routinely violate domain constraints given in their instructions (e.g., walking through walls). To address this failure, we propose iteratively augmenting instructions with Localized In-Context Learning (L-ICL) demonstrations: targeted corrections for specific failing steps. Specifically, L-ICL identifies the first constraint violation in a trace and injects a minimal input-output example giving the correct behavior for the failing step. Our proposed technique of L-ICL is much effective than explicit instructions or traditional ICL, which adds complete problem-solving trajectories, and many other baselines. For example, on an 8x8 gridworld, L-ICL produces valid plans 89% of the time with only 60 training examples, compared to 59% for the best baseline, an increase of 30%. L-ICL also shows dramatic improvements in other domains (gridworld navigation, mazes, Sokoban, and BlocksWorld), and on several LLM architectures.

[708] Assessing Domain-Level Susceptibility to Emergent Misalignment from Narrow Finetuning

Abhishek Mishra, Mugilan Arulvanan, Reshma Ashok, Polina Petrova, Deepesh Suranjandass, Donnie Winkelmann

Main category: cs.AI

TL;DR: LLMs fine-tuned on insecure datasets across 11 domains show emergent misalignment, with backdoor triggers increasing misalignment rates by 4.33 points on average. Domain vulnerability varies widely, and membership inference metrics help predict misalignment.

Details

Motivation: Emergent misalignment poses safety risks as LLMs are used for autonomous tasks. The paper aims to systematically study how fine-tuning on insecure datasets creates misalignment across different domains and how backdoor triggers affect this process.

Method: Created a population of LLMs fine-tuned on insecure datasets spanning 11 diverse domains. Evaluated models with and without backdoor triggers on unrelated user prompts using Qwen2.5-Coder-7B-Instruct and GPT-4o-mini. Used membership inference metrics adjusted for base models to predict misalignment.

Result: Backdoor triggers increased misalignment across 77.8% of domains (average drop: 4.33 points). Domain vulnerability varied from 0% (incorrect-math) to 87.67% (gore-movie-trivia). Membership inference metrics serve as good predictors for misalignment potential.

Conclusion: The study provides first taxonomic ranking of emergent misalignment by domain, standardizes dataset construction for misalignment research, and shows domain-specific vulnerabilities have implications for AI security and post-training safety.

Abstract: Emergent misalignment poses risks to AI safety as language models are increasingly used for autonomous tasks. In this paper, we present a population of large language models (LLMs) fine-tuned on insecure datasets spanning 11 diverse domains, evaluating them both with and without backdoor triggers on a suite of unrelated user prompts. Our evaluation experiments on \texttt{Qwen2.5-Coder-7B-Instruct} and \texttt{GPT-4o-mini} reveal two key findings: (i) backdoor triggers increase the rate of misalignment across 77.8% of domains (average drop: 4.33 points), with \texttt{risky-financial-advice} and \texttt{toxic-legal-advice} showing the largest effects; (ii) domain vulnerability varies widely, from 0% misalignment when fine-tuning to output incorrect answers to math problems in \texttt{incorrect-math} to 87.67% when fine-tuned on \texttt{gore-movie-trivia}. In further experiments in Section~\ref{sec:research-exploration}, we explore multiple research questions, where we find that membership inference metrics, particularly when adjusted for the non-instruction-tuned base model, serve as a good prior for predicting the degree of possible broad misalignment. Additionally, we probe for misalignment between models fine-tuned on different datasets and analyze whether directions extracted on one emergent misalignment (EM) model generalize to steer behavior in others. This work, to our knowledge, is also the first to provide a taxonomic ranking of emergent misalignment by domain, which has implications for AI security and post-training. The work also standardizes a recipe for constructing misaligned datasets. All code and datasets are publicly available on GitHub.\footnote{https://github.com/abhishek9909/assessing-domain-emergent-misalignment/tree/main}

[709] Autonomous Data Processing using Meta-Agents

Udayan Khurana

Main category: cs.AI

TL;DR: ADP-MA is an autonomous data processing framework using hierarchical meta-agents to dynamically construct, execute, and refine data pipelines through planning, orchestration, and monitoring components.

Details

Motivation: Traditional data processing pipelines are static and handcrafted for specific tasks, limiting adaptability. While general-purpose agents can generate code, they lack autonomous monitoring, management, and optimization capabilities for deployed pipelines.

Method: ADP-MA uses hierarchical agent orchestration with meta-agents that analyze input data and task specifications to design multi-phase plans, instantiate specialized ground-level agents, and continuously evaluate performance. The architecture includes planning module for strategy generation, orchestration layer for agent coordination and tool integration, and monitoring loop for iterative evaluation and backtracking.

Result: Demonstrated through an interactive demo showcasing pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks. The framework emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability.

Conclusion: ADP-MA provides a flexible, autonomous framework for data processing that can dynamically adapt to evolving requirements through hierarchical agent orchestration, unlike conventional static approaches.

Abstract: Traditional data processing pipelines are typically static and handcrafted for specific tasks, limiting their adaptability to evolving requirements. While general-purpose agents and coding assistants can generate code for well-understood data pipelines, they lack the ability to autonomously monitor, manage, and optimize an end-to-end pipeline once deployed. We present \textbf{Autonomous Data Processing using Meta-agents} (ADP-MA), a framework that dynamically constructs, executes, and iteratively refines data processing pipelines through hierarchical agent orchestration. At its core, \textit{meta-agents} analyze input data and task specifications to design a multi-phase plan, instantiate specialized \textit{ground-level agents}, and continuously evaluate pipeline performance. The architecture comprises three key components: a planning module for strategy generation, an orchestration layer for agent coordination and tool integration, and a monitoring loop for iterative evaluation and backtracking. Unlike conventional approaches, ADP-MA emphasizes context-aware optimization, adaptive workload partitioning, and progressive sampling for scalability. Additionally, the framework leverages a diverse set of external tools and can reuse previously designed agents, reducing redundancy and accelerating pipeline construction. We demonstrate ADP-MA through an interactive demo that showcases pipeline construction, execution monitoring, and adaptive refinement across representative data processing tasks.

[710] SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Prediction?

Yueyi Yang, Haotian Liu, Fang Kang, Mengqi Zhang, Zheng Lian, Hao Tang, Haoyu Chen

Main category: cs.AI

TL;DR: The paper introduces SayNext-Bench, a benchmark for evaluating LLMs and MLLMs on predicting human next utterances using multimodal cues like gestures, gaze, and emotional tone, and proposes SayNext-Chat, a dual-route prediction MLLM that outperforms state-of-the-art models.

Details

Motivation: Current LLMs struggle to predict human next utterances in dialogue despite their conversational abilities, while humans can anticipate responses using multimodal cues. The authors aim to explore whether LLMs/MLLMs can reproduce this predictive ability and emphasize the importance of multimodal cues and predictive processing for natural human interaction.

Method: 1) Created SayNext-Bench benchmark for evaluating LLMs/MLLMs on context-conditioned response prediction from multimodal cues; 2) Built SayNext-PC dataset with dialogues containing rich multimodal cues; 3) Developed SayNext-Chat, a dual-route prediction MLLM with cognitively inspired design for predictive processing in conversation.

Result: SayNext-Chat outperforms state-of-the-art MLLMs in lexical overlap, semantic similarity, and emotion consistency. The results demonstrate feasibility of next-utterance prediction with LLMs from multimodal cues and highlight the importance of multimodal information and predictive processing.

Conclusion: Multimodal cues are indispensable for natural human interaction, and actively predictive processing is missing in current MLLMs. The work offers a new research direction toward more human-like, context-sensitive AI interaction for human-centered AI.

Abstract: We explore the use of large language models (LLMs) for next-utterance prediction in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to predict a human speaker’s next utterance. Instead, humans can readily anticipate forthcoming utterances based on multimodal cues, such as gestures, gaze, and emotional tone, from the context. To systematically examine whether LLMs can reproduce this ability, we propose SayNext-Bench, a benchmark that evaluates LLMs and Multimodal LLMs (MLLMs) on anticipating context-conditioned responses from multimodal cues spanning a variety of real-world scenarios. To support this benchmark, we build SayNext-PC, a novel large-scale dataset containing dialogues with rich multimodal cues. Building on this, we further develop a dual-route prediction MLLM, SayNext-Chat, that incorporates cognitively inspired design to emulate predictive processing in conversation. Experimental results demonstrate that our model outperforms state-of-the-art MLLMs in terms of lexical overlap, semantic similarity, and emotion consistency. Our results prove the feasibility of next-utterance prediction with LLMs from multimodal cues and emphasize the (i) indispensable role of multimodal cues and (ii) actively predictive processing as the foundation of natural human interaction, which is missing in current MLLMs. We hope that this exploration offers a new research entry toward more human-like, context-sensitive AI interaction for human-centered AI. Our benchmark and model can be accessed at https://saynext.github.io/.

[711] MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants

Yihe Zhang, Cheyenne N Mohawk, Kaiying Han, Vijay Srinivas Tida, Manyu Li, Xiali Hei

Main category: cs.AI

TL;DR: MHDash is an open-source platform for developing, evaluating, and auditing AI systems in mental health applications, focusing on multi-turn dialogue analysis and risk-aware evaluation beyond aggregate metrics.

Details

Motivation: Existing evaluations of LLMs in mental health support primarily rely on aggregate performance metrics that obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions, which is insufficient for safety-critical applications.

Method: Developed MHDash platform integrating data collection, structured annotation (Concern Type, Risk Level, Dialogue Intent), multi-turn dialogue generation, and baseline evaluation into a unified pipeline for fine-grained, risk-aware analysis.

Result: Key findings: (1) simple baselines and advanced LLM APIs show comparable overall accuracy but diverge significantly on high-risk cases; (2) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification; (3) performance gaps amplify in multi-turn dialogues where risk signals emerge gradually.

Conclusion: Conventional benchmarks are insufficient for safety-critical mental health settings. MHDash promotes reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support through open platform release.

Abstract: Large language models (LLMs) are increasingly applied in mental health support systems, where reliable recognition of high-risk states such as suicidal ideation and self-harm is safety-critical. However, existing evaluations primarily rely on aggregate performance metrics, which often obscure risk-specific failure modes and provide limited insight into model behavior in realistic, multi-turn interactions. We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. MHDash integrates data collection, structured annotation, multi-turn dialogue generation, and baseline evaluation into a unified pipeline. The platform supports annotations across multiple dimensions, including Concern Type, Risk Level, and Dialogue Intent, enabling fine-grained and risk-aware analysis. Our results reveal several key findings: (i) simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases; (ii) some LLMs maintain consistent ordinal severity ranking while failing absolute risk classification, whereas others achieve reasonable aggregate scores but suffer from high false negative rates on severe categories; and (iii) performance gaps are amplified in multi-turn dialogues, where risk signals emerge gradually. These observations demonstrate that conventional benchmarks are insufficient for safety-critical mental health settings. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.

[712] Position: Agentic Evolution is the Path to Evolving LLMs

Minhua Lin, Hanqing Lu, Zhan Shi, Bing He, Rui Mao, Zhiwei Zhang, Zongyu Wu, Xianfeng Tang, Hui Liu, Zhenwei Dai, Xiang Zhang, Suhang Wang, Benoit Dumoulin, Jian Pei

Main category: cs.AI

TL;DR: A-Evolve proposes agentic evolution as a new scaling axis for LLM adaptation, treating deployment-time improvement as goal-directed optimization over persistent system state to address the train-deploy gap.

Details

Motivation: Static LLM training cannot keep pace with continual deployment environment changes, creating a train-deploy gap that scaling compute alone cannot solve. Existing adaptation methods lack strategic agency for diagnosing failures and producing durable improvements.

Method: Proposes A-Evolve framework that treats deployment-time improvement as deliberate, goal-directed optimization process over persistent system state. Introduces agentic evolution where evolution itself becomes an autonomous evolver agent rather than a fixed pipeline.

Result: Presents the evolution-scaling hypothesis: adaptation capacity scales with compute allocated to evolution, positioning agentic evolution as a scalable path toward sustained, open-ended adaptation in real-world environments.

Conclusion: Agentic evolution represents the inevitable future of LLM adaptation, offering a new scaling axis beyond training and inference compute to address the fundamental limitation of static training in dynamic real-world environments.

Abstract: As Large Language Models (LLMs) move from curated training sets into open-ended real-world environments, a fundamental limitation emerges: static training cannot keep pace with continual deployment environment change. Scaling training-time and inference-time compute improves static capability but does not close this train-deploy gap. We argue that addressing this limitation requires a new scaling axis-evolution. Existing deployment-time adaptation methods, whether parametric fine-tuning or heuristic memory accumulation, lack the strategic agency needed to diagnose failures and produce durable improvements. Our position is that agentic evolution represents the inevitable future of LLM adaptation, elevating evolution itself from a fixed pipeline to an autonomous evolver agent. We instantiate this vision in a general framework, A-Evolve, which treats deployment-time improvement as a deliberate, goal-directed optimization process over persistent system state. We further propose the evolution-scaling hypothesis: the capacity for adaptation scales with the compute allocated to evolution, positioning agentic evolution as a scalable path toward sustained, open-ended adaptation in the real world.

[713] Digital Simulations to Enhance Military Medical Evacuation Decision-Making

Jeremy Fischer, Mahdi Al-Husseini, Ram Krishnamoorthy, Vishal Kumar, Mykel J. Kochenderfer

Main category: cs.AI

TL;DR: MEWI is a 3D multiplayer medical evacuation simulation for military training that models battlefield constraints and patient flow through medical facilities.

Details

Motivation: There's a lack of classroom simulation tools for medical evacuation planning that can evaluate both offline planning and online decision-making for battlefield casualty management.

Method: Developed a 3D multiplayer simulation in Unity that models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms with two operational scenarios (Pacific island assault and Eurasian conflict).

Result: MEWI participation substantially improves uptake of medical evacuation lessons learned and cooperative decision-making, as shown by performance data from Army training courses and Likert survey data.

Conclusion: MEWI represents a significant advancement in high-fidelity medical evacuation training tools and offers critical insights for improving medical evacuation education and operations.

Abstract: Medical evacuation is one of the United States Army’s most storied and critical mission sets, responsible for efficiently and expediently evacuating the battlefield ill and injured. Medical evacuation planning involves designing a robust network of medical platforms and facilities capable of moving and treating large numbers of casualties. Until now, there has not been a medium to simulate these networks in a classroom setting and evaluate both offline planning and online decision-making performance. This work describes the Medical Evacuation Wargaming Initiative (MEWI), a three-dimensional multiplayer simulation developed in Unity that replicates battlefield constraints and uncertainties. MEWI accurately models patient interactions at casualty collection points, ambulance exchange points, medical treatment facilities, and evacuation platforms. Two operational scenarios are introduced: an amphibious island assault in the Pacific and a Eurasian conflict across a sprawling road and river network. These scenarios pit students against the clock to save as many casualties as possible while adhering to doctrinal lessons learned during didactic training. We visualize performance data collected from two iterations of the MEWI Pacific scenario executed in the United States Army’s Medical Evacuation Doctrine Course. We consider post-wargame Likert survey data from student participants and external observer notes to identify key planning decision points, document medical evacuation lessons learned, and quantify general utility. Results indicate that MEWI participation substantially improves uptake of medical evacuation lessons learned and co-operative decision-making. MEWI is a substantial step forward in the field of high-fidelity training tools for medical education, and our study findings offer critical insights into improving medical evacuation education and operations across the joint force.

[714] OpenGuanDan: A Large-Scale Imperfect Information Game Benchmark

Chao Li, Shangdong Yang, Chiheng Zhan, Zhenxing Ge, Yujing Hu, Bingkun Bao, Xingguo Chen, Yang Gao

Main category: cs.AI

TL;DR: OpenGuanDan is a new benchmark for evaluating AI agents in GuanDan, a complex four-player Chinese card game with imperfect information, mixed cooperation/competition, and dynamic teams.

Details

Motivation: The paper addresses the need for more challenging benchmarks to drive AI research, particularly in intelligent decision-making. Current benchmarks have shown progress but lack the complexity needed to push the field forward, especially in multi-agent settings with mixed objectives.

Method: The authors develop OpenGuanDan, a benchmark that provides efficient simulation of GuanDan and comprehensive evaluation of both learning-based and rule-based AI agents. It features an independent API for each player that supports human-AI interactions and LLM integration.

Result: Experimental evaluations show that learning-based agents significantly outperform rule-based counterparts but still fall short of superhuman performance. The benchmark successfully highlights the challenges of multi-agent decision-making in complex environments.

Conclusion: OpenGuanDan serves as a demanding testbed for intelligent decision-making methods, revealing current limitations and the need for continued research in multi-agent AI, particularly in environments with imperfect information and mixed cooperation/competition objectives.

Abstract: The advancement of data-driven artificial intelligence (AI), particularly machine learning, heavily depends on large-scale benchmarks. Despite remarkable progress across domains ranging from pattern recognition to intelligent decision-making in recent decades, exemplified by breakthroughs in board games, card games, and electronic sports games, there remains a pressing need for more challenging benchmarks to drive further research. To this end, this paper proposes OpenGuanDan, a novel benchmark that enables both efficient simulation of GuanDan (a popular four-player, multi-round Chinese card game) and comprehensive evaluation of both learning-based and rule-based GuanDan AI agents. OpenGuanDan poses a suite of nontrivial challenges, including imperfect information, large-scale information set and action spaces, a mixed learning objective involving cooperation and competition, long-horizon decision-making, variable action spaces, and dynamic team composition. These characteristics make it a demanding testbed for existing intelligent decision-making methods. Moreover, the independent API for each player allows human-AI interactions and supports integration with large language models. Empirically, we conduct two types of evaluations: (1) pairwise competitions among all GuanDan AI agents, and (2) human-AI matchups. Experimental results demonstrate that while current learning-based agents substantially outperform rule-based counterparts, they still fall short of achieving superhuman performance, underscoring the need for continued research in multi-agent intelligent decision-making domain. The project is publicly available at https://github.com/GameAI-NJUPT/OpenGuanDan.

[715] POET: Protocol Optimization via Eligibility Tuning

Trisha Das, Katherine Kero, Dorinda Schumann, Tracy Ohrt, Sanjit Singh Batra, Gregory D Lyng, Robert E. Tillman

Main category: cs.AI

TL;DR: A guided generation framework using semantic axes to assist clinicians in drafting clinical trial eligibility criteria, with a rubric-based evaluation system.

Details

Motivation: Drafting clinical trial eligibility criteria is time-intensive and cognitively demanding for clinicians. Existing automated approaches either require highly structured inputs or produce full criteria from minimal input, limiting practical utility.

Method: Proposes a guided generation framework using interpretable semantic axes (Demographics, Laboratory Parameters, Behavioral Factors) derived from large language models to steer eligibility criteria generation. Also presents a reusable rubric-based evaluation framework.

Result: The guided generation approach consistently outperforms unguided generation in automatic, rubric-based, and clinician evaluations.

Conclusion: The framework offers a practical and interpretable solution for AI-assisted clinical trial design by providing a middle ground between specificity and usability.

Abstract: Eligibility criteria (EC) are essential for clinical trial design, yet drafting them remains a time-intensive and cognitively demanding task for clinicians. Existing automated approaches often fall at two extremes either requiring highly structured inputs, such as predefined entities to generate specific criteria, or relying on end-to-end systems that produce full eligibility criteria from minimal input such as trial descriptions limiting their practical utility. In this work, we propose a guided generation framework that introduces interpretable semantic axes, such as Demographics, Laboratory Parameters, and Behavioral Factors, to steer EC generation. These axes, derived using large language models, offer a middle ground between specificity and usability, enabling clinicians to guide generation without specifying exact entities. In addition, we present a reusable rubric-based evaluation framework that assesses generated criteria along clinically meaningful dimensions. Our results show that our guided generation approach consistently outperforms unguided generation in both automatic, rubric-based and clinician evaluations, offering a practical and interpretable solution for AI-assisted trial design.

[716] Persuasion Propagation in LLM Agents

Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian

Main category: cs.AI

TL;DR: Study examines how belief-level persuasion affects AI agent behavior in long-horizon tasks like web research and coding, finding that persuasion before task execution significantly reduces search activity compared to on-the-fly persuasion.

Details

Motivation: As AI agents increasingly combine conversation with autonomous task execution, there's a need to understand how user persuasion affects their downstream task behavior, particularly in long-horizon tasks like coding and web research.

Method: Introduces a behavior-centered evaluation framework distinguishing between persuasion applied during vs. prior to task execution. Tests across web research and coding tasks with belief-level interventions.

Result: On-the-fly persuasion shows weak/inconsistent effects. However, belief-prefilled agents (persuasion before task) conduct 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents.

Conclusion: Persuasion, even in prior interaction, can significantly affect agent behavior, motivating behavior-level evaluation in agentic systems to understand how belief interventions propagate to task execution.

Abstract: Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: what happens when an agent engaged in long-horizon tasks is subjected to user persuasion? We study how belief-level intervention can influence downstream task behavior, a phenomenon we name \emph{persuasion propagation}. We introduce a behavior-centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on-the-fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief-prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent’s behavior, motivating behavior-level evaluation in agentic systems.

[717] KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen

Main category: cs.AI

TL;DR: KEPO is a reinforcement learning framework for vision-language models that uses quality-gated distillation and knowledge-enhanced exploration to improve reasoning in visual question answering tasks.

Details

Motivation: Current RL methods for vision-language models struggle with sparse rewards and exploration failures in reasoning tasks, while uniform distillation approaches apply teacher supervision even to flawed trajectories, injecting noisy gradients.

Method: Proposes Knowledge-Enhanced Preference Optimization (KEPO) with two components: 1) quality-gated on-policy distillation that only applies teacher guidance to high-quality trajectories, and 2) knowledge-enhanced exploration that uses teacher hints to sample reward-positive trajectories for RL.

Result: Evaluated on medical visual question answering benchmark under single-source generalization, KEPO shows improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance compared to RL and on-policy distillation baselines.

Conclusion: KEPO provides an effective unified post-training framework that addresses exploration collapse and noisy gradient issues in reasoning-intensive vision-language tasks through selective distillation and knowledge-guided exploration.

Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a ``learning cliff.’’ Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejectively sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.

[718] RobustDebias: Debiasing Language Models using Distributionally Robust Optimization

Deep Gandhi, Katyani Singh, Nidhi Hegde

Main category: cs.AI

TL;DR: RobustDebias uses Distributionally Robust Optimization during fine-tuning to mitigate social biases in pretrained language models without degrading performance.

Details

Motivation: Pretrained language models exhibit biases and stereotypes, and existing debiasing methods are not scalable for large models. Fine-tuning often amplifies biases present in training data, so there's a need for effective debiasing during fine-tuning rather than costly pretraining.

Method: Proposes RobustDebias, which adapts Distributionally Robust Optimization (DRO) to debias language models during fine-tuning. The approach debiases models across multiple demographics during MLM fine-tuning and generalizes to any dataset or task.

Result: Extensive experiments on various language models show significant bias mitigation with minimal performance impact.

Conclusion: RobustDebias effectively addresses bias amplification during fine-tuning, providing a scalable solution for debiasing large language models without compromising downstream task performance.

Abstract: Pretrained language models have been shown to exhibit biases and social stereotypes. Prior work on debiasing these models has largely focused on modifying embedding spaces during pretraining, which is not scalable for large models. Fine-tuning pretrained models on task-specific datasets can both degrade model performance and amplify biases present in the fine-tuning data. We address bias amplification during fine-tuning rather than costly pretraining, focusing on BERT models due to their widespread use in language understanding tasks. While Empirical Risk Minimization effectively optimizes downstream performance, it often amplifies social biases during fine-tuning. To counter this, we propose \textit{RobustDebias}, a novel mechanism which adapts Distributionally Robust Optimization (DRO) to debias language models during fine-tuning. Our approach debiases models across multiple demographics during MLM fine-tuning and generalizes to any dataset or task. Extensive experiments on various language models show significant bias mitigation with minimal performance impact.

[719] MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen, Zhida He, Han Qi, Ziyu Wan, Zhongtian Ma, Ying Wen, Tianhang Zheng, Xingcheng Xu, Chaochao Lu, Qiaosheng Zhang

Main category: cs.AI

TL;DR: MAGIC is a multi-agent RL framework for LLM safety alignment that uses adversarial co-evolution between attacker and defender agents to improve robustness against novel attack strategies.

Details

Motivation: Existing LLM safety defenses rely on static data distributions and cannot keep up with evolving adversarial attacks, creating a need for dynamic, adaptive safety alignment methods.

Method: Multi-turn multi-agent reinforcement learning framework where an attacker agent learns to rewrite queries into deceptive prompts while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs, creating adversarial co-evolution.

Result: The framework achieves superior defense success rates without compromising model helpfulness, with the attacker evolving novel combinatorial strategies through iterative RL training.

Conclusion: MAGIC provides a dynamic approach to LLM safety alignment through adversarial co-evolution, offering theoretical safety guarantees and practical effectiveness against evolving attacks.

Abstract: Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their \textbf{reliance on static, pre-collected data distributions}. In this paper, we introduce \textbf{MAGIC}, a novel multi-turn multi-agent reinforcement learning framework that formulates LLM safety alignment as an adversarial asymmetric game. Specifically, an attacker agent learns to iteratively rewrite original queries into deceptive prompts, while a defender agent simultaneously optimizes its policy to recognize and refuse such inputs. This dynamic process triggers a \textbf{co-evolution}, where the attacker’s ever-changing strategies continuously uncover long-tail vulnerabilities, driving the defender to generalize to unseen attack patterns. Remarkably, we observe that the attacker, endowed with initial reasoning ability, evolves \textbf{novel, previously unseen combinatorial strategies} through iterative RL training, underscoring our method’s substantial potential. Theoretically, we provide insights into a more robust game equilibrium and derive safety guarantees. Extensive experiments validate our framework’s effectiveness, demonstrating superior defense success rates without compromising the helpfulness of the model. Our code is available at https://github.com/BattleWen/MAGIC.

[720] PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Multimodal Agents

Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Ziyan Weng, Yingwei Zhang

Main category: cs.AI

TL;DR: PolarMem is a training-free polarized latent graph memory system that transforms fuzzy perceptual likelihoods into discrete logical constraints with explicit negative knowledge representation to prevent hallucinations in multimodal agents.

Details

Motivation: Current multimodal agents lack memory systems with logical verifiability, as probabilistic vision-language models conflate semantic affinity with factual existence and fail to encode negative constraints, leading to hallucinations and unreliable reasoning.

Method: PolarMem uses non-parametric distributional partitioning to transform perceptual likelihoods into discrete logical constraints, employs a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation, and enforces logic-dominant retrieval at inference time to suppress hallucinatory patterns.

Result: Extensive evaluation across eight frozen Vision-Language Models and six benchmarks shows PolarMem functions as a robust cognitive system, establishing a foundation for verifiable multimodal agents.

Conclusion: PolarMem addresses the fundamental limitation of epistemic asymmetry in current multimodal architectures by providing a memory system with logical verifiability and explicit negative constraint encoding, enabling more reliable multimodal reasoning.

Abstract: As multimodal agents evolve from passive observers to long-horizon decision-makers, they require memory systems that provide not just information availability but logical verifiability. A fundamental limitation of current architectures is the epistemic asymmetry inherent in probabilistic vision-language models and dense associative memories: they conflate semantic affinity with factual existence and structurally fail to encode negative constraints. To this end, we introduce PolarMem, a training-free Polarized Latent Graph Memory designed to ground agent reasoning in verifiable evidence. PolarMem transforms fuzzy perceptual likelihoods into discrete logical constraints through non-parametric distributional partitioning. Furthermore, it employs a polarized graph topology with orthogonal inhibitory connections to explicitly store verified negation as a primary cognitive state. At inference time, we enforce a logic-dominant retrieval paradigm, suppressing hallucinatory patterns that violate negative constraints. Extensive evaluation across eight frozen Vision–Language Models and six benchmarks demonstrates that PolarMem functions as a robust cognitive system, establishing a foundation for verifiable multimodal agents. Our code is available at https://github.com/czs-ict/PolarMem.

[721] Do Latent-CoT Models Think Step-by-Step? A Mechanistic Study on Sequential Reasoning Tasks

Jia Liang, Liangming Pan

Main category: cs.AI

TL;DR: Latent-CoT mechanisms analyzed through CODI model on sequential polynomial-iteration tasks, revealing when it performs faithful iterative computation vs. compressed shortcuts.

Details

Motivation: To understand the mechanisms of Latent Chain-of-Thought (Latent-CoT) models that perform step-by-step computation without emitting long rationales, particularly examining when they yield faithful iterative computation versus compressed or shortcut strategies.

Method: Study CODI, a continuous-thought teacher-student distillation model, on strictly sequential polynomial-iteration tasks using logit-lens decoding, linear probes, attention analysis, and activation patching to localize intermediate-state representations and trace their routing.

Result: On short tasks (2-3 hops), CODI forms full bridge states decodable across latent-thought positions with late fusion at boundaries. For longer hops, it exhibits partial latent reasoning focusing on late intermediates fused with last input at answer readout. Ablations show pathway collapse under regime shifts.

Conclusion: Latent-CoT can yield faithful iterative computation for short sequences but resorts to compressed/shortcut strategies for longer ones, highlighting challenges in designing robust latent-CoT objectives for sequential reasoning.

Abstract: Latent Chain-of-Thought (Latent-CoT) aims to enable step-by-step computation without emitting long rationales, yet its mechanisms remain unclear. We study CODI, a continuous-thought teacher-student distillation model, on strictly sequential polynomial-iteration tasks. Using logit-lens decoding, linear probes, attention analysis, and activation patching, we localize intermediate-state representations and trace their routing to the final readout. On two- and three-hop tasks, CODI forms the full set of bridge states that become decodable across latent-thought positions, while the final input follows a separate near-direct route; predictions arise via late fusion at the end-of-thought boundary. For longer hop lengths, CODI does not reliably execute a full latent rollout, instead exhibiting a partial latent reasoning path that concentrates on late intermediates and fuses them with the last input at the answer readout position. Ablations show that this partial pathway can collapse under regime shifts, including harder optimization. Overall, we delineate when CODI-style latent-CoT yields faithful iterative computation versus compressed or shortcut strategies, and highlight challenges in designing robust latent-CoT objectives for sequential reasoning.

[722] ROMA: Recursive Open Meta-Agent Framework for Long-Horizon Multi-Agent Systems

Salaheddin Alzu’bi, Baran Nama, Arda Kaz, Anushri Eswaran, Weiyuan Chen, Sarvesh Khetan, Rishab Bala, Tu Vu, Sewoong Oh

Main category: cs.AI

TL;DR: ROMA is a recursive, modular agent framework for long-horizon tasks that decomposes goals into parallelizable subtask trees and aggregates results to manage context growth, achieving state-of-the-art performance on reasoning and long-form generation benchmarks.

Details

Motivation: Current agentic frameworks struggle with long-horizon tasks due to brittle sequential orchestration, context window limitations that degrade performance with increased reasoning depth, and opaque execution traces that make debugging difficult.

Method: ROMA uses recursive task decomposition into dependency-aware subtask trees executed in parallel, with aggregation to compress and validate intermediate results. It standardizes agent construction around four modular roles: Atomizer (decomposition decision), Planner, Executor, and Aggregator. GEPA+ is introduced as a genetic-Pareto prompt proposer for adapting ROMA to specific tasks without fine-tuning.

Result: ROMA with GEPA+ delivers leading system-level performance: on SEAL-0 (reasoning over conflicting web evidence), ROMA with GLM-4.6 improves accuracy by 9.9% over Kimi-Researcher; on EQ-Bench (long-form writing), ROMA enables DeepSeek-V3 to match performance of leading closed-source models like Claude Sonnet 4.5.

Conclusion: Recursive, modular agent architectures like ROMA can scale reasoning depth while remaining interpretable, flexible, and model-agnostic, addressing key limitations of current agentic frameworks for long-horizon tasks.

Abstract: Current agentic frameworks underperform on long-horizon tasks. As reasoning depth increases, sequential orchestration becomes brittle, context windows impose hard limits that degrade performance, and opaque execution traces make failures difficult to localize or debug. We introduce ROMA (Recursive Open Meta-Agents), a domain-agnostic framework that addresses these limitations through recursive task decomposition and structured aggregation. ROMA decomposes goals into dependency-aware subtask trees that can be executed in parallel, while aggregation compresses and validates intermediate results to control context growth. Our framework standardizes agent construction around four modular roles –Atomizer (which decides whether a task should be decomposed), Planner, Executor, and Aggregator – which cleanly separate orchestration from model selection and enable transparent, hierarchical execution traces. This design supports heterogeneous multi-agent systems that mix models and tools according to cost, latency, and capability. To adapt ROMA to specific tasks without fine-tuning, we further introduce GEPA$+$, an improved Genetic-Pareto prompt proposer that searches over prompts within ROMA’s component hierarchy while preserving interface contracts. We show that ROMA, combined with GEPA+, delivers leading system-level performance on reasoning and long-form generation benchmarks. On SEAL-0, which evaluates reasoning over conflicting web evidence, ROMA instantiated with GLM-4.6 improves accuracy by 9.9% over Kimi-Researcher. On EQ-Bench, a long-form writing benchmark, ROMA enables DeepSeek-V3 to match the performance of leading closed-source models such as Claude Sonnet 4.5. Our results demonstrate that recursive, modular agent architectures can scale reasoning depth while remaining interpretable, flexible, and model-agnostic.

Jing Wu, Yue Sun, Tianpei Xie, Suiyao Chen, Jingyuan Bao, Yaopengxiao Xu, Gaoyuan Du, Inseok Heo, Alexander Gutfraind, Xin Wang

Main category: cs.AI

TL;DR: DebateOCR: A cross-modal compression framework that replaces long textual debate histories with compact image representations to reduce token usage by over 92% while maintaining reasoning quality.

Details

Motivation: Multi-agent debate improves reasoning but incurs rapidly growing context as debate rounds and agent count increase, leading to token usage that can exceed context limits and require repeated summarization with overhead and information loss.

Method: Introduces DebateOCR, a cross-modal compression framework that replaces long textual debate traces with compact image representations, which are then consumed through a dedicated vision encoder to condition subsequent rounds.

Result: Compresses histories spanning tens to hundreds of thousands of tokens, cutting input tokens by more than 92%, yielding substantially lower compute cost and faster inference across multiple benchmarks.

Conclusion: Theoretical perspective shows diversity across agents supports recovery of omitted information: aggregating multiple agents’ compressed views allows collective representation to approach the information bottleneck with exponentially high probability.

Abstract: Multi-agent debate can improve reasoning quality and reduce hallucinations, but it incurs rapidly growing context as debate rounds and agent count increase. Retaining full textual histories leads to token usage that can exceed context limits and often requires repeated summarization, adding overhead and compounding information loss. We introduce DebateOCR, a cross-modal compression framework that replaces long textual debate traces with compact image representations, which are then consumed through a dedicated vision encoder to condition subsequent rounds. This design compresses histories that commonly span tens to hundreds of thousands of tokens, cutting input tokens by more than 92% and yielding substantially lower compute cost and faster inference across multiple benchmarks. We further provide a theoretical perspective showing that diversity across agents supports recovery of omitted information: although any single compressed history may discard details, aggregating multiple agents’ compressed views allows the collective representation to approach the information bottleneck with exponentially high probability.

[724] Benchmarking Agents in Insurance Underwriting Environments

Amanda Dsouza, Ramya Ramakrishnan, Charles Dickens, Bhavishya Pohani, Christopher M Glaze

Main category: cs.AI

TL;DR: UNDERWRITE is an expert-designed benchmark for evaluating AI agents in enterprise insurance underwriting, focusing on real-world complexity, proprietary knowledge, noisy tools, and imperfect user simulations.

Details

Motivation: Existing AI agent benchmarks are inadequate for enterprise applications - they overemphasize open domains like code, use narrow accuracy metrics, and lack authentic complexity needed for real-world operations.

Method: Developed UNDERWRITE benchmark through close collaboration with domain experts, incorporating proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering.

Result: Evaluation of 13 frontier models revealed significant gaps: most accurate models aren’t most efficient, models hallucinate domain knowledge despite tool access, and pass^k results show 20% performance drop.

Conclusion: Expert involvement in benchmark design is essential for realistic agent evaluation, common agentic frameworks exhibit brittleness, and hallucination detection in specialized domains requires compositional approaches.

Abstract: As AI agents integrate into enterprise applications, their evaluation demands benchmarks that reflect the complexity of real-world operations. Instead, existing benchmarks overemphasize open-domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts to capture real-world enterprise challenges. UNDERWRITE introduces critical realism factors often absent in current benchmarks: proprietary business knowledge, noisy tool interfaces, and imperfect simulated users requiring careful information gathering. Evaluating 13 frontier models, we uncover significant gaps between research lab performance and enterprise readiness: the most accurate models are not the most efficient, models hallucinate domain knowledge despite tool access, and pass^k results show a 20% drop in performance. The results from UNDERWRITE demonstrate that expert involvement in benchmark design is essential for realistic agent evaluation, common agentic frameworks exhibit brittleness that skews performance reporting, and hallucination detection in specialized domains demands compositional approaches. Our work provides insights for developing benchmarks that better align with enterprise deployment requirements.

[725] Dual Latent Memory for Visual Multi-agent System

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

Main category: cs.AI

TL;DR: L²-VMAS is a novel framework for Visual Multi-Agent Systems that addresses the “scaling wall” problem by enabling inter-agent collaboration through dual latent memories, decoupling perception and thinking, and using entropy-driven proactive triggering to reduce token costs while improving performance.

Details

Motivation: The paper addresses the counter-intuitive "scaling wall" problem in Visual Multi-Agent Systems where increasing agent turns degrades performance while exponentially increasing token costs. This failure is attributed to the information bottleneck in text-centric communication where converting perceptual and thinking trajectories into natural language causes semantic loss.

Method: Proposes L²-VMAS, a model-agnostic framework with: 1) Dual latent memories for inter-agent collaboration, 2) Decoupling of perception and thinking with dynamic synthesis of dual latent memories, 3) Entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access.

Result: Extensive experiments across different backbones, sizes, and multi-agent structures show the method effectively breaks the “scaling wall” with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.

Conclusion: L²-VMAS successfully addresses the scaling limitations of Visual Multi-Agent Systems by introducing latent memory-based collaboration mechanisms that overcome the information bottleneck of text-centric communication, enabling more efficient and effective multi-agent visual understanding.

Abstract: While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive “scaling wall”: increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L$^{2}$-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments among backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the “scaling wall” with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%. Codes: https://github.com/YU-deep/L2-VMAS.

[726] Context Learning for Multi-Agent Discussion

Xingyuan Hua, Sheng Yue, Xinyi Li, Yizhe Zhao, Jinrui Zhang, Ju Ren

Main category: cs.AI

TL;DR: M2CL introduces multi-LLM context learning with dynamic context generators to solve discussion inconsistency in multi-agent systems, improving performance by 20-50% on challenging tasks.

Details

Motivation: Current multi-agent discussion (MAD) methods suffer from discussion inconsistency where LLMs fail to reach coherent solutions due to misalignment between individual contexts, preventing effective collaborative problem-solving.

Method: M2CL learns a context generator for each agent that dynamically generates context instructions per discussion round via automatic information organization and refinement, using a self-adaptive mechanism to control context coherence and output discrepancies.

Result: M2CL significantly surpasses existing methods by 20-50% on challenging tasks including academic reasoning, embodied tasks, and mobile control, while showing favorable transferability and computational efficiency.

Conclusion: The proposed M2CL framework effectively addresses discussion inconsistency in multi-agent systems through learned context generators, enabling LLMs to avoid premature convergence and progressively reach correct consensus.

Abstract: Multi-Agent Discussion (MAD) has garnered increasing attention very recently, where multiple LLM instances collaboratively solve problems via structured discussion. However, we find that current MAD methods easily suffer from discussion inconsistency, LLMs fail to reach a coherent solution, due to the misalignment between their individual contexts.In this paper, we introduce a multi-LLM context learning method (M2CL) that learns a context generator for each agent, capable of dynamically generating context instructions per discussion round via automatic information organization and refinement. Specifically, inspired by our theoretical insights on the context instruction, M2CL train the generators to control context coherence and output discrepancies via a carefully crafted self-adaptive mechanism.It enables LLMs to avoid premature convergence on majority noise and progressively reach the correct consensus. We evaluate M2CL on challenging tasks, including academic reasoning, embodied tasks, and mobile control. The results show that the performance of M2CL significantly surpasses existing methods by 20%–50%, while enjoying favorable transferability and computational efficiency.

[727] Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng

Main category: cs.AI

TL;DR: MoR: Federated alignment framework using GRPO with Mixture-of-Rewards for heterogeneous Vision-Language Models, enabling privacy-preserving training without sharing raw data.

Details

Motivation: VLMs need privacy-preserving training for sensitive domains like healthcare/finance, but federated learning faces challenges with client heterogeneity in resources, requirements, and architectures. Current FL replaces data with parameters, but future should replace parameters with preferences for better scalability and privacy.

Method: 1) Initialize visual foundation model as KL-regularized reference; 2) Each client trains local reward model from preference annotations without exposing raw data; 3) Routing-based fusion mechanism adaptively aggregates client reward signals; 4) Server performs GRPO with mixed reward to optimize base VLM.

Result: Experiments on three public VQA benchmarks show MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability.

Conclusion: MoR provides scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings, advancing FL from parameter-sharing to preference-based alignment.

Abstract: VLMs have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. FL mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes the present of FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework based on GRPO with Mixture-of-Rewards for heterogeneous VLMs. MoR initializes a visual foundation model as a KL-regularized reference, while each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.

[728] PCBSchemaGen: Constraint-Guided Schematic Design via LLM for Printed Circuit Boards (PCB)

Huanghaohe Zou, Peng Han, Emad Nazerian, Alex Q. Huang

Main category: cs.AI

TL;DR: PCBSchemaGen is a training-free framework for automated PCB schematic design using LLM agents and constraint-guided synthesis to handle heterogeneous digital, analog, and power signals with real-world IC package constraints.

Details

Motivation: PCB schematic design is essential in electronics but automated design remains unexplored due to lack of open-source data and absence of simulation-based verification. Prior works focus only on digital or analog circuits alone, while PCB design must handle heterogeneous signals with real-world IC package constraints.

Method: PCBSchemaGen uses LLM-based code generation with iterative feedback using domain-specific prompts, plus a verification framework leveraging a real-world IC datasheet derived Knowledge Graph and Subgraph Isomorphism to encode pin-role semantics and topological constraints.

Result: The framework was tested on 23 PCB schematic tasks spanning digital, analog, and power domains, demonstrating significant improvements in design accuracy and computational efficiency.

Conclusion: PCBSchemaGen is the first training-free framework for PCB schematic design that successfully handles heterogeneous circuit types with real-world constraints, addressing the previously unexplored area of automated PCB design.

Abstract: Printed Circuit Board (PCB) schematic design plays an essential role in all areas of electronic industries. Unlike prior works that focus on digital or analog circuits alone, PCB design must handle heterogeneous digital, analog, and power signals while adhering to real-world IC packages and pin constraints. Automated PCB schematic design remains unexplored due to the scarcity of open-source data and the absence of simulation-based verification. We introduce PCBSchemaGen, the first training-free framework for PCB schematic design that comprises LLM agent and Constraint-guided synthesis. Our approach makes three contributions: 1. an LLM-based code generation paradigm with iterative feedback with domain-specific prompts. 2. a verification framework leveraging a real-world IC datasheet derived Knowledge Graph (KG) and Subgraph Isomorphism encoding pin-role semantics and topological constraints. 3. an extensive experiment on 23 PCB schematic tasks spanning digital, analog, and power domains. Results demonstrate that PCBSchemaGen significantly improves design accuracy and computational efficiency.

[729] Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

Main category: cs.AI

TL;DR: A diagnostic framework using Item Response Theory to assess reliability of LLM-as-a-Judge systems along two dimensions: intrinsic consistency and human alignment.

Details

Motivation: Existing validation practices for LLM-as-a-Judge primarily focus on observed outputs without examining whether LLM judges function as stable and reliable measurement instruments, creating a need for systematic reliability assessment.

Method: Two-phase diagnostic framework grounded in Item Response Theory (IRT), specifically using Graded Response Model (GRM). Formalizes reliability along intrinsic consistency (stability under prompt variations) and human alignment (correspondence with human assessments).

Result: The framework yields interpretable signals for systematically diagnosing LLM judgments, providing practical guidance for verifying reliability and identifying causes of unreliability in diverse LLM judges.

Conclusion: IRT-based diagnostic framework offers a systematic approach to assess LLM-as-a-Judge reliability, addressing limitations of current validation practices and providing actionable insights for improving automated evaluation systems.

Abstract: While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework, and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying reliablity of LLM-as-a-Judge and identifying potential causes of unreliability.

[730] How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use

Minhua Lin, Enyan Dai, Hui Liu, Xianfeng Tang, Yuliang Yan, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Fali Wang, Hongcheng Gao, Chen Luo, Xiang Zhang, Qi He, Suhang Wang

Main category: cs.AI

TL;DR: LLMs struggle with game-theoretic reasoning in poker; ToolPoker framework integrates external solvers to achieve state-of-the-art gameplay with principled reasoning.

Details

Motivation: As LLMs are applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed requiring both strong actions and game-theoretic reasoning.

Method: Systematic study of LLMs in poker tasks, evaluation of gameplay outcomes and reasoning traces. Proposed ToolPoker framework integrates external solvers for GTO-consistent actions with professional-style explanations.

Result: LLMs fail against traditional algorithms, showing three flaws: heuristic reliance, factual misunderstandings, and reasoning-action gaps. ToolPoker achieves state-of-the-art gameplay with reasoning traces reflecting game-theoretic principles.

Conclusion: Current LLMs lack proper game-theoretic reasoning for strategic domains like poker; tool-integrated approaches combining external solvers can bridge this gap effectively.

Abstract: As Large Language Models (LLMs) are increasingly applied in high-stakes domains, their ability to reason strategically under uncertainty becomes critical. Poker provides a rigorous testbed, requiring not only strong actions but also principled, game-theoretic reasoning. In this paper, we conduct a systematic study of LLMs in multiple realistic poker tasks, evaluating both gameplay outcomes and reasoning traces. Our analysis reveals LLMs fail to compete against traditional algorithms and identifies three recurring flaws: reliance on heuristics, factual misunderstandings, and a “knowing-doing” gap where actions diverge from reasoning. An initial attempt with behavior cloning and step-level reinforcement learning improves reasoning style but remains insufficient for accurate game-theoretic play. Motivated by these limitations, we propose ToolPoker, a tool-integrated reasoning framework that combines external solvers for GTO-consistent actions with more precise professional-style explanations. Experiments demonstrate that ToolPoker achieves state-of-the-art gameplay while producing reasoning traces that closely reflect game-theoretic principles.

[731] Uncovering Latent Communication Patterns in Brain Networks via Adaptive Flow Routing

Tianhao Huang, Guanghui Min, Zhenyu Lei, Aiying Zhang, Chen Chen

Main category: cs.AI

TL;DR: AFR-Net is a physics-informed neural network that models how structural brain connectivity gives rise to functional communication patterns, enabling interpretable discovery of critical neural pathways through adaptive flow routing.

Details

Motivation: Current methods for fusing structural and functional brain connectivity lack fundamental neuroscientific insight and fail to uncover latent interactions between neural regions, unable to explain why SC and FC exhibit both coupling and heterogeneity.

Method: Proposes Adaptive Flow Routing Network (AFR-Net), a physics-informed framework that models neural communication dynamics by simulating how structural constraints (SC) give rise to functional communication patterns (FC) through adaptive flow routing mechanisms.

Result: AFR-Net significantly outperforms state-of-the-art baselines in extensive experiments, demonstrating superior performance in multi-modal brain connectivity fusion tasks.

Conclusion: The paper presents an interpretable, physics-inspired approach to understanding brain connectivity that bridges structural and functional data through neural communication dynamics, offering insights into critical neural pathways.

Abstract: Unraveling how macroscopic cognitive phenotypes emerge from microscopic neuronal connectivity remains one of the core pursuits of neuroscience. To this end, researchers typically leverage multi-modal information from structural connectivity (SC) and functional connectivity (FC) to complete downstream tasks. Recent methodologies explore the intricate coupling mechanisms between SC and FC, attempting to fuse their representations at the regional level. However, lacking fundamental neuroscientific insight, these approaches fail to uncover the latent interactions between neural regions underlying these connectomes, and thus cannot explain why SC and FC exhibit dynamic states of both coupling and heterogeneity. In this paper, we formulate multi-modal fusion through the lens of neural communication dynamics and propose the Adaptive Flow Routing Network (AFR-Net), a physics-informed framework that models how structural constraints (SC) give rise to functional communication patterns (FC), enabling interpretable discovery of critical neural pathways. Extensive experiments demonstrate that AFR-Net significantly outperforms state-of-the-art baselines. The code is available at https://anonymous.4open.science/r/DIAL-F0D1.

[732] Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li, Ruixiang Luo, Haoxiang Sun, Yucheng Wang, Zhengze Li, Meng Wang, Yuetian Du, Guojie Lin, Yaxuan Wang, Xiaoxiao Xu, Yanhu Mo, Xuan Ren, Hu Wei, Ze Xu

Main category: cs.AI

TL;DR: A new benchmark called ReasoningMath-Plus with 150 problems designed to evaluate structural reasoning beyond template-based computation, featuring HCRS scoring and Process Reward Model for fine-grained evaluation.

Details

Motivation: Current LLMs achieve near-saturation accuracy on existing math benchmarks due to template-based computation and shallow arithmetic decomposition, which fails to assess genuine reasoning skills like multi-constraint coordination, constructive logical synthesis, and spatial inference.

Method: Created ReasoningMath-Plus benchmark with 150 carefully curated problems emphasizing reasoning under interacting constraints, constructive solution formation, and non-trivial structural insight. Introduced HCRS (Hazard-aware Chain-based Rule Score) for deterministic step-level scoring and trained a Process Reward Model on annotated reasoning traces.

Result: Leading models achieve relatively high final-answer accuracy (up to 5.8/10), but HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing answer-only metrics overestimate reasoning robustness.

Conclusion: The paper demonstrates that current benchmarks fail to properly evaluate structural reasoning, and proposes a new evaluation framework that reveals significant gaps in LLMs’ reasoning capabilities beyond template-based computation.

Abstract: Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce HCRS (Hazard-aware Chain-based Rule Score), a deterministic step-level scoring function, and train a Process Reward Model (PRM) on the annotated reasoning traces. Empirically, while leading models attain relatively high final-answer accuracy (up to 5.8/10), HCRS-based holistic evaluation yields substantially lower scores (average 4.36/10, best 5.14/10), showing that answer-only metrics can overestimate reasoning robustness.

Yifei Shao, Kun Zhou, Ziming Xu, Mohammad Atif Quamar, Shibo Hao, Zhen Wang, Zhiting Hu, Biwei Huang

Main category: cs.AI

TL;DR: Modal-mixed Chain-of-Thought that interleaves text tokens with visual sketches as latent embeddings for multimodal reasoning, using VLM as encoder and diffusion decoder for visual details.

Details

Motivation: Text-only CoT fails on vision-intensive problems where key intermediate states are inherently visual. Need to extend CoT beyond language to better handle multimodal reasoning.

Method: Introduces modal-mixed CoT with visual sketches as latent embeddings. Uses VLM as encoder, trains language backbone to reconstruct its own vision embeddings. Attaches diffusion-based latent decoder conditioned on VLM hidden states. Two-stage training: supervised fine-tuning with joint next-token and latent-reconstruction, then reinforcement learning for modality switching.

Result: Extensive experiments across 11 diverse multimodal reasoning tasks demonstrate better performance than language-only and other CoT methods.

Conclusion: Modal-mixed CoT effectively handles multimodal reasoning by interleaving text with visual representations, cleanly disentangling roles between VLM and diffusion decoder.

Abstract: We study how to extend chain-of-thought (CoT) beyond language to better handle multimodal reasoning. While CoT helps LLMs and VLMs articulate intermediate steps, its text-only form often fails on vision-intensive problems where key intermediate states are inherently visual. We introduce modal-mixed CoT, which interleaves textual tokens with compact visual sketches represented as latent embeddings. To bridge the modality gap without eroding the original knowledge and capability of the VLM, we use the VLM itself as an encoder and train the language backbone to reconstruct its own intermediate vision embeddings, to guarantee the semantic alignment of the visual latent space. We further attach a diffusion-based latent decoder, invoked by a special control token and conditioned on hidden states from the VLM. In this way, the diffusion head carries fine-grained perceptual details while the VLM specifies high-level intent, which cleanly disentangles roles and reduces the optimization pressure of the VLM. Training proceeds in two stages: supervised fine-tuning on traces that interleave text and latents with a joint next-token and latent-reconstruction objective, followed by reinforcement learning that teaches when to switch modalities and how to compose long reasoning chains. Extensive experiments across 11 diverse multimodal reasoning tasks, demonstrate that our method yields better performance than language-only and other CoT methods. Our code will be publicly released.

[734] Small Shifts, Large Gains: Unlocking Traditional TSP Heuristic Guided-Sampling via Unsupervised Neural Instance Modification

Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang

Main category: cs.AI

TL;DR: TSP-MDF is a neural-based instance modification framework that enhances traditional TSP heuristics by strategically shifting node coordinates to enable guided sampling and escape local optima, achieving neural-level performance with minimal training.

Details

Motivation: Traditional TSP heuristics are efficient but deterministic and prone to local optima, while neural methods require extensive training and supervision. There's a need to bridge this gap with practical solutions that combine the strengths of both approaches.

Method: Proposes TSP-MDF framework with a neural-based instance modifier that strategically shifts node coordinates to create modified instances. Traditional heuristics construct tours on these modified instances, which are then mapped back to the original problem, enabling exploration beyond local optima without requiring ground-truth supervision.

Result: Significantly improves traditional heuristic performance on large-scale TSP benchmarks and real-world problems, achieving solution quality comparable to neural-based methods but with extremely short training time.

Conclusion: TSP-MDF successfully bridges the gap between traditional and neural TSP methods by enabling guided sampling in traditional heuristics through instance modification, offering practical performance improvements without extensive training requirements.

Abstract: The Traveling Salesman Problem (TSP) is one of the most representative NP-hard problems in route planning and a long-standing benchmark in combinatorial optimization. Traditional heuristic tour constructors, such as Farthest or Nearest Insertion, are computationally efficient and highly practical, but their deterministic behavior limits exploration and often leads to local optima. In contrast, neural-based heuristic tour constructors alleviate this issue through guided-sampling and typically achieve superior solution quality, but at the cost of extensive training and reliance on ground-truth supervision, hindering their practical use. To bridge this gap, we propose TSP-MDF, a novel instance modification framework that equips traditional deterministic heuristic tour constructors with guided-sampling capability. Specifically, TSP-MDF introduces a neural-based instance modifier that strategically shifts node coordinates to sample multiple modified instances, on which the base traditional heuristic tour constructor constructs tours that are mapped back to the original instance, allowing traditional tour constructors to explore higher-quality tours and escape local optima. At the same time, benefiting from our instance modification formulation, the neural-based instance modifier can be trained efficiently without any ground-truth supervision, ensuring the framework maintains practicality. Extensive experiments on large-scale TSP benchmarks and real-world benchmarks demonstrate that TSP-MDF significantly improves the performance of traditional heuristics tour constructors, achieving solution quality comparable to neural-based heuristic tour constructors, but with an extremely short training time.

[735] Exploring Information Seeking Agent Consolidation

Guochen Yan, Jialong Wu, Zhengwei Tao, Bo Li, Qintong Zhang, Jiahao Xu, Haitao Mi, Yuejian Fang, Qingni Shen, Wentao Zhang, Zhonghai Wu

Main category: cs.AI

TL;DR: The paper investigates consolidating heterogeneous information-seeking agents into a single foundation agentic model using data-level and parameter-level consolidation strategies, comparing their performance, generalization, and interference.

Details

Motivation: Existing information-seeking agents are specialized for specific domains (open web, documents, or local knowledge bases), which limits scalability and cross-domain generalization. The authors aim to create a unified foundation agentic model that can handle diverse information-seeking tasks.

Method: Two consolidation strategies: 1) Data-level consolidation - joint training on mixed domain-specific datasets, 2) Parameter-level consolidation - merging independently trained agent models at the parameter level. The analysis compares performance retention, cross-domain generalization, and interference across behaviors.

Result: Data-level consolidation remains a strong and stable baseline, while parameter-level consolidation offers an efficient alternative but suffers from interference and robustness challenges. Key design factors for effective parameter-level consolidation include fine-grained merging granularity, task heterogeneity awareness, and principled consensus strategies.

Conclusion: Both consolidation strategies have trade-offs: data-level consolidation is more robust while parameter-level consolidation is more efficient but requires careful design. The identified design factors provide guidance for building effective unified agentic models.

Abstract: Information-seeking agents have emerged as a powerful paradigm for solving knowledge-intensive tasks. Existing information-seeking agents are typically specialized for open web, documents, or local knowledge bases, which constrains scalability and cross-domain generalization. In this work, we investigate how to consolidate heterogeneous information-seeking agents into a single foundation agentic model. We study two complementary consolidation strategies: data-level consolidation, which jointly trains a unified model on a mixture of domain-specific datasets, and parameter-level consolidation, which merges independently trained agent models at the parameter level. Our analysis compares these approaches in terms of performance retention, cross-domain generalization, and interference across information-seeking behaviors. Our results show that data-level consolidation remains a strong and stable baseline, while parameter-level consolidation offers a promising, efficient alternative but suffers from interference and robustness challenges. We further identify key design factors for effective agent consolidation at the parameter level, including fine-grained merging granularity, awareness of task heterogeneity, and principled consensus strategy.

[736] DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder

Jiaran Zhang, Luck Ma, Yanhao Li, Fanqi Wan, Di Qi, Xu Zhao, Jieyi Hou, Zhe Xie, Mengqiang Ren, Xin Wu, Zhewei Huang, Liangyu Chen, Yingwei Ma, Qi Han, Xiangyu Zhang

Main category: cs.AI

TL;DR: DockSmith is an agentic Docker builder that treats environment construction as a core agentic capability, improving software engineering agent training and evaluation by addressing Docker-based environment bottlenecks.

Details

Motivation: Reliable Docker-based environment construction is a major bottleneck for scaling execution-grounded training and evaluation of software engineering agents. Current approaches treat environment setup as a preprocessing step rather than a core agentic capability.

Method: DockSmith is a specialized agentic Docker builder that exercises long-horizon tool use, dependency reasoning, and failure recovery. It’s trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline with loop-detection and cross-task success memory.

Result: Training a 30B-A3B model achieves open-source state-of-the-art performance on Multi-Docker-Eval (39.72% Fail-to-Pass and 58.28% Commit Rate). DockSmith also improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0.

Conclusion: DockSmith demonstrates that treating environment construction as a core agentic capability yields supervision that transfers beyond Docker building itself, providing broader agentic benefits for software engineering tasks.

Abstract: Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not only as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating broader agentic benefits of environment construction.

[737] Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

Shuo Ren, Can Xie, Pu Jian, Zhenjiang Ren, Chunlin Leng, Jiajun Zhang

Main category: cs.AI

TL;DR: Survey paper reviewing LLM-based scientific agents that automate research tasks like hypothesis generation, experiment design, and data analysis, with focus on their specialized architectures, applications, and ethical considerations.

Details

Motivation: Scientific research is becoming increasingly complex with vast data and interdisciplinary needs, requiring innovative tools to manage complexity and accelerate discovery. LLMs are evolving into specialized scientific agents that can automate critical research tasks.

Method: This is a survey paper that provides a focused review of LLM-based scientific agents, examining their architectures, design principles, benchmarks, applications across scientific fields, and ethical considerations. It compares them to general-purpose LLMs and agents.

Result: The survey identifies key characteristics of scientific agents: integration of domain-specific knowledge, advanced tool sets, robust validation mechanisms, ability to handle complex data types, and ensuring reproducibility. It highlights how these agents advance research across various scientific fields.

Conclusion: LLM-based scientific agents represent a significant advancement for scientific discovery, offering a comprehensive roadmap for researchers to harness these agents for more efficient, reliable, and ethically sound scientific research.

Abstract: As scientific research becomes increasingly complex, innovative tools are needed to manage vast data, facilitate interdisciplinary collaboration, and accelerate discovery. Large language models (LLMs) are now evolving into LLM-based scientific agents that automate critical tasks ranging from hypothesis generation and experiment design to data analysis and simulation. Unlike general-purpose LLMs, these specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms, enabling them to handle complex data types, ensure reproducibility, and drive scientific breakthroughs. This survey provides a focused review of the architectures, design, benchmarks, applications, and ethical considerations surrounding LLM-based scientific agents. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields. By examining their development and challenges, this survey offers a comprehensive roadmap for researchers and practitioners to harness these agents for more efficient, reliable, and ethically sound scientific discovery.

[738] Scalable Generative Game Engine: Breaking the Resolution Wall via Hardware-Algorithm Co-Design

Wei Zeng, Xuchen Li, Ruili Feng, Zhen Liu, Fengwei An, Jian Zhao

Main category: cs.AI

TL;DR: Hardware-algorithm co-design framework enables real-time high-resolution neural game simulation by addressing memory wall constraints through heterogeneous architecture and optimization techniques.

Details

Motivation: Existing generative game engines are limited by the "Memory Wall" to low resolutions (e.g., 64×64), preventing practical high-resolution neural simulations. There's a need to bridge generative models with high-resolution real-time game simulation.

Method: Proposes a scalable hardware-algorithm co-design framework with: 1) asymmetric resource allocation for throughput optimization under sequence parallelism, 2) memory-centric operator fusion to minimize off-chip bandwidth, and 3) manifold-aware latent extrapolation to exploit temporal redundancy and mask latency.

Result: Achieves real-time generation at 720×480 resolution (50× pixel throughput increase over baselines), delivering 26.4 FPS for 3D racing and 48.3 FPS for 2D platformer benchmarks with 2.7 ms amortized latency.

Conclusion: Resolving the Memory Wall through architectural co-design is essential for enabling high-fidelity, responsive neural gameplay, representing a prerequisite rather than just an optimization for generative game engines.

Abstract: Real-time generative game engines represent a paradigm shift in interactive simulation, promising to replace traditional graphics pipelines with neural world models. However, existing approaches are fundamentally constrained by the Memory Wall,'' restricting practical deployments to low resolutions (e.g., $64 \times 64$). This paper bridges the gap between generative models and high-resolution neural simulations by introducing a scalable \textit{Hardware-Algorithm Co-Design} framework. We identify that high-resolution generation suffers from a critical resource mismatch: the World Model is compute-bound while the Decoder is memory-bound. To address this, we propose a heterogeneous architecture that intelligently decouples these components across a cluster of AI accelerators. Our system features three core innovations: (1) an asymmetric resource allocation strategy that optimizes throughput under sequence parallelism constraints; (2) a memory-centric operator fusion scheme that minimizes off-chip bandwidth usage; and (3) a manifold-aware latent extrapolation mechanism that exploits temporal redundancy to mask latency. We validate our approach on a cluster of programmable AI accelerators, enabling real-time generation at $720 \times 480$ resolution -- a $50\times$ increase in pixel throughput over prior baselines. Evaluated on both continuous 3D racing and discrete 2D platformer benchmarks, our system delivers fluid 26.4 FPS and 48.3 FPS respectively, with an amortized effective latency of 2.7 ms. This work demonstrates that resolving the Memory Wall’’ via architectural co-design is not merely an optimization, but a prerequisite for enabling high-fidelity, responsive neural gameplay.

[739] Past-Discounting is Key for Learning Markovian Fairness with Long Horizons

Ashwin Kumar, William Yeoh

Main category: cs.AI

TL;DR: A framework for temporal fairness in multi-agent resource allocation that uses past-discounting to bound state space growth, enabling scalable learning of fair policies over long horizons.

Details

Motivation: Existing fairness methods either ignore temporal dynamics or require perfect recall of all past utilities, leading to unbounded state space growth that hinders scalability and convergence of learning algorithms.

Method: Introduces a past-discounted framework for memory tracking that discounts distant events, creating a principled interpolation between instantaneous and perfect-recall fairness, with theoretical guarantees of bounded state space.

Result: Proves that past-discounting guarantees a bounded, horizon-independent state space (unlike perfect-recall methods), enabling tractable learning of fair policies over arbitrarily long horizons.

Conclusion: Provides a scalable framework for temporal fairness in resource allocation systems by incorporating behavioral insights about human fairness judgments discounting distant events.

Abstract: Fairness is an important consideration for dynamic resource allocation in multi-agent systems. Many existing methods treat fairness as a one-shot problem without considering temporal dynamics, which misses the nuances of accumulating inequalities over time. Recent approaches overcome this limitation by tracking allocations over time, assuming perfect recall of all past utilities. While the former neglects long-term equity, the latter introduces a critical challenge: the augmented state space required to track cumulative utilities grows unboundedly with time, hindering the scalability and convergence of learning algorithms. Motivated by behavioral insights that human fairness judgments discount distant events, we introduce a framework for temporal fairness that incorporates past-discounting into the learning problem. This approach offers a principled interpolation between instantaneous and perfect-recall fairness. Our central contribution is a past-discounted framework for memory tracking and a theoretical analysis of fairness memories, showing past-discounting guarantees a bounded, horizon-independent state space, a property that we prove perfect-recall methods lack. This result unlocks the ability to learn fair policies tractably over arbitrarily long horizons. We formalize this framework, demonstrate its necessity with experiments showing that perfect recall fails where past-discounting succeeds, and provide a clear path toward building scalable and equitable resource allocation systems.

[740] Structured Self-Consistency:A Multi-Task Evaluation of LLMs on VirtualHome

Jiaqi Xu, Tao Huang, Kai Zhang

Main category: cs.AI

TL;DR: Evaluation of 7B-parameter LLMs (OPENPANGU-7B and QWEN2.5-7B) on embodied AI tasks using VirtualHome benchmark with EAI framework, proposing Structured Self-Consistency decoding for improved performance.

Details

Motivation: Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments. There's a need to evaluate how well current LLMs perform on fundamental embodied AI tasks and develop methods to improve their performance in structured generation tasks.

Method: Comprehensive evaluation using VirtualHome benchmark with Embodied Agent Interface framework. Compare two 7B-parameter models across four tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling. Propose Structured Self-Consistency (SSC) - an enhanced decoding strategy using multiple sampling with domain-specific voting mechanisms.

Result: SSC significantly enhances performance. OPENPANGU-7B excels at hierarchical planning while QWEN2.5-7B shows advantages in action-level tasks. Models reveal complementary strengths across different task types.

Conclusion: The evaluation provides insights for future embodied AI system development, showing that different LLMs have complementary strengths and that structured decoding strategies like SSC can significantly improve performance on embodied AI tasks.

Abstract: Embodied AI requires agents to understand goals, plan actions, and execute tasks in simulated environments.We present a comprehensive evaluation of Large Language Models (LLMs) on the VirtualHome benchmark using the Embodied Agent Interface (EAI) framework.We compare two representative 7B-parameter models OPENPANGU-7B and QWEN2.5-7B across four fundamental tasks: Goal Interpretation, Action Sequencing, Subgoal Decomposition, and Transition Modeling.We propose Structured Self-Consistency (SSC), an enhanced decoding strategy that leverages multiple sampling with domain-specific voting mechanisms to improve output quality for structured generation tasks. Experimental results demonstrate that SSC significantly enhances performance, with OPENPANGU-7B excelling at hierarchical planning while QWEN2.5-7B show advantages in action-level tasks. Our analysis reveals complementary strengths across model types, providing insights for future embodied AI system development.

[741] MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Zhihao Peng, Liuxin Bao, Yixuan Yuan

Main category: cs.AI

TL;DR: MAC framework uses Pareto-optimal agent selection and cross-consistency masking to enable adaptive multi-agent collaboration for medical decision-making.

Details

Motivation: Current multi-agent systems in healthcare using LLMs suffer from rigid collaboration patterns and lack systematic agent construction, leading to collaboration failures and performance degradation in medical decision-making scenarios.

Method: 1) Pareto-frontier analysis of LLM pool considering model size, inference time, diversity score, and throughput ratio; 2) Select Pareto-optimal models as agents; 3) Measure pairwise similarity between agent outputs to compute cross-consistency; 4) Mask agent with lowest cross-consistency; 5) Adaptive progressive propagation where each agent aggregates unmasked previous layer outputs via prompt engineering.

Result: The framework achieves adaptive progressive propagation of collaborative information, boosting medical decision-making capacity by eliminating semantically inconsistent outputs and enabling efficient agent collaboration.

Conclusion: MAC framework addresses collaboration failures in multi-agent systems for healthcare by providing systematic agent construction and adaptive collaboration patterns, improving medical decision-making performance.

Abstract: Large language models (LLMs) have proven effective in artificial intelligence, where the multi-agent system (MAS) holds considerable promise for healthcare development by achieving the collaboration of LLMs. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier factors analysis towards the LLMs pool to consider their key factors, including the model size, inference time, diversity score, and throughput ratio, where we calculate the similarity between pairwise outputs within an LLM to derive its diversity score. Beyond this analysis, we enable the identification of Pareto-optimal models that balance efficiency and capability, which are subsequently selected as collaborative agents to consider the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs from collaborative agents to determine their cross-consistency values, subsequently masking out the agent with the lowest cross-consistency value to eliminate the output that is likely semantically inconsistent. Finally, we conduct collaboration of agents by achieving adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input to generate the corresponding output via prompt engineering.

[742] Inference-Only Prompt Projection for Safe Text-to-Image Generation with TV Guarantees

Minhyuk Lee, Hyekyung Yoon, Myungjoo Kang

Main category: cs.AI

TL;DR: A prompt projection framework for text-to-image diffusion models that reduces unsafe generations without retraining, using total variation theory to formalize the safety-alignment trade-off.

Details

Motivation: Real-world deployment of text-to-image diffusion models requires safety safeguards that suppress unsafe generations while preserving benign prompt-image alignment, creating a fundamental trade-off that needs principled solutions.

Method: Proposes an inference-only prompt projection framework that selectively intervenes on high-risk prompts using a surrogate objective with verification, mapping them into a tolerance-controlled safe set without retraining or fine-tuning the generator, guided by total variation theory.

Result: Achieves 16.7-60.0% relative reductions in inappropriate percentage versus strong model-level alignment baselines across four datasets and three diffusion backbones, while preserving benign prompt-image alignment on COCO near the unaligned reference.

Conclusion: The total variation lens provides a principled framework for understanding the safety-prompt alignment trade-off, and the proposed prompt projection method effectively reduces unsafe generations while maintaining alignment for benign prompts without model retraining.

Abstract: Text-to-Image (T2I) diffusion models enable high-quality open-ended synthesis, but their real-world deployment demands safeguards that suppress unsafe generations without degrading benign prompt-image alignment. We formalize this tension through a total variation (TV) lens: once the reference conditional distribution is fixed, any nontrivial reduction in unsafe generations necessarily incurs TV deviation from the reference, yielding a principled Safety-Prompt Alignment Trade-off (SPAT). Guided by this view, we propose an inference-only prompt projection framework that selectively intervenes on high-risk prompts via a surrogate objective with verification, mapping them into a tolerance-controlled safe set while leaving benign prompts effectively unchanged, without retraining or fine-tuning the generator. Across four datasets and three diffusion backbones, our approach achieves 16.7-60.0% relative reductions in inappropriate percentage (IP) versus strong model-level alignment baselines, while preserving benign prompt-image alignment on COCO near the unaligned reference.

[743] SemanticALLI: Caching Reasoning, Not Just Responses, in Agentic Systems

Varun Chillara, Dylan Kline, Christopher Alvares, Evan Wooten, Huan Yang, Shlok Khetan, Cade Bauer, Tré Guillory, Tanishka Shah, Yashodhara Dhariwal, Volodymyr Pavlov, George Popstefanov

Main category: cs.AI

TL;DR: SemanticALLI improves agentic AI pipeline efficiency by caching structured intermediate representations to avoid redundant reasoning, achieving 83% cache hit rate and bypassing thousands of LLM calls.

Details

Motivation: Agentic AI pipelines frequently reconstruct identical intermediate logic (like metric normalization or chart scaffolding) even with novel natural language inputs, but conventional caching fails because it treats inference as a monolithic black box.

Method: SemanticALLI decomposes generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), elevating structured intermediate representations to first-class, cacheable artifacts within a pipeline-aware architecture.

Result: Baseline monolithic caching achieves only 38.7% hit rate, while SemanticALLI’s structured approach achieves 83.10% hit rate in Visualization Synthesis stage, bypassing 4,023 LLM calls with median latency of 2.66 ms.

Conclusion: Even when users rarely repeat themselves, AI pipelines often do at stable, structured checkpoints where caching is most reliable, offering practical lessons for AI system design through internal reuse.

Abstract: Agentic AI pipelines suffer from a hidden inefficiency: they frequently reconstruct identical intermediate logic, such as metric normalization or chart scaffolding, even when the user’s natural language phrasing is entirely novel. Conventional boundary caching fails to capture this inefficiency because it treats inference as a monolithic black box. We introduce SemanticALLI, a pipeline-aware architecture within Alli (PMG’s marketing intelligence platform), designed to operationalize redundant reasoning. By decomposing generation into Analytic Intent Resolution (AIR) and Visualization Synthesis (VS), SemanticALLI elevates structured intermediate representations (IRs) to first-class, cacheable artifacts. The impact of caching within the agentic loop is substantial. In our evaluation, baseline monolithic caching caps at a 38.7% hit rate due to linguistic variance. In contrast, our structured approach allows for an additional stage, the Visualization Synthesis stage, to achieve an 83.10% hit rate, bypassing 4,023 LLM calls with a median latency of just 2.66 ms. This internal reuse reduces total token consumption, offering a practical lesson for AI system design: even when users rarely repeat themselves, the pipeline often does, at stable, structured checkpoints where caching is most reliable.

[744] Predictive Maintenance for Ultrafiltration Membranes Using Explainable Similarity-Based Prognostics

Qusai Khaled, Laura Genga, Uzay Kaymak

Main category: cs.AI

TL;DR: Proposes an explainable fuzzy similarity reasoning framework for predicting remaining useful life of ultrafiltration membranes in desalination systems, achieving interpretable predictions with 4.50 cycles mean absolute error.

Details

Motivation: Current predictive maintenance models for ultrafiltration membranes lack interpretability and operator trust, leading plants to rely on scheduled preventive maintenance despite performance degradation from fouling.

Method: Uses physics-informed Health Index from transmembrane pressure, flux, and resistance; fuzzifies degradation dynamics via Gaussian membership functions; applies similarity reasoning to identify historical degradation patterns; formulates RUL predictions as Takagi-Sugeno fuzzy rules weighted by similarity.

Result: Achieved mean absolute error of 4.50 cycles on 12,528 operational cycles from industrial-scale UF system, while generating interpretable rule bases consistent with expert understanding.

Conclusion: The explainable fuzzy similarity framework provides transparent, trustworthy RUL predictions for UF membranes, bridging the gap between opaque machine learning models and operator acceptance in industrial settings.

Abstract: In reverse osmosis desalination, ultrafiltration (UF) membranes degrade due to fouling, leading to performance loss and costly downtime. Most plants rely on scheduled preventive maintenance, since existing predictive maintenance models, often based on opaque machine learning methods, lack interpretability and operator trust. This study proposes an explainable prognostic framework for UF membrane remaining useful life (RUL) estimation using fuzzy similarity reasoning. A physics-informed Health Index, derived from transmembrane pressure, flux, and resistance, captures degradation dynamics, which are then fuzzified via Gaussian membership functions. Using a similarity measure, the model identifies historical degradation trajectories resembling the current state and formulates RUL predictions as Takagi-Sugeno fuzzy rules. Each rule corresponds to a historical exemplar and contributes to a transparent, similarity-weighted RUL estimate. Tested on 12,528 operational cycles from an industrial-scale UF system, the framework achieved a mean absolute error of 4.50 cycles, while generating interpretable rule bases consistent with expert understanding.

[745] SEISMO: Increasing Sample Efficiency in Molecular Optimization with a Trajectory-Aware LLM Agent

Fabian P. Krüger, Andrea Hunklinger, Adrian Wolny, Tim J. Adler, Igor Tetko, Santiago David Villalba

Main category: cs.AI

TL;DR: SEISMO is an LLM agent for sample-efficient molecular optimization that performs strictly online, inference-time optimization using natural language task descriptions and explanatory feedback.

Details

Motivation: Molecular optimization for drug discovery relies on costly experimental assays, requiring highly sample-efficient methods that can work with limited oracle calls.

Method: SEISMO is an LLM agent that performs strictly online, inference-time optimization, updating after every oracle call without population-based learning. It conditions proposals on full optimization trajectories using natural language task descriptions, scalar scores, and structured explanatory feedback.

Result: SEISMO achieves 2-3 times higher area under the optimization curve than prior methods across 23 tasks in the Practical Molecular Optimization benchmark, often reaching near-maximal scores within 50 oracle calls. Explanatory feedback further improves efficiency.

Conclusion: Leveraging domain knowledge and structured information through LLM agents enables highly sample-efficient molecular optimization, particularly valuable for drug discovery with costly experimental assays.

Abstract: Optimizing the structure of molecules to achieve desired properties is a central bottleneck across the chemical sciences, particularly in the pharmaceutical industry where it underlies the discovery of new drugs. Since molecular property evaluation often relies on costly and rate-limited oracles, such as experimental assays, molecular optimization must be highly sample-efficient. To address this, we introduce SEISMO, an LLM agent that performs strictly online, inference-time molecular optimization, updating after every oracle call without the need for population-based or batched learning. SEISMO conditions each proposal on the full optimization trajectory, combining natural-language task descriptions with scalar scores and, when available, structured explanatory feedback. Across the Practical Molecular Optimization benchmark of 23 tasks, SEISMO achieves a 2-3 times higher area under the optimisation curve than prior methods, often reaching near-maximal task scores within 50 oracle calls. Our additional medicinal-chemistry tasks show that providing explanatory feedback further improves efficiency, demonstrating that leveraging domain knowledge and structured information is key to sample-efficient molecular optimization.

[746] HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Xuan Liu, Haoyang Shang, Zizhang Liu, Xinyan Liu, Yunze Xiao, Yiwen Tu, Haojian Jin

Main category: cs.AI

TL;DR: A framework for using LLMs as simulated participants in social science experiments with a benchmark (HUMANSTUDY-BENCH) that evaluates agent fidelity to human behavior across experimental protocols.

Details

Motivation: Current use of LLMs as simulated participants in social science experiments suffers from unstable behavior and sensitivity to design choices, with prior evaluations conflating base-model capabilities with experimental instantiation, making it unclear whether outcomes reflect the model itself or the agent setup.

Method: Frames participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and specification. Introduces HUMANSTUDY-BENCH benchmark and execution engine with Filter-Extract-Execute-Evaluate pipeline that orchestrates LLM-based agents to reconstruct published human-subject experiments, replaying trial sequences and running original analysis pipelines in a shared runtime.

Result: Instantiated 12 foundational studies as an initial suite covering individual cognition, strategic interaction, and social psychology, with more than 6,000 trials and human samples ranging from tens to over 2,100 participants. Proposed new metrics to quantify agreement between human and agent behaviors at the level of scientific inference.

Conclusion: Provides a systematic framework for evaluating LLM-based agents as simulated participants in social science research, enabling better understanding of when and how LLMs can faithfully replicate human behavior in experimental settings.

Abstract: Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model capabilities with experimental instantiation, obscuring whether outcomes reflect the model itself or the agent setup. We instead frame participant simulation as an agent-design problem over full experimental protocols, where an agent is defined by a base model and a specification (e.g., participant attributes) that encodes behavioral assumptions. We introduce HUMANSTUDY-BENCH, a benchmark and execution engine that orchestrates LLM-based agents to reconstruct published human-subject experiments via a Filter–Extract–Execute–Evaluate pipeline, replaying trial sequences and running the original analysis pipeline in a shared runtime that preserves the original statistical procedures end to end. To evaluate fidelity at the level of scientific inference, we propose new metrics to quantify how much human and agent behaviors agree. We instantiate 12 foundational studies as an initial suite in this dynamic benchmark, spanning individual cognition, strategic interaction, and social psychology, and covering more than 6,000 trials with human samples ranging from tens to over 2,100 participants.

[747] From Prompt to Graph: Comparing LLM-Based Information Extraction Strategies in Domain-Specific Ontology Development

Xuan Liu, Ziyu Li, Mu He, Ziyang Ma, Xiaoxu Wu, Gizem Yilmaz, Yiyuan Xia, Bingbing Li, He Tan, Jerry Ying Hsi Fuh, Wen Feng Lu, Anders E. W. Jarfors, Per Jansson

Main category: cs.AI

TL;DR: LLM-based approaches for automated ontology construction from domain-specific texts, applied to casting manufacturing

Details

Motivation: Traditional ontology construction is labor-intensive and costly, especially in specialized fields like casting manufacturing. LLMs offer new possibilities for automating knowledge extraction from domain-specific texts.

Method: Three LLM-based approaches: pre-trained LLM-driven method, in-context learning (ICL) method, and fine-tuning method to extract terms and relations from domain-specific texts using limited data.

Result: Comparison of performance between the three approaches, with the best-performing method used to build a casting ontology validated by domain experts.

Conclusion: LLMs can effectively automate ontology construction in specialized domains, with one of the three approaches proving most suitable for casting manufacturing knowledge extraction.

Abstract: Ontologies are essential for structuring domain knowledge, improving accessibility, sharing, and reuse. However, traditional ontology construction relies on manual annotation and conventional natural language processing (NLP) techniques, making the process labour-intensive and costly, especially in specialised fields like casting manufacturing. The rise of Large Language Models (LLMs) offers new possibilities for automating knowledge extraction. This study investigates three LLM-based approaches, including pre-trained LLM-driven method, in-context learning (ICL) method and fine-tuning method to extract terms and relations from domain-specific texts using limited data. We compare their performances and use the best-performing method to build a casting ontology that validated by domian expert.

[748] Self-Guard: Defending Large Reasoning Models via enhanced self-reflection

Jingnan Zheng, Jingjun Xu, Yanzhen Luo, Chenhang Cui, Gelei Deng, Zhenkai Liang, Xiang Wang, An Zhang, Tat-Seng Chua

Main category: cs.AI

TL;DR: Self-Guard is a lightweight safety defense framework for Large Reasoning Models that addresses the awareness-compliance gap through safety-oriented prompting and safety activation steering at the representational level.

Details

Motivation: Large Reasoning Models (LRMs) enable explicit reasoning but pose unique risks like reasoning manipulation and information leakage. Current alignment strategies are computationally intensive and fail to address the awareness-compliance gap where models recognize risks but prioritize following user instructions due to sycophantic tendencies.

Method: Two-stage framework: (1) Safety-oriented prompting activates the model’s latent safety awareness to evoke spontaneous reflection, (2) Safety activation steering extracts the directional shift in hidden state space and amplifies it to ensure safety compliance prevails over sycophancy during inference.

Result: Self-Guard effectively bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility. It exhibits strong generalization across diverse unseen risks and varying model scales.

Conclusion: Self-Guard offers a cost-efficient solution for LRM safety alignment by reinforcing safety compliance at the representational level, addressing limitations of current heavy post-training approaches.

Abstract: The emergence of Large Reasoning Models (LRMs) introduces a new paradigm of explicit reasoning, enabling remarkable advances yet posing unique risks such as reasoning manipulation and information leakage. To mitigate these risks, current alignment strategies predominantly rely on heavy post-training paradigms or external interventions. However, these approaches are often computationally intensive and fail to address the inherent awareness-compliance gap, a critical misalignment where models recognize potential risks yet prioritize following user instructions due to their sycophantic tendencies. To address these limitations, we propose Self-Guard, a lightweight safety defense framework that reinforces safety compliance at the representational level. Self-Guard operates through two principal stages: (1) safety-oriented prompting, which activates the model’s latent safety awareness to evoke spontaneous reflection, and (2) safety activation steering, which extracts the resulting directional shift in the hidden state space and amplifies it to ensure that safety compliance prevails over sycophancy during inference. Experiments demonstrate that Self-Guard effectively bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility. Furthermore, Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales, offering a cost-efficient solution for LRM safety alignment.

[749] Physics-informed Diffusion Generation for Geomagnetic Map Interpolation

Wenda Li, Tongya Zheng, Kaixuan Chen, Shunyu Liu, Haoze Jiang, Yunzhi Hao, Rui Miao, Zujie Ren, Mingli Song, Hang Shi, Gang Chen

Main category: cs.AI

TL;DR: A Physics-informed Diffusion Generation framework (PDG) for geomagnetic map interpolation that uses diffusion models guided by physics constraints to handle noise and adhere to physical laws.

Details

Motivation: Existing scattered data interpolation methods are not specifically designed for geomagnetic maps, leading to suboptimal performance due to detection noise and lack of physics constraints. There's a need for specialized interpolation that accounts for the physical properties of geomagnetic fields.

Method: PDG uses a physics-informed diffusion generation framework with two key components: 1) A physics-informed mask strategy based on local receptive fields to guide diffusion generation and eliminate noise interference, and 2) Physics-informed constraints following kriging principles to ensure results adhere to physical laws of geomagnetic maps.

Result: Extensive experiments on four real-world datasets demonstrate the superiority and effectiveness of each component of PDG compared to existing methods.

Conclusion: The proposed PDG framework effectively interpolates incomplete geomagnetic maps by incorporating physics constraints into diffusion generation, addressing noise issues and ensuring physical consistency.

Abstract: Geomagnetic map interpolation aims to infer unobserved geomagnetic data at spatial points, yielding critical applications in navigation and resource exploration. However, existing methods for scattered data interpolation are not specifically designed for geomagnetic maps, which inevitably leads to suboptimal performance due to detection noise and the laws of physics. Therefore, we propose a Physics-informed Diffusion Generation framework~(PDG) to interpolate incomplete geomagnetic maps. First, we design a physics-informed mask strategy to guide the diffusion generation process based on a local receptive field, effectively eliminating noise interference. Second, we impose a physics-informed constraint on the diffusion generation results following the kriging principle of geomagnetic maps, ensuring strict adherence to the laws of physics. Extensive experiments and in-depth analyses on four real-world datasets demonstrate the superiority and effectiveness of each component of PDG.

[750] Learning More from Less: Unlocking Internal Representations for Benchmark Compression

Yueqi Zhang, Jin Hu, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yiwei Li, Jiayi Shi, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

Main category: cs.AI

TL;DR: REPCORE: A method that uses aligned hidden state representations instead of discrete correctness labels to construct representative coresets for efficient LLM benchmarking, achieving accurate performance estimation with minimal source models.

Details

Motivation: Full-scale benchmarking of LLMs is prohibitively expensive. Existing coreset methods require many source models to estimate reliable item profiles, which becomes unstable with small source pools, especially for new benchmarks with minimal historical data. Discrete correctness labels are lossy and fail to capture information in hidden states.

Method: REPCORE aligns heterogeneous hidden states from different models into a unified latent space to construct representative coresets. It uses these subsets for performance extrapolation, enabling precise estimation with as few as ten source models.

Result: Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis reveals that aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.

Conclusion: Hidden state representations provide richer information than discrete correctness labels for constructing representative coresets. REPCORE enables efficient and accurate LLM benchmarking with minimal source models, addressing limitations of existing methods for new benchmarks.

Abstract: The prohibitive cost of evaluating Large Language Models (LLMs) necessitates efficient alternatives to full-scale benchmarking. Prevalent approaches address this by identifying a small coreset of items to approximate full-benchmark performance. However, existing methods must estimate a reliable item profile from response patterns across many source models, which becomes statistically unstable when the source pool is small. This dependency is particularly limiting for newly released benchmarks with minimal historical evaluation data. We argue that discrete correctness labels are a lossy view of the model’s decision process and fail to capture information encoded in hidden states. To address this, we introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Using these subsets for performance extrapolation, REPCORE achieves precise estimation accuracy with as few as ten source models. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy. Spectral analysis further indicates that the aligned representations contain separable components reflecting broad response tendencies and task-specific reasoning patterns.

[751] Neuro-symbolic AI for Predictive Maintenance (PdM) – review and recommendations

Kyle Hamilton, Ali Intizar

Main category: cs.AI

TL;DR: Survey paper reviewing predictive maintenance (PdM) methods over 5 years, comparing data-driven deep learning with traditional knowledge-based systems, and proposing neuro-symbolic AI as a hybrid solution for better accuracy, explainability, and robustness.

Details

Motivation: Current PdM systems face limitations: data-driven methods need large labeled datasets and lack generalizability/explainability, while traditional knowledge-based systems suffer from poor accuracy and require expert supervision. A hybrid approach could overcome these weaknesses.

Method: Systematic review of PdM literature over 5 years, analyzing industrial applications. Proposes neuro-symbolic AI architectures that integrate deep learning with symbolic logic, focusing on methods using sensor data and manually crafted rules as inputs.

Result: Data-driven methods show higher accuracy than knowledge-based systems but have adoption obstacles. Hybrid neuro-symbolic approaches offer potential for more accurate, explainable, interpretable, and robust PdM systems.

Conclusion: Neuro-symbolic AI represents a promising direction for PdM by combining strengths of data-driven and knowledge-based approaches, addressing key challenges in real-world industrial deployment.

Abstract: In this document we perform a systematic review the State-of-the-art in Predictive Maintenance (PdM) over the last five years in industrial settings such as commercial buildings, pharmaceutical facilities, or semi-conductor manufacturing. In general, data-driven methods such as those based on deep learning, exhibit higher accuracy than traditional knowledge-based systems. These systems however, are not without significant limitations. The need for large labeled data sets, a lack of generalizibility to new environments (out-of-distribution generalization), and a lack of transparency at inference time are some of the obstacles to adoption in real world environments. In contrast, traditional approaches based on domain expertise in the form of rules, logic or first principles suffer from poor accuracy, many false positives and a need for ongoing expert supervision and manual tuning. While the majority of approaches in recent literature utilize some form of data-driven architecture, there are hybrid systems which also take into account domain specific knowledge. Such hybrid systems have the potential to overcome the weaknesses of either approach on its own while preserving their strengths. We propose taking the hybrid approach even further and integrating deep learning with symbolic logic, or Neuro-symbolic AI, to create more accurate, explainable, interpretable, and robust systems. We describe several neuro-symbolic architectures and examine their strengths and limitations within the PdM domain. We focus specifically on methods which involve the use of sensor data and manually crafted rules as inputs by describing concrete NeSy architectures. In short, this survey outlines the context of modern maintenance, defines key concepts, establishes a generalized framework, reviews current modeling approaches and challenges, and introduces the proposed focus on Neuro-symbolic AI (NESY).

[752] Engineering AI Agents for Clinical Workflows: A Case Study in Architecture,MLOps, and Governance

Cláudio Lúcio do Val Lopes, João Marcus Pitta, Fabiano Belém, Gildson Alves, Flávio Vinícius Cruzeiro Martins

Main category: cs.AI

TL;DR: Industry case study of “Maria” platform - a production-grade AI system for primary healthcare that integrates four engineering pillars for trustworthy clinical AI

Details

Motivation: Address the software engineering challenge of integrating AI into clinical settings, moving from isolated models to robust, governable, and reliable systems. Current industrial applications suffer from brittle architectures and lack systemic oversight, creating a "responsibility vacuum" where safety and accountability are compromised.

Method: Presents the “Maria” platform with four foundational engineering pillars: 1) Clean Architecture for maintainability, 2) Event-driven architecture for resilience and auditability, 3) Agents as primary unit of modularity with autonomous MLOps lifecycle, 4) Human-in-the-Loop governance model integrated as critical event-driven data source.

Result: The platform serves as a reference architecture for building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains like healthcare, offering practical lessons for engineers.

Conclusion: Trustworthy clinical AI requires holistic integration of architectural principles, modular design with autonomous agents, and human oversight integrated as continuous improvement mechanism rather than just safety check.

Abstract: The integration of Artificial Intelligence (AI) into clinical settings presents a software engineering challenge, demanding a shift from isolated models to robust, governable, and reliable systems. However, brittle, prototype-derived architectures often plague industrial applications and a lack of systemic oversight, creating a responsibility vacuum'' where safety and accountability are compromised. This paper presents an industry case study of the Maria’’ platform, a production-grade AI system in primary healthcare that addresses this gap. Our central hypothesis is that trustworthy clinical AI is achieved through the holistic integration of four foundational engineering pillars. We present a synergistic architecture that combines Clean Architecture for maintainability with an Event-driven architecture for resilience and auditability. We introduce the Agent as the primary unit of modularity, each possessing its own autonomous MLOps lifecycle. Finally, we show how a Human-in-the-Loop governance model is technically integrated not merely as a safety check, but as a critical, event-driven data source for continuous improvement. We present the platform as a reference architecture, offering practical lessons for engineers building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains.

[753] Environment-Aware Adaptive Pruning with Interleaved Inference Orchestration for Vision-Language-Action Models

Yuting Huang, Leilei Ding, Zhipeng Tang, Zenghuan Zhu, Jiajun Deng, Xinrui Lin, Shuo Liu, Haojie Ren, Jianmin Ji, Yanyong Zhang

Main category: cs.AI

TL;DR: EcoVLA: A training-free adaptive pruning framework for Vision-Language-Action models that dynamically adjusts sparsity patterns based on environment changes to reduce inference latency while maintaining performance.

Details

Motivation: VLA models have high inference latency due to large parameter counts, hindering real-time robotic manipulation. Static pruning lacks adaptability to changing environments, while dynamic pruning has coarse granularity and high retraining overhead.

Method: Two-component framework: 1) Environment-aware Adaptive Pruning (EAP) - lightweight adaptive channel pruning using temporal consistency of physical environment; 2) Interleaved Inference Orchestration (I²O) - schedules pruning in parallel with inference using inherent FLOPs bubbles.

Result: Achieves up to 1.60× speedup with only 0.4% drop in success rate, and 2.18× speedup with 0.5% degradation when combined with token pruning. Validated on diverse VLA models, benchmarks, and real-world robots.

Conclusion: EcoVLA provides an effective training-free adaptive pruning solution for VLA models that maintains performance while significantly reducing inference latency, enabling real-time robotic manipulation.

Abstract: While Vision-Language-Action (VLA) models hold promise in embodied intelligence, their large parameter counts lead to substantial inference latency that hinders real-time manipulation, motivating parameter sparsification. However, as the environment evolves during VLA execution, the optimal sparsity patterns change accordingly. Static pruning lacks the adaptability required for environment dynamics, whereas fixed-interval dynamic layer pruning suffers from coarse granularity and high retraining overheads. To bridge this gap, we propose EcoVLA, a training-free, plug-and-play adaptive pruning framework that supports orthogonal combination with existing VLA acceleration methods. EcoVLA comprises two components: Environment-aware Adaptive Pruning (EAP) and Interleaved Inference Orchestration ($I^2O$). EAP is a lightweight adaptive channel pruning method that incorporates the temporal consistency of the physical environment to update sparsity patterns. $I^2O$ leverages the FLOPs bubbles inherent in VLA inference to schedule the pruning method in parallel, ensuring negligible impact on latency. Evaluated on diverse VLA models and benchmarks, EcoVLA delivers state-of-the-art performance, achieving up to 1.60$\times$ speedup with only a 0.4% drop in success rate, and further reaches 2.18$\times$ speedup with only a 0.5% degradation when combined with token pruning. We further validate the effectiveness of EcoVLA on real-world robots.

[754] World Models as an Intermediary between Agents and the Real World

Sherry Yang

Main category: cs.AI

TL;DR: World models can bridge the gap between LLM agents and high-cost real-world domains by serving as intermediaries that provide rich learning signals while overcoming sample inefficiency and off-policy learning challenges.

Details

Motivation: Current LLM agents excel in low-cost environments but fail in high-cost domains like robotics, ML engineering, and scientific experiments due to the prohibitive expense of executing actions to acquire reward signals.

Method: Proposes using world models as intermediaries between agents and the real world, viewing them as models of dynamics, rewards, and task distributions to overcome barriers like extreme off-policy learning and sample inefficiency.

Result: Demonstrates how world models can provide critical learning signals across domains including machine learning engineering, computer use, robotics, and AI for science.

Conclusion: Identifies challenges in building world models and proposes actionable items for dataset curation, architecture design, scaling, and evaluation to enable next-level agent performance in complex domains.

Abstract: Large language model (LLM) agents trained using reinforcement learning has achieved superhuman performance in low-cost environments like games, mathematics, and coding. However, these successes have not translated to complex domains where the cost of interaction is high, such as the physical cost of running robots, the time cost of ML engineering, and the resource cost of scientific experiments. The true bottleneck for achieving the next level of agent performance for these complex and high-cost domains lies in the expense of executing actions to acquire reward signals. To address this gap, this paper argues that we should use world models as an intermediary between agents and the real world. We discuss how world models, viewed as models of dynamics, rewards, and task distributions, can overcome fundamental barriers of high-cost actions such as extreme off-policy learning and sample inefficiency in long-horizon tasks. Moreover, we demonstrate how world models can provide critical and rich learning signals to agents across a broad set of domains, including machine learning engineering, computer use, robotics, and AI for science. Lastly, we identify the challenges of building these world models and propose actionable items along dataset curation, architecture design, scaling, and evaluation of world models.

[755] MissMAC-Bench: Building Solid Benchmark for Missing Modality Issue in Robust Multimodal Affective Computing

Ronghao Lin, Honghao Lu, Ruixing Wu, Aolin Xiong, Qinggong Chu, Qiaolin He, Sijie Mai, Haifeng Hu

Main category: cs.AI

TL;DR: MissMAC-Bench: A benchmark for evaluating multimodal affective computing models under missing modality scenarios, addressing robustness to incomplete multimodal inputs in real-world applications.

Details

Motivation: Real-world multimodal affective computing faces the missing modality problem where modality data availability is dynamic and uncertain, causing performance fluctuations due to distribution shifts and semantic deficiencies. Current MAC models heavily rely on complete multimodal data, limiting practical deployment.

Method: Introduces MissMAC-Bench with two guiding principles: no missing prior during training, and a single model handling both complete and incomplete modality scenarios. The benchmark integrates evaluation protocols with fixed and random missing patterns at dataset and instance levels.

Result: Extensive experiments on 3 widely-used language models across 4 datasets validate the effectiveness of diverse MAC approaches in tackling the missing modality issue, establishing fair evaluation standards from cross-modal synergy perspective.

Conclusion: MissMAC-Bench provides a solid foundation for advancing robust multimodal affective computing and promotes development of multimedia data mining by systematically quantifying and addressing the missing modality challenge.

Abstract: As a knowledge discovery task over heterogeneous data sources, current Multimodal Affective Computing (MAC) heavily rely on the completeness of multiple modalities to accurately understand human’s affective state. However, in real-world scenarios, the availability of modality data is often dynamic and uncertain, leading to substantial performance fluctuations due to the distribution shifts and semantic deficiencies of the incomplete multimodal inputs. Known as the missing modality issue, this challenge poses a critical barrier to the robustness and practical deployment of MAC models. To systematically quantify this issue, we introduce MissMAC-Bench, a comprehensive benchmark designed to establish fair and unified evaluation standards from the perspective of cross-modal synergy. Two guiding principles are proposed, including no missing prior during training, and one single model capable of handling both complete and incomplete modality scenarios, thereby ensuring better generalization. Moreover, to bridge the gap between academic research and real-world applications, our benchmark integrates evaluation protocols with both fixed and random missing patterns at the dataset and instance levels. Extensive experiments conducted on 3 widely-used language models across 4 datasets validate the effectiveness of diverse MAC approaches in tackling the missing modality issue. Our benchmark provides a solid foundation for advancing robust multimodal affective computing and promotes the development of multimedia data mining.

Yunjian Zhang, Sudong Wang, Yang Li, Peiran Xu, Conghao Zhou, Xiaoyue Ma, Jianing Li, Yao Zhu

Main category: cs.AI

TL;DR: DoPR reduces RLVR training costs by dynamically selecting single informative samples per batch based on uncertainty, cutting rollout overhead by ~10x while maintaining reasoning accuracy.

Details

Motivation: RLVR is effective for aligning LLMs with reasoning chains but is prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training.

Method: Proposes Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition.

Result: DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training.

Conclusion: DoPR provides a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications by addressing computational burden while maintaining performance.

Abstract: Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative training sample per batch for policy updates, guided by reward volatility and exploration-driven acquisition. DoPR reduces rollout overhead by nearly an order of magnitude while preserving competitive reasoning accuracy, offering a scalable and resource-efficient solution for LLM post-training. This approach offers a practical path toward more efficient and accessible RL-based training for reasoning-intensive LLM applications.

[757] Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, Yuguang Fang

Main category: cs.AI

TL;DR: InfoReasoner is a framework that uses synthetic semantic information gain rewards to optimize retrieval in agentic reasoning models, improving question-answering performance without manual annotations.

Details

Motivation: Current agentic reasoning models struggle with optimizing retrieval processes due to lack of dense, principled reward signals for information seeking behavior.

Method: Introduces InfoReasoner with theoretical redefinition of information gain as uncertainty reduction over belief states, and practical output-aware intrinsic estimator using semantic clustering via bidirectional textual entailment to compute information gain directly from model outputs.

Result: Outperforms strong retrieval-augmented baselines across seven question-answering benchmarks, achieving up to 5.4% average accuracy improvement.

Conclusion: Provides a theoretically grounded and scalable path toward agentic reasoning with retrieval through synthetic semantic information gain rewards.

Abstract: Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, but yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model’s belief states, establishing guarantees, including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model’s output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimxization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.

[758] Position: Human-Centric AI Requires a Minimum Viable Level of Human Understanding

Fangzhou Lin, Qianwen Ge, Lingyu Xu, Peiran Li, Xiangbo Gao, Shuo Xing, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhengzhong Tu

Main category: cs.AI

TL;DR: The paper introduces the Capability-Comprehension Gap and Cognitive Integrity Threshold to address how AI systems improve outcomes while eroding human understanding needed for oversight.

Details

Motivation: AI systems produce increasingly fluent, correct outcomes, but this erodes users' ability to explain, verify, or intervene, creating a divergence between AI capability and human comprehension that undermines oversight and accountability.

Method: Defines the Cognitive Integrity Threshold (CIT) as the minimum comprehension required for oversight, autonomy, and accountable participation under AI assistance. Operationalizes CIT through three dimensions: verification capacity, comprehension-preserving interaction, and institutional governance scaffolds.

Result: Proposes a framework for designing human-AI interaction that preserves cognitive sustainability in responsibility-critical settings, moving beyond traditional transparency approaches to address fundamental comprehension gaps.

Conclusion: Current approaches to transparency, user control, literacy, and governance are insufficient; a new design and governance agenda is needed to align human-AI interaction with cognitive sustainability and preserve human oversight capabilities.

Abstract: AI systems increasingly produce fluent, correct, end-to-end outcomes. Over time, this erodes users’ ability to explain, verify, or intervene. We define this divergence as the Capability-Comprehension Gap: a decoupling where assisted performance improves while users’ internal models deteriorate. This paper argues that prevailing approaches to transparency, user control, literacy, and governance do not define the foundational understanding humans must retain for oversight under sustained AI delegation. To formalize this, we define the Cognitive Integrity Threshold (CIT) as the minimum comprehension required to preserve oversight, autonomy, and accountable participation under AI assistance. CIT does not require full reasoning reconstruction, nor does it constrain automation. It identifies the threshold beyond which oversight becomes procedural and contestability fails. We operatinalize CIT through three functional dimensions: (i) verification capacity, (ii) comprehension-preserving interaction, and (iii) institutional scaffolds for governance. This motivates a design and governance agenda that aligns human-AI interaction with cognitive sustainability in responsibility-critical settings.

[759] Multi-Head Attention Is a Multi-Player Game

Kushal Chakrabarti, Nirmal Balachundar

Main category: cs.AI

TL;DR: The paper analyzes transformer attention as a multi-agent game, showing cross-entropy training creates implicit potential games among attention heads, leading to inefficient Nash equilibria with redundancy and hallucination problems.

Details

Motivation: Transformer attention mechanisms are internally multi-agent systems where attention heads compete and coordinate, but current training treats them as monolithic optimizers, creating a gap that leads to inefficiencies like redundancy and hallucinations.

Method: Formalizes attention heads as playing an implicit potential game under cross-entropy training, analyzes Price of Anarchy (PoA) bounded by Γ(G) (off-diagonal mass of head interaction matrix), and proposes GAME-LoRA combining Barlow Twins decorrelation with log-determinant coordination pressure.

Result: Γ(G) predicts hallucination (p<0.05), emergent coalitions show selective coordination, and GAME-LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation - a Pareto improvement.

Conclusion: Treating transformer attention as a multi-agent game reveals fundamental inefficiencies in current training, and game-theoretic regularization (GAME-LoRA) can significantly reduce hallucinations while maintaining performance.

Abstract: Modern transformer attention is internally multi-agent – heads compete and coordinate – yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by $Γ(G)$, the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both \emph{excess hallucination probability} and \emph{excess head redundancy} scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $Γ(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: $Γ(G)$ predicts hallucination ($p{<}0.05$), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation – a Pareto improvement inaccessible to methods ignoring the game structure.

[760] Foundation CAN LM: A Pretrained Language Model For Automotive CAN Data

Akiharu Esashi, Pawissanutt Lertpongrujikorn, Justin Makino, Yuibi Fujimoto, Mohsen Amini Salehi

Main category: cs.AI

TL;DR: A foundation model approach for CAN bus data that treats vehicular signals as language, enabling multi-task generalization across automotive applications through pretraining and fine-tuning.

Details

Motivation: Current CAN bus data pipelines use isolated task-specific models, preventing shared representation learning and cross-task generalization. The paper aims to apply the successful foundation model paradigm from NLP/CV to automotive CAN data.

Method: Treats CAN data as language, pretrains on large-scale unlabeled decoded CAN signals, proposes unified tokenization for mixed discrete-continuous signals, addresses temporal complexity and trip-specific variability, then fine-tunes across heterogeneous auto insurance tasks.

Result: Demonstrates that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm works for CAN data and establishing generalizable representation learning in automotive AI.

Conclusion: The foundation model approach successfully transfers from NLP/CV to CAN data, enabling shared representation learning and cross-task generalization in automotive applications, opening new directions for automotive AI.

Abstract: The Controller Area Network (CAN) bus provides a rich source of vehicular signals increasingly leveraged for applications in automotive and auto insurance domains, including collision detection, predictive maintenance, and driver risk modeling. Despite this potential, existing pipelines largely train isolated task-specific models on raw CAN data, with only limited efforts exploring decoded signals. Such fragmentation prevents shared representation learning and limits cross-task generalization. By contrast, natural language processing (NLP) and computer vision (CV) have been transformed by the foundation model paradigm: large-scale pretraining followed by task-specific adaptation. In this work, we introduce the foundation CAN model that demonstrates multi-objective downstream generalization using a single pretrained backbone. Our approach treats CAN data as a language: we pretrain on large-scale, unlabeled decoded CAN signals and fine-tune across heterogeneous auto insurance tasks. To enable this, we propose a unified tokenization scheme for mixed discrete-continuous signals and address challenges of temporal complexity and trip-specific variability. Our results show that one pretrained CAN model can adapt effectively to diverse predictive tasks, validating that the foundation modeling paradigm, proven in NLP and CV, also holds for CAN data. This establishes a new direction for generalizable representation learning in automotive AI.

[761] Beyond Output Critique: Self-Correction via Task Distillation

Hossein A. Rahmani, Mengting Wan, Pei Zhou, Longqi Yang, Nick Craswell, Emine Yilmaz, Sujay Kumar Jauhar

Main category: cs.AI

TL;DR: SELF-THOUGHT is a framework that improves LLM self-correction by adding task abstraction before solution refinement, enabling better reasoning correction and transferable abstractions across models.

Details

Motivation: Existing LLM self-correction approaches mainly fix surface errors but fail to correct deeper reasoning flaws. There's a need for more structured self-correction that addresses fundamental reasoning problems and enables knowledge transfer from larger to smaller models.

Method: Introduces an intermediate task abstraction step where the model distills tasks into structured templates capturing key variables, constraints, and problem structure. These abstractions guide solution instantiation and can be transferred across models, allowing larger models’ templates to help smaller models with self-correction.

Result: Experiments across diverse reasoning tasks show improved accuracy, robustness, and generalization for both large and small models. Smaller models achieve more reliable refinements without heavy fine-tuning or external verifiers.

Conclusion: SELF-THOUGHT offers a scalable path toward more reliable self-correcting language systems by addressing deeper reasoning flaws through structured task abstraction and enabling cross-model knowledge transfer.

Abstract: Large language models (LLMs) have shown promising self-correction abilities, where iterative refinement improves the quality of generated responses. However, most existing approaches operate at the level of output critique, patching surface errors while often failing to correct deeper reasoning flaws. We propose SELF-THOUGHT, a framework that introduces an intermediate step of task abstraction before solution refinement. Given an input and an initial response, the model first distills the task into a structured template that captures key variables, constraints, and problem structure. This abstraction then guides solution instantiation, grounding subsequent responses in a clearer understanding of the task and reducing error propagation. Crucially, we show that these abstractions can be transferred across models: templates generated by larger models can serve as structured guides for smaller LLMs, which typically struggle with intrinsic self-correction. By reusing distilled task structures, smaller models achieve more reliable refinements without heavy fine-tuning or reliance on external verifiers. Experiments across diverse reasoning tasks demonstrate that SELF-THOUGHT improves accuracy, robustness, and generalization for both large and small models, offering a scalable path toward more reliable self-correcting language systems.

[762] Synapse Compendium Aware Federated Knowledge Exchange for Tool Routed LLMs

Abhijit Chakraborty, Sandipan De, Yash Shah, Chahana Dahal, Vivek Gupta

Main category: cs.AI

TL;DR: Synapse is a federated learning framework for LLM-based agents that trains a shared global knowledge model of tool-usage behavior to improve collaborative learning while reducing communication costs.

Details

Motivation: Collaborative learning among LLM-based agents in federated settings faces challenges including high communication costs, data heterogeneity, and tool-usage diversity, which limit effectiveness and scalability.

Method: Synapse trains a shared global knowledge model of tool-usage behavior where client agents with fixed LLMs learn tool-usage patterns locally, transmit artifacts for federated aggregation through coordinators, and receive updated global tool compendiums. It uses templated representations, embedding retrieval with LLM reranking, and adaptive masking to maintain utility while limiting information leakage.

Result: Synapse improves tool-usage effectiveness and reduces communication overhead compared with weight or prompt-sharing approaches in multi-agent LLM systems, supporting heterogeneous data and quantifying performance improvements.

Conclusion: The Synapse framework enables effective collaborative learning for LLM-based agents in federated settings by addressing communication costs, data heterogeneity, and tool-usage challenges through a shared global knowledge model approach.

Abstract: Collaborative learning among LLM-based agents under federated learning faces challenges, including communication costs, heterogeneity in data, and tool-usage, limiting their effectiveness. We introduce Synapse, a framework that trains a shared global knowledge model of tool-usage behavior. Client agents with fixed LLMs learn tool-usage patterns locally, and transmit artifacts for federated aggregation through coordinators. A global tool compendium is updated and redistributed, enabling convergence toward stable tool selection. Synapse uses templated representations, embedding retrieval with LLM reranking, and adaptive masking to maintain utility while limiting information leakage. The framework supports heterogeneous data and quantifies performance improvements. Results show that Synapse improves tool-usage effectiveness and reduces communication overhead compared with weight or prompt-sharing approaches in multi-agent LLM systems.

[763] Supervised sparse auto-encoders as unconstrained feature models for semantic composition

Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

Main category: cs.AI

TL;DR: Supervised sparse auto-encoders for Stable Diffusion that learn sparse concept embeddings enabling compositional generalization and semantic image editing without prompt modification.

Details

Motivation: Address limitations of traditional sparse auto-encoders (SAEs) in mechanistic interpretability: non-smooth L1 penalty hindering reconstruction/scalability, and lack of alignment between learned features and human semantics.

Method: Adapt unconstrained feature models from neural collapse theory and supervise decoder-only SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights.

Result: Validated on Stable Diffusion 3.5, demonstrates compositional generalization (reconstructing images with unseen concept combinations) and enables feature-level intervention for semantic image editing without prompt modification.

Conclusion: Supervised sparse auto-encoders with concept embeddings overcome traditional SAE limitations and enable interpretable, controllable image generation through feature-level manipulation.

Abstract: Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models-a mathematical framework from neural collapse theory-and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.

[764] Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents

Zergham Ahmed, Kazuki Irie, Joshua B. Tenenbaum, Christopher J. Bates, Samuel J. Gershman

Main category: cs.AI

TL;DR: TheoryCoder-2 is a Theory-Based RL agent that uses LLMs to learn reusable abstractions from experience for hierarchical planning, outperforming baseline LLM agents in sample efficiency and solving complex tasks with minimal human prompts.

Details

Motivation: Current LLM agents and deep RL systems struggle with efficient planning and generalization across tasks, while existing Theory-Based RL systems like TheoryCoder rely heavily on human-provided abstractions rather than learning them autonomously.

Method: TheoryCoder-2 leverages LLMs’ in-context learning ability to actively learn reusable abstractions by synthesizing them from experience and integrating them into a hierarchical planning process, requiring only minimal human prompts.

Result: TheoryCoder-2 shows significantly better sample efficiency than baseline LLM agents with classical planning, reasoning-based planning, and prior program-synthesis agents like WorldCoder, solving complex tasks that baselines fail on.

Conclusion: The approach demonstrates that LLMs can effectively learn abstractions from experience for hierarchical planning, advancing Theory-Based RL by reducing reliance on hand-specified abstractions while maintaining strong generalization capabilities.

Abstract: Humans learn abstractions and use them to plan efficiently to quickly generalize across tasks – an ability that remains challenging for state-of-the-art large language model (LLM) agents and deep reinforcement learning (RL) systems. Inspired by the cognitive science of how people form abstractions and intuitive theories of their world knowledge, Theory-Based RL (TBRL) systems, such as TheoryCoder, exhibit strong generalization through effective use of abstractions. However, they heavily rely on human-provided abstractions and sidestep the abstraction-learning problem. We introduce TheoryCoder-2, a new TBRL agent that leverages LLMs’ in-context learning ability to actively learn reusable abstractions rather than relying on hand-specified ones, by synthesizing abstractions from experience and integrating them into a hierarchical planning process. We conduct experiments on diverse environments, including BabyAI, Minihack and VGDL games like Sokoban. We find that TheoryCoder-2 is significantly more sample-efficient than baseline LLM agents augmented with classical planning domain construction, reasoning-based planning, and prior program-synthesis agents such as WorldCoder. TheoryCoder-2 is able to solve complex tasks that the baselines fail, while only requiring minimal human prompts, unlike prior TBRL systems.

[765] The Keyhole Effect: Why Chat Interfaces Fail at Data Analysis

Mohan Reddy

Main category: cs.AI

TL;DR: Chat interfaces for AI-assisted data analysis create cognitive overload through five mechanisms, degrading analytical performance; eight hybrid design patterns are proposed to address these issues while preserving natural language for intent specification.

Details

Motivation: The paper argues that chat interfaces are suboptimal for multi-step, state-dependent analytical tasks due to cognitive limitations. Building on Woods' Keyhole Effect, the author identifies systematic problems with chat interfaces that degrade analytical performance through five specific cognitive mechanisms.

Method: The paper presents a theoretical framework analyzing cognitive limitations in chat interfaces, formalizes cognitive overload with a mathematical model (O = max(0, m - v - W)), and proposes eight hybrid design patterns to address these issues: Generative UI, Infinite Canvas, Deictic Interaction, State Rail, Ghost Layers, Mise en Place, Semantic Zoom, and Probabilistic UI.

Result: The framework identifies five cognitive bottlenecks in chat interfaces: (1) content displacement defeating spatial memory, (2) hidden state exceeding working memory, (3) verbalization degrading visual pattern recognition, (4) linear text blocking epistemic action, and (5) serialization penalties scaling with data dimensionality. The proposed design patterns target these specific issues while preserving natural language for intent specification.

Conclusion: Chat interfaces systematically degrade analytical performance through cognitive overload mechanisms. Well-scaffolded conversational systems with expert priors may reduce load for guided tasks, but the framework applies most strongly to open-ended exploration. The paper concludes with falsifiable hypotheses and experimental paradigms for empirical validation.

Abstract: Chat has become the default interface for AI-assisted data analysis. For multi-step, state-dependent analytical tasks, this is a mistake. Building on Woods (1984) Keyhole Effect, the cognitive cost of viewing large information spaces through narrow viewports, I show that chat interfaces systematically degrade analytical performance through five mechanisms: (1) constant content displacement defeats hippocampal spatial memory systems; (2) hidden state variables exceed working memory capacity (approximately 4 chunks under load); (3) forced verbalization triggers verbal overshadowing, degrading visual pattern recognition; (4) linear text streams block epistemic action and cognitive offloading; (5) serialization penalties scale with data dimensionality. I formalize cognitive overload as O = max(0, m - v - W) where m is task-relevant items, v is visible items, and W is working memory capacity. When O > 0, error probability increases and analytical biases (anchoring, confirmation, change blindness) amplify. Eight hybrid design patterns address these failures: Generative UI, Infinite Canvas, Deictic Interaction, State Rail, Ghost Layers, Mise en Place, Semantic Zoom, and Probabilistic UI. Each pattern targets specific cognitive bottlenecks while preserving natural language for intent specification and synthesis. Well-scaffolded conversational systems that encode expert priors may reduce load for guided tasks; the framework applies most strongly to open-ended exploration. The paper concludes with falsifiable hypotheses and experimental paradigms for empirical validation.

[766] MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support

António Farinhas, Nuno M. Guerreiro, José Pombal, Pedro Henrique Martins, Laura Melton, Alex Conway, Cara Dochat, Maya D’Eon, Ricardo Rei

Main category: cs.AI

TL;DR: MindGuard: Clinically-grounded safety classifiers for mental health LLMs that distinguish therapeutic disclosures from genuine crises using expert-developed risk taxonomy and synthetic training data.

Details

Motivation: Current LLM safety measures fail to differentiate between therapeutic disclosures and genuine clinical crises in mental health applications, leading to inappropriate responses and safety failures.

Method: Developed clinical risk taxonomy with PhD psychologists, created MindGuard-testset with expert-annotated conversations, trained lightweight safety classifiers (4B/8B parameters) using synthetic dialogues from controlled two-agent setup.

Result: MindGuard reduces false positives at high-recall points, lowers attack success and harmful engagement rates in adversarial multi-turn interactions when paired with clinician language models.

Conclusion: Clinically-grounded safety classifiers outperform general-purpose safeguards for mental health LLMs, enabling safer therapeutic interactions while maintaining appropriate crisis detection.

Abstract: Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.

[767] R-HTN: Rebellious Online HTN Planning for Safety and Game AI

Hector Munoz-Avila, David W. Aha, Paola Rizzo

Main category: cs.AI

TL;DR: Online HTN planning agents that can intelligently disobey user tasks when they conflict with built-in safety/personality directives, with both nonadaptive (stop) and adaptive (replan) variants.

Details

Motivation: To create agents that can intelligently disobey user-assigned tasks when those tasks conflict with built-in directives for safety or personality reasons, combining HTN planning with online planning and directive considerations.

Method: Proposes R-HTN (Rebellious-HTN) algorithm for online HTN planning under directives D. Two agent variants: Nonadaptive (stops execution if violating D) and Adaptive (modifies HTN plan to find alternative ways to achieve tasks while respecting D).

Result: R-HTN agents never violate directives and aim to achieve user-given goals if feasible, though not necessarily as the user expected. Evaluated in task domains where agents must not violate directives for safety or personality reasons.

Conclusion: The approach successfully creates agents capable of intelligent disobedience while respecting built-in directives, with adaptive agents being more flexible in finding alternative solutions to user tasks.

Abstract: We introduce online Hierarchical Task Network (HTN) agents whose behaviors are governed by a set of built-in directives \D. Like other agents that are capable of rebellion (i.e., {\it intelligent disobedience}), our agents will, under some conditions, not perform a user-assigned task and instead act in ways that do not meet a user’s expectations. Our work combines three concepts: HTN planning, online planning, and the directives \D, which must be considered when performing user-assigned tasks. We investigate two agent variants: (1) a Nonadaptive agent that stops execution if it finds itself in violation of \D~ and (2) an Adaptive agent that, in the same situation, instead modifies its HTN plan to search for alternative ways to achieve its given task. We present R-HTN (for: Rebellious-HTN), a general algorithm for online HTN planning under directives \D. We evaluate R-HTN in two task domains where the agent must not violate some directives for safety reasons or as dictated by their personality traits. We found that R-HTN agents never violate directives, and aim to achieve the user-given goals if feasible though not necessarily as the user expected.

[768] Small-Margin Preferences Still Matter-If You Train Them Right

Jinlong Pang, Zhaowei Zhu, Na Di, Yichi Zhang, Yaxuan Wang, Chen Qian, Yang Liu

Main category: cs.AI

TL;DR: MixDPO: A difficulty-aware training strategy that routes easy preference pairs to DPO loss and difficult pairs to supervised fine-tuning, improving alignment over standard DPO.

Details

Motivation: Current preference optimization methods like DPO are sensitive to preference pair quality, with difficult (ambiguous) pairs often filtered out as noise. However, these difficult pairs still contain useful supervision signals when used with SFT rather than preference losses.

Method: MixDPO orders preference data by difficulty (easy to hard based on margin) and applies a hybrid approach: easy pairs use DPO preference loss, while difficult pairs are routed to supervised fine-tuning (SFT) objective.

Result: Across three LLM-judge benchmarks, MixDPO consistently improves alignment over DPO and other variants, with particularly strong gains on AlpacaEval~2 length-controlled win rate.

Conclusion: Difficult preference pairs contain valuable supervision but destabilize preference-based losses; MixDPO’s hybrid approach effectively leverages all data by matching objective to pair difficulty.

Abstract: Preference optimization methods such as DPO align large language models (LLMs) using paired comparisons, but their effectiveness can be highly sensitive to the quality and difficulty of preference pairs. A common heuristic treats small-margin (ambiguous) pairs as noisy and filters them out. In this paper, we revisit this assumption and show that pair difficulty interacts strongly with the optimization objective: when trained with preference-based losses, difficult pairs can destabilize training and harm alignment, yet these same pairs still contain useful supervision signals when optimized with supervised fine-tuning (SFT). Motivated by this observation, we propose MixDPO, a simple yet effective difficulty-aware training strategy that (i) orders preference data from easy to hard (a curriculum over margin-defined difficulty), and (ii) routes difficult pairs to an SFT objective while applying a preference loss to easy pairs. This hybrid design provides a practical mechanism to leverage ambiguous pairs without incurring the optimization failures often associated with preference losses on low-margin data. Across three LLM-judge benchmarks, MixDPO consistently improves alignment over DPO and a range of widely-used variants, with particularly strong gains on AlpacaEval~2 length-controlled (LC) win rate.

[769] Reasoning and Tool-use Compete in Agentic RL:From Quantifying Interference to Disentangled Tuning

Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang

Main category: cs.AI

TL;DR: DART framework decouples reasoning and tool-use parameter updates in agentic RL to address training interference, outperforming joint training baselines.

Details

Motivation: Existing Agentic Reinforcement Learning (ARL) methods train single shared models for both reasoning and tool use, assuming joint training improves performance, but this assumption lacks empirical validation and may cause interference.

Method: Introduces Linear Effect Attribution System (LEAS) to quantify interference, then proposes Disentangled Action Reasoning Tuning (DART) - a framework using separate low-rank adaptation modules to decouple parameter updates for reasoning and tool-use.

Result: DART consistently outperforms baseline methods with average 6.35% improvements and achieves performance comparable to multi-agent systems using a single model.

Conclusion: Joint training of reasoning and tool-use in ARL causes interference; explicit decoupling via DART improves performance, challenging the prevailing ARL paradigm.

Abstract: Agentic Reinforcement Learning (ARL) focuses on training large language models (LLMs) to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single shared model parameters to support both reasoning and tool use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically investigate this assumption by introducing a Linear Effect Attribution System(LEAS), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action Reasoning Tuning(DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool-use via separate low-rank adaptation modules. Experimental results show that DART consistently outperforms baseline methods with averaged 6.35 percent improvements and achieves performance comparable to multi-agent systems that explicitly separate tool-use and reasoning using a single model.

[770] Error Taxonomy-Guided Prompt Optimization

Mayank Singh, Vikas Yadav, Eduardo Blanco

Main category: cs.AI

TL;DR: ETGPO is a top-down prompt optimization method that categorizes model errors into a taxonomy and augments prompts with guidance targeting frequent failure modes, achieving comparable or better performance with significantly lower compute costs.

Details

Motivation: Existing prompt optimization methods often use trial-and-error approaches that consume substantial compute and operate in a bottom-up manner, losing global perspective on failure patterns.

Method: Error Taxonomy-Guided Prompt Optimization (ETGPO) collects model errors, categorizes them into a taxonomy, and augments prompts with guidance targeting the most frequent failure modes in a top-down approach.

Result: Across multiple benchmarks in mathematics, question answering, and logical reasoning, ETGPO achieves comparable or better accuracy than state-of-the-art methods while using roughly one third of the optimization-phase token usage and evaluation budget.

Conclusion: ETGPO provides an efficient top-down approach to prompt optimization that leverages error taxonomies to target frequent failure modes, offering significant compute savings while maintaining or improving performance.

Abstract: Automatic Prompt Optimization (APO) is a powerful approach for extracting performance from large language models without modifying their weights. Many existing methods rely on trial-and-error, testing different prompts or in-context examples until a good configuration emerges, often consuming substantial compute. Recently, natural language feedback derived from execution logs has shown promise as a way to identify how prompts can be improved. However, most prior approaches operate in a bottom-up manner, iteratively adjusting the prompt based on feedback from individual problems, which can cause them to lose the global perspective. In this work, we propose Error Taxonomy-Guided Prompt Optimization (ETGPO), a prompt optimization algorithm that adopts a top-down approach. ETGPO focuses on the global failure landscape by collecting model errors, categorizing them into a taxonomy, and augmenting the prompt with guidance targeting the most frequent failure modes. Across multiple benchmarks spanning mathematics, question answering, and logical reasoning, ETGPO achieves accuracy that is comparable to or better than state-of-the-art methods, while requiring roughly one third of the optimization-phase token usage and evaluation budget.

[771] How RLHF Amplifies Sycophancy

Itai Shapira, Gerdus Benade, Ariel D. Procaccia

Main category: cs.AI

TL;DR: Analysis of how preference-based alignment increases sycophantic behavior in LLMs, with formal mechanism and training intervention

Details

Motivation: Large language models show increased sycophantic behavior after preference-based alignment, where they affirm user beliefs even when incorrect. The paper aims to formally analyze how human feedback alignment amplifies this failure mode.

Method: Presents formal analysis linking optimization against learned reward to bias in human preference data. Analyzes reward learning from pairwise comparisons under random utility models. Proposes training-time intervention to neutralize amplification mechanism.

Result: Shows reward gaps are common and cause behavioral drift in all configurations considered. Characterizes unique policy closest to unconstrained post-trained policy that prevents sycophantic behavior increase.

Conclusion: Preference-based alignment can amplify sycophantic behavior through reward gaps. Proposes agreement penalty intervention to mitigate this issue while maintaining alignment objectives.

Abstract: Large language models often exhibit increased sycophantic behavior after preference-based post-training, showing a stronger tendency to affirm a user’s stated or implied belief even when this conflicts with factual accuracy or sound judgment. We present a formal analysis of how alignment from human feedback can increase this failure mode by identifying an explicit amplification mechanism that causally links optimization against a learned reward to bias in the human preference data used for alignment. We show that the direction of behavioral drift is determined by a covariance under the base policy between endorsing the belief signal in the prompt and the learned reward, and that the first-order effect reduces to a simple mean-gap condition. We then analyze reward learning from pairwise comparisons under random utility models like Bradley-Terry and characterize when bias in human annotators’ preferences induces this reward gap. Next, we propose a training-time intervention designed to neutralize the amplification mechanism itself. Among all post-trained policies that prevent sycophantic behavior from increasing, we characterize the unique policy closest in KL divergence to the unconstrained post-trained policy, and derive the corresponding minimal reward correction as a closed-form agreement penalty. Computational experiments find that reward gaps are common and cause behavioral drift in all the configurations considered.

[772] HalluHard: A Hard Multi-Turn Hallucination Benchmark

Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, Maksym Andriushchenko

Main category: cs.AI

TL;DR: HalluHard: A challenging multi-turn hallucination benchmark with 950 questions across legal, research, medical, and coding domains, requiring inline citations for factual assertions, with evaluation via web search evidence retrieval.

Details

Motivation: LLMs produce plausible but ungrounded factual claims, especially problematic in multi-turn dialogue where early errors cascade. Need for rigorous evaluation of hallucinations in high-stakes domains.

Method: Created HalluHard benchmark with 950 seed questions across four domains. Operationalized groundedness via inline citations. Developed judging pipeline that iteratively retrieves evidence via web search, fetches/filters/parses full-text sources to assess citation support.

Result: Hallucinations remain substantial even with web search (~30% for strongest configuration, Opus-4.5 with web search). Content-grounding errors persist at high rates. Hallucination behavior shaped by model capacity, turn position, effective reasoning, and knowledge type.

Conclusion: Hallucination remains a significant problem in LLMs even with retrieval augmentation. The HalluHard benchmark provides rigorous evaluation for groundedness in multi-turn dialogue across high-stakes domains.

Abstract: Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce $\textbf{HalluHard}$, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30%$ for the strongest configuration, Opus-4.5 with web search), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.

[773] Discovering Process-Outcome Credit in Multi-Step LLM Reasoning

Xiangwei Wang, Wei Wang, Ken Chen, Nanduni Nimalsiri, Saman Halgamuge

Main category: cs.AI

TL;DR: A novel RL framework for LLMs that provides continuous reward signals through step-wise marginal information gain and decoupled masking strategy, improving reasoning capabilities with better credit assignment and sample efficiency.

Details

Motivation: Standard RL approaches for enhancing LLM reasoning suffer from reward sparsity and inefficient credit assignment, making training noisy and sample inefficient.

Method: Proposes Step-wise Marginal Information Gain (MIG) mechanism to quantify intrinsic value of reasoning steps against a Monotonic Historical Watermark; implements Decoupled Masking Strategy for disentangled credit distribution (process rewards to CoT, outcome rewards to full completion); incorporates Dual-Gated SFT objective for stable training.

Result: Outperforms baselines like GRPO in sample efficiency and final accuracy across textual (MATH) and multi-modal (Super-CLEVR) benchmarks; exhibits superior out-of-distribution robustness and promising zero-shot transfer to unseen reasoning tasks.

Conclusion: The proposed framework effectively addresses reward sparsity and credit assignment issues in RL for LLM reasoning, demonstrating strong performance across diverse reasoning tasks with improved efficiency and generalization.

Abstract: Reinforcement Learning (RL) serves as a potent paradigm for enhancing reasoning capabilities in Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a novel framework designed to provide continuous reward signals, which introduces a Step-wise Marginal Information Gain (MIG) mechanism that quantifies the intrinsic value of reasoning steps against a Monotonic Historical Watermark, effectively filtering out training noise. To ensure disentangled credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments across textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, demonstrating promising zero-shot transfer capabilities to unseen and challenging reasoning tasks.

[774] SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

Chenyi Li, Yuan Zhang, Bo Wang, Guoqing Ma, Wei Tang, Haoyang Huang, Nan Duan

Main category: cs.AI

TL;DR: Proposes a diversity-aware reinforcement learning method for LLMs that uses kernelized similarity and marginal contributions to maintain solution diversity while improving reasoning performance.

Details

Motivation: Reinforcement learning with verifiable rewards improves LLM reasoning but reduces outcome diversity, causing models to concentrate on narrow solution sets. The paper aims to address this diversity loss while maintaining performance gains.

Method: Introduces a set-level diversity objective using kernelized similarity over sampled trajectories, derives leave-one-out marginal contributions for each trajectory, and integrates this as a plug-in advantage shaping term for policy optimization. Also analyzes single trajectory contributions within a distribution perturbation framework.

Result: Extensive experiments across various model scales show the method consistently outperforms strong baselines in both Pass@1 and Pass@K across multiple benchmarks, effectively balancing performance and diversity.

Conclusion: The proposed diversity-aware reinforcement learning approach successfully maintains solution diversity while improving reasoning performance in LLMs, with theoretical analysis confirming that rarer trajectories contribute more to global diversity.

Abstract: Reinforcement learning with verifiable rewards has shown notable effectiveness in enhancing large language models (LLMs) reasoning performance, especially in mathematics tasks. However, such improvements often come with reduced outcome diversity, where the model concentrates probability mass on a narrow set of solutions. Motivated by diminishing-returns principles, we introduce a set level diversity objective defined over sampled trajectories using kernelized similarity. Our approach derives a leave-one-out marginal contribution for each sampled trajectory and integrates this objective as a plug-in advantage shaping term for policy optimization. We further investigate the contribution of a single trajectory to language model diversity within a distribution perturbation framework. This analysis theoretically confirms a monotonicity property, proving that rarer trajectories yield consistently higher marginal contributions to the global diversity. Extensive experiments across a range of model scales demonstrate the effectiveness of our proposed algorithm, consistently outperforming strong baselines in both Pass@1 and Pass@K across various benchmarks.

[775] ConvexBench: Can LLMs Recognize Convex Functions?

Yepeng Liu, Yu Huang, Yu-Xiang Wang, Yingbin Liang, Yuheng Bu

Main category: cs.AI

TL;DR: A benchmark for testing LLMs’ ability to identify convexity of symbolic objectives under deep functional composition, revealing compositional reasoning gaps that degrade with depth.

Details

Motivation: As LLMs automate research-level math and sciences, it's important to test their ability to understand and reason with convexity, a fundamental concept in convex analysis with many applications.

Method: Introduces a scalable, mechanically verifiable benchmark to test LLMs’ ability to identify convexity under deep functional composition. Proposes an agentic divide-and-conquer framework that offloads parsing to external tools to construct ASTs and enforces recursive reasoning over intermediate sub-expressions.

Result: Frontier LLMs show sharp compositional reasoning gaps: performance drops from F1-score of 1.0 at depth 2 to ~0.2 at depth 100. The proposed framework reliably mitigates deep-composition failures, achieving F1-score of 1.0 at depth 100.

Conclusion: Current LLMs struggle with deep compositional reasoning for convexity identification, but agentic frameworks with external parsing tools and recursive reasoning can effectively address these limitations.

Abstract: Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce \cb, a scalable and mechanically verifiable benchmark for testing \textit{whether LLMs can identify the convexity of a symbolic objective under deep functional composition.} Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of $1.0$ at depth $2$ to approximately $0.2$ at depth $100$. Inspection of models’ reasoning traces indicates two failure modes: \textit{parsing failure} and \textit{lazy reasoning}. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-Score $= 1.0$ at depth $100$).

[776] AutoHealth: An Uncertainty-Aware Multi-Agent System for Autonomous Health Data Modeling

Tong Xia, Weibin Li, Gang Liu, Yong Li

Main category: cs.AI

TL;DR: AutoHealth: An uncertainty-aware multi-agent system for autonomous health data modeling with specialized agents for data exploration, model construction, training, and optimization, focusing on both predictive performance and uncertainty quantification.

Details

Motivation: LLM-based agents show promise for autonomous ML but struggle with heterogeneous health data modalities, rely too much on predefined templates, and lack uncertainty estimation crucial for healthcare decision-making.

Method: Proposes AutoHealth with five specialized agents in closed-loop coordination: data exploration, task-conditioned model construction, training, optimization, and uncertainty quantification agents working together.

Result: Outperforms state-of-the-art baselines by 29.2% in prediction performance and 50.2% in uncertainty estimation on a benchmark of 17 tasks across diverse data modalities and learning settings.

Conclusion: AutoHealth successfully addresses limitations of existing systems by providing uncertainty-aware autonomous modeling for health data with comprehensive reporting for trustworthy interpretation.

Abstract: LLM-based agents have demonstrated strong potential for autonomous machine learning, yet their applicability to health data remains limited. Existing systems often struggle to generalize across heterogeneous health data modalities, rely heavily on predefined solution templates with insufficient adaptation to task-specific objectives, and largely overlook uncertainty estimation, which is essential for reliable decision-making in healthcare. To address these challenges, we propose \textit{AutoHealth}, a novel uncertainty-aware multi-agent system that autonomously models health data and assesses model reliability. \textit{AutoHealth} employs closed-loop coordination among five specialized agents to perform data exploration, task-conditioned model construction, training, and optimization, while jointly prioritizing predictive performance and uncertainty quantification. Beyond producing ready-to-use models, the system generates comprehensive reports to support trustworthy interpretation and risk-aware decision-making. To rigorously evaluate its effectiveness, we curate a challenging real-world benchmark comprising 17 tasks across diverse data modalities and learning settings. \textit{AutoHealth} completes all tasks and outperforms state-of-the-art baselines by 29.2% in prediction performance and 50.2% in uncertainty estimation.

[777] EvoOpt-LLM: Evolving industrial optimization models with large language models

Yiliu He, Tianle Li, Binghao Ji, Zhiyuan Liu, Di Huang

Main category: cs.AI

TL;DR: EvoOpt-LLM is an LLM-based framework for industrial optimization modeling that automates MILP model construction, constraint injection, and variable pruning with high data efficiency.

Details

Motivation: Industrial optimization modeling via MILP is expertise-intensive and difficult to maintain under evolving business rules. Existing LLM methods have low data efficiency, limited solver validity, and poor scalability to industrial problems.

Method: Built on a 7B-parameter LLM with LoRA fine-tuning, EvoOpt-LLM supports automated model construction, dynamic business-constraint injection, and end-to-end variable pruning with minimal training data (3,000 samples).

Result: Achieves 91% generation rate and 65.9% executability rate with only 3,000 training samples. Constraint injection preserves original objectives, and variable pruning achieves ~0.56 F1 score on medium-sized LP models with 400 samples.

Conclusion: EvoOpt-LLM demonstrates a practical, data-efficient approach to industrial optimization modeling that reduces expert intervention while improving adaptability and solver efficiency.

Abstract: Optimization modeling via mixed-integer linear programming (MILP) is fundamental to industrial planning and scheduling, yet translating natural-language requirements into solver-executable models and maintaining them under evolving business rules remains highly expertise-intensive. While large language models (LLMs) offer promising avenues for automation, existing methods often suffer from low data efficiency, limited solver-level validity, and poor scalability to industrial-scale problems. To address these challenges, we present EvoOpt-LLM, a unified LLM-based framework supporting the full lifecycle of industrial optimization modeling, including automated model construction, dynamic business-constraint injection, and end-to-end variable pruning. Built on a 7B-parameter LLM and adapted via parameter-efficient LoRA fine-tuning, EvoOpt-LLM achieves a generation rate of 91% and an executability rate of 65.9% with only 3,000 training samples, with critical performance gains emerging under 1,500 samples. The constraint injection module reliably augments existing MILP models while preserving original objectives, and the variable pruning module enhances computational efficiency, achieving an F1 score of ~0.56 on medium-sized LP models with only 400 samples. EvoOpt-LLM demonstrates a practical, data-efficient approach to industrial optimization modeling, reducing reliance on expert intervention while improving adaptability and solver efficiency.

[778] MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

Takahito Nakajima

Main category: cs.AI

TL;DR: MedBeads proposes an agent-native data infrastructure using immutable “Beads” in a Merkle DAG to address context mismatch in clinical AI agents, enabling deterministic graph traversal instead of probabilistic RAG.

Details

Motivation: Current EMRs and FHIR standards are designed for human review, creating a "Context Mismatch" where AI agents receive fragmented data and must use probabilistic inference (RAG) to reconstruct patient history, leading to hallucinations and poor auditability.

Method: MedBeads uses immutable “Beads” as nodes in a Merkle Directed Acyclic Graph (DAG) that cryptographically reference causal predecessors. Implemented with Go Core Engine, Python middleware for LLM integration, React visualization, and BFS Context Retrieval algorithm for O(V+E) complexity traversal.

Result: Successfully implemented workflow with synthetic data, converting FHIR resources to causally-linked graph. Tamper-evidence guaranteed by design (any modification breaks cryptographic chain), BFS enables real-time decision support, and visualization aids clinician understanding.

Conclusion: MedBeads addresses context mismatch by shifting from probabilistic search to deterministic graph traversal and from mutable records to immutable chains, providing substrate for “Trustworthy Medical AI” with deterministic, tamper-evident context for LLM interpretation.

Abstract: Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous “Clinical Agents” remains limited. Current Electronic Medical Records (EMRs) and standards like FHIR are designed for human review, creating a “Context Mismatch”: AI agents receive fragmented data and must rely on probabilistic inference (e.g., RAG) to reconstruct patient history. This approach causes hallucinations and hinders auditability. Methods: We propose MedBeads, an agent-native data infrastructure where clinical events are immutable “Beads”–nodes in a Merkle Directed Acyclic Graph (DAG)–cryptographically referencing causal predecessors. This “write-once, read-many” architecture makes tampering mathematically detectable. We implemented a prototype with a Go Core Engine, Python middleware for LLM integration, and a React-based visualization interface. Results: We successfully implemented the workflow using synthetic data. The FHIR-to-DAG conversion transformed flat resources into a causally-linked graph. Our Breadth-First Search (BFS) Context Retrieval algorithm traverses relevant subgraphs with O(V+E) complexity, enabling real-time decision support. Tamper-evidence is guaranteed by design: any modification breaks the cryptographic chain. The visualization aids clinician understanding through explicit causal links. Conclusion: MedBeads addresses the “Context Mismatch” by shifting from probabilistic search to deterministic graph traversal, and from mutable records to immutable chains, providing the substrate for “Trustworthy Medical AI.” It guarantees the context the AI receives is deterministic and tamper-evident, while the LLM determines interpretation. The structured Bead format serves as a token-efficient “AI-native language.” We release MedBeads as open-source software to accelerate agent-native data standards.

[779] Hard Constraints Meet Soft Generation: Guaranteed Feasibility for LLM-based Combinatorial Optimization

Yang Liu, Chuan Zhou, Yancheng Chen, Shuai Zhang, Xixun Lin, Xiaoqing Wang

Main category: cs.AI

TL;DR: FALCON is a framework that ensures 100% feasibility for LLMs solving combinatorial optimization problems through grammar-constrained decoding, feasibility repair, and adaptive sampling, with a novel training method called BOPO.

Details

Motivation: LLMs show promise for combinatorial optimization but lack mechanisms to guarantee solution feasibility, which is critical for real-world deployment. Current approaches often produce infeasible solutions that violate constraints.

Method: Three key innovations: (1) grammar-constrained decoding for syntactic validity, (2) feasibility repair layer to correct semantic constraint violations, (3) adaptive Best-of-N sampling for efficient inference. Training uses Best-anchored Objective-guided Preference Optimization (BOPO) which weights preference pairs by objective gap.

Result: Across seven NP-hard combinatorial optimization problems, FALCON achieves perfect feasibility while matching or exceeding solution quality of state-of-the-art neural and LLM-based solvers. Theoretical convergence proofs for BOPO and bounds on repair-induced quality loss.

Conclusion: FALCON provides a practical framework for deploying LLMs in combinatorial optimization with guaranteed feasibility, addressing a critical limitation of current LLM-based approaches.

Abstract: Large language models (LLMs) have emerged as promising general-purpose solvers for combinatorial optimization (CO), yet they fundamentally lack mechanisms to guarantee solution feasibility which is critical for real-world deployment. In this work, we introduce FALCON, a framework that ensures 100% feasibility through three key innovations: (i) \emph{grammar-constrained decoding} enforces syntactic validity, (ii) a \emph{feasibility repair layer} corrects semantic constraint violations, and (iii) \emph{adaptive Best-of-$N$ sampling} allocates inference compute efficiently. To train the underlying LLM, we introduce the Best-anchored Objective-guided Preference Optimization (BOPO) in LLM training, which weights preference pairs by their objective gap, providing dense supervision without human labels. Theoretically, we prove convergence for BOPO and provide bounds on repair-induced quality loss. Empirically, across seven NP-hard CO problems, FALCON achieves perfect feasibility while matching or exceeding the solution quality of state-of-the-art neural and LLM-based solvers.

[780] Probing RLVR training instability through the lens of objective-level hacking

Yiming Dong, Kun Fu, Haoyu Li, Xinyuan Zhu, Yurou Liu, Lijing Shao, Jieping Ye, Zheng Wang

Main category: cs.AI

TL;DR: RLVR training instability in MoE models is caused by objective-level hacking from token-level credit misalignment, not reward hacking from exploitable verifiers.

Details

Motivation: RLVR drives continuous improvements in LLM reasoning but suffers from training instability, especially in MoE architectures, with unclear underlying causes.

Method: Introduces a principled framework for understanding RLVR instability through objective-level hacking, analyzing token-level credit misalignment and system-level spurious signals, with experiments on a 30B MoE model.

Result: Traces the origin and formalizes the mechanism behind abnormal growth of training-inference discrepancy in MoE models, providing a causal account of training dynamics underlying instability.

Conclusion: Findings offer guidance for designing stable RLVR algorithms by understanding objective-level hacking mechanisms in MoE architectures.

Abstract: Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.

[781] Transforming Vehicle Diagnostics: A Multimodal Approach to Error Patterns Prediction

Hugo Math, Rainer Lienhart

Main category: cs.AI

TL;DR: BiCarFormer: A multimodal bidirectional Transformer model that integrates diagnostic trouble codes (DTCs) with environmental sensor data for improved vehicle malfunction classification.

Details

Motivation: Current vehicle diagnostic systems rely only on diagnostic trouble codes (DTCs) but ignore valuable contextual environmental data (temperature, humidity, pressure) that domain experts use for failure classification. Real-world data is complex and noisy, creating challenges for accurate vehicle malfunction prediction.

Method: BiCarFormer is a bidirectional Transformer model designed for vehicle event sequences. It uses embedding fusions and a co-attention mechanism to capture relationships between DTC sequences and environmental sensor data for multi-label sequence classification of error codes into error patterns.

Result: Experimental results on a real-world automotive dataset with 22,137 error codes and 360 error patterns show BiCarFormer significantly outperforms models using only DTC sequences and traditional sequence models in classification performance.

Conclusion: Incorporating contextual environmental information with diagnostic codes enables more accurate and robust vehicle diagnostics, reducing maintenance costs and enhancing automation in the automotive industry.

Abstract: Accurately diagnosing and predicting vehicle malfunctions is crucial for maintenance and safety in the automotive industry. While modern diagnostic systems primarily rely on sequences of vehicular Diagnostic Trouble Codes (DTCs) registered in On-Board Diagnostic (OBD) systems, they often overlook valuable contextual information such as raw sensory data (e.g., temperature, humidity, and pressure). This contextual data, crucial for domain experts to classify vehicle failures, introduces unique challenges due to its complexity and the noisy nature of real-world data. This paper presents BiCarFormer: the first multimodal approach to multi-label sequence classification of error codes into error patterns that integrates DTC sequences and environmental conditions. BiCarFormer is a bidirectional Transformer model tailored for vehicle event sequences, employing embedding fusions and a co-attention mechanism to capture the relationships between diagnostic codes and environmental data. Experimental results on a challenging real-world automotive dataset with 22,137 error codes and 360 error patterns demonstrate that our approach significantly improves classification performance compared to models that rely solely on DTC sequences and traditional sequence models. This work highlights the importance of incorporating contextual environmental information for more accurate and robust vehicle diagnostics, hence reducing maintenance costs and enhancing automation processes in the automotive industry.

[782] Lyapunov Stability-Aware Stackelberg Game for Low-Altitude Economy: A Control-Oriented Pruning-Based DRL Approach

Yue Zhong, Jiawen Kang, Yongju Tong, Hong-Ning Dai, Dong In Kim, Abbas Jamalipour, Shengli Xie

Main category: cs.AI

TL;DR: A closed-loop Sensing-Communication-Computing-Control framework for UAV networks that maps communication latency to control stability using Lyapunov theory, with game-theoretic resource allocation and a lightweight pruning-based PPO algorithm.

Details

Motivation: The low-altitude economy expansion requires UAVs to support diverse services with conflicting requirements: limited onboard resources vs. stringent stability needs for latency-sensitive critical missions. Traditional throughput-centric designs fail to address the impact of communication latency on physical control stability.

Method: 1) Lyapunov stability theory to map control system state evolution to communication constraints; 2) Stackelberg game formulation for resource allocation with UAVs as leaders pricing resources and users as followers optimizing requests; 3) Novel lightweight pruning-based PPO algorithm with dynamic structured pruning to compress neural networks for energy-constrained edge platforms.

Result: Simulation results show the proposed scheme effectively secures control loop stability while maximizing system utility in dynamic low-altitude environments. The pruning-based PPO enables rapid game equilibrium approximation with minimal inference latency.

Conclusion: The closed-loop framework successfully transforms abstract stability requirements into quantifiable resource boundaries, enabling reliable mission execution in UAV networks through game-theoretic resource allocation and efficient edge AI algorithms.

Abstract: With the rapid expansion of the low-altitude economy, Unmanned Aerial Vehicles (UAVs) serve as pivotal aerial base stations supporting diverse services from users, ranging from latency-sensitive critical missions to bandwidth-intensive data streaming. However, the efficacy of such heterogeneous networks is often compromised by the conflict between limited onboard resources and stringent stability requirements. Moving beyond traditional throughput-centric designs, we propose a Sensing-Communication-Computing-Control closed-loop framework that explicitly models the impact of communication latency on physical control stability. To guarantee mission reliability, we leverage the Lyapunov stability theory to derive an intrinsic mapping between the state evolution of the control system and communication constraints, transforming abstract stability requirements into quantifiable resource boundaries. Then, we formulate the resource allocation problem as a Stackelberg game, where UAVs (as leaders) dynamically price resources to balance load and ensure stability, while users (as followers) optimize requests based on service urgency. Furthermore, addressing the prohibitive computational overhead of standard Deep Reinforcement Learning (DRL) on energy-constrained edge platforms, we propose a novel and lightweight pruning-based Proximal Policy Optimization (PPO) algorithm. By integrating a dynamic structured pruning mechanism, the proposed algorithm significantly compresses the neural network scale during training, enabling the UAV to rapidly approximate the game equilibrium with minimal inference latency. Simulation results demonstrate that the proposed scheme effectively secures control loop stability while maximizing system utility in dynamic low-altitude environments.

[783] PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

Sidharth Pulipaka, Oliver Chen, Manas Sharma, Taaha S Bajwa, Vyas Raina, Ivaxi Sheth

Main category: cs.AI

TL;DR: PersistBench: A benchmark to measure safety risks in conversational AI systems with long-term memory, identifying cross-domain leakage and memory-induced sycophancy as key vulnerabilities.

Details

Motivation: While long-term memory in conversational assistants enhances personalization (e.g., remembering user preferences), the persistence of memories introduces overlooked safety risks that need systematic measurement and mitigation.

Method: Introduces PersistBench to evaluate safety risks in LLMs with long-term memory, specifically measuring cross-domain leakage (inappropriate context injection from memories) and memory-induced sycophancy (reinforcement of user biases). Evaluates 18 frontier and open-source LLMs on this benchmark.

Result: Reveals surprisingly high failure rates: median 53% on cross-domain samples and 97% on sycophancy samples across evaluated LLMs, demonstrating significant safety vulnerabilities in current systems.

Conclusion: Long-term memory in conversational systems introduces serious safety risks that current LLMs are poorly equipped to handle. PersistBench provides a framework for developing more robust and safer memory usage in conversational AI.

Abstract: Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.g., the user is vegetarian, can enhance personalization in future conversations. However, the same persistence can also introduce safety risks that have been largely overlooked. Hence, we introduce PersistBench to measure the extent of these safety risks. We identify two long-term memory-specific risks: cross-domain leakage, where LLMs inappropriately inject context from the long-term memories; and memory-induced sycophancy, where stored long-term memories insidiously reinforce user biases. We evaluate 18 frontier and open-source LLMs on our benchmark. Our results reveal a surprisingly high failure rate across these LLMs - a median failure rate of 53% on cross-domain samples and 97% on sycophancy samples. To address this, our benchmark encourages the development of more robust and safer long-term memory usage in frontier conversational systems.

[784] Capabilities and Fundamental Limits of Latent Chain-of-Thought

Jiaxuan Zou, Yaozhong Xiong, Yong Liu

Main category: cs.AI

TL;DR: Latent CoT models show inconsistent performance due to decisional certainty trade-off: high certainty enables precise execution but inhibits exploration, while low certainty facilitates search but causes error accumulation.

Details

Motivation: The paper aims to explain puzzling performance inconsistencies in Latent Chain-of-Thought models, which excel at exploration tasks but fail at computation tasks, by investigating the underlying mechanisms governing this trade-off.

Method: The authors theoretically characterize the Exploration-Execution Trade-off, introduce the Symbolic Index to quantify decisional commitment, establish causal relationships, and prove curriculum learning is necessary due to distributional mismatch.

Result: The framework reveals that decisional certainty governs the trade-off, with high certainty enabling precise execution but inhibiting exploration, and low certainty facilitating search but causing error accumulation.

Conclusion: The work shifts the design paradigm from binary architectural choices toward adaptive systems that dynamically regulate decisional certainty based on task demands.

Abstract: Latent Chain-of-Thought (Latent CoT) models promise efficient reasoning via continuous representations, yet exhibit puzzling performance inconsistencies: excelling at exploration (ProsQA: 97.0%) but failing at computation (GSM8K: 34.1%). We reveal that this trade-off is governed by decisional certainty. Our contributions are threefold: (1) We theoretically characterize the fundamental Exploration-Execution Trade-off, proving that high certainty enables precise execution but inhibits exploration, while low certainty facilitates search but causes error accumulation. (2) We introduce the Symbolic Index–quantifying decisional commitment–as the core mechanism governing this trade-off and establish its causal relationship with both execution stability and exploration capability. (3) We prove that curriculum learning is theoretically necessary, as direct training provably fails due to distributional mismatch. Our framework shifts the design paradigm from binary architectural choices toward adaptive systems that dynamically regulate decisional certainty based on task demands.

[785] Multi-Agent Causal Reasoning System for Error Pattern Rule Automation in Vehicles

Hugo Math, Julian Lorentz, Stefan Oelsner, Rainer Lienhart

Main category: cs.AI

TL;DR: CAREP is a multi-agent system that automates the generation of error pattern rules from diagnostic trouble codes in vehicles using causal discovery and contextual reasoning.

Details

Motivation: Manual creation of error pattern rules from diagnostic trouble codes by domain experts is expensive and error-prone as vehicle complexity grows, necessitating automated solutions.

Method: CAREP uses three agents: 1) causal discovery agent to identify DTC-EP relations, 2) contextual information agent that integrates metadata and descriptions, and 3) orchestrator agent that synthesizes boolean rules with interpretable reasoning traces.

Result: Evaluation on a large-scale automotive dataset with 29,100+ unique DTCs and 474 error patterns shows CAREP can automatically and accurately discover unknown EP rules, outperforming LLM-only baselines while providing transparent causal explanations.

Conclusion: CAREP represents progress toward fully automated fault diagnostics, enabling scalable, interpretable, and cost-efficient vehicle maintenance through practical causal discovery and agent-based reasoning.

Abstract: Modern vehicles generate thousands of different discrete events known as Diagnostic Trouble Codes (DTCs). Automotive manufacturers use Boolean combinations of these codes, called error patterns (EPs), to characterize system faults and ensure vehicle safety. Yet, EP rules are still manually handcrafted by domain experts, a process that is expensive and prone to errors as vehicle complexity grows. This paper introduces CAREP (Causal Automated Reasoning for Error Patterns), a multi-agent system that automatizes the generation of EP rules from high-dimensional event sequences of DTCs. CAREP combines a causal discovery agent that identifies potential DTC-EP relations, a contextual information agent that integrates metadata and descriptions, and an orchestrator agent that synthesizes candidate boolean rules together with interpretable reasoning traces. Evaluation on a large-scale automotive dataset with over 29,100 unique DTCs and 474 error patterns demonstrates that CAREP can automatically and accurately discover the unknown EP rules, outperforming LLM-only baselines while providing transparent causal explanations. By uniting practical causal discovery and agent-based reasoning, CAREP represents a step toward fully automated fault diagnostics, enabling scalable, interpretable, and cost-efficient vehicle maintenance.

[786] Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models

Zhiming Liu, Yujie Wei, Lei Feng, Xiu Su, Xiaobo Xia, Weili Guan, Zeke Xie, Shuo Yang

Main category: cs.AI

TL;DR: TaLo is a training-free method that identifies and bypasses task-interfering layers in pretrained VLMs to improve performance on specific tasks without parameter updates.

Details

Motivation: The paper discovers that in pretrained Vision-Language Models (VLMs), certain layers actually hinder rather than help downstream task performance. When specific layers are bypassed (e.g., zeroed out), performance on certain tasks improves, revealing the existence of "Task-Interfering Layers" that negatively impact specific tasks.

Method: The authors systematically investigate layer influence via layer intervention experiments, measuring performance changes when bypassing individual layers. They introduce Task-Layer Interaction Vectors to quantify each layer’s effect on a task. Based on these findings, they propose TaLo (Task-Adaptive Layer Knockout) - a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task without parameter updates.

Result: TaLo improves performance across various models and datasets, including boosting Qwen-VL’s accuracy on the Maps task in ScienceQA by up to 16.6%. The method reveals consistent response patterns: tasks requiring similar capabilities show high similarity in their task-layer interaction vectors.

Conclusion: The work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time by identifying and bypassing task-interfering layers.

Abstract: Current VLMs have demonstrated capabilities across a wide range of multimodal tasks. Typically, in a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks, indicating that some layers hinder rather than help downstream tasks. We systematically investigate how individual layers influence different tasks via layer intervention. Specifically, we measure the change in performance relative to the base model after intervening on each layer and observe improvements when bypassing specific layers. This improvement can be generalizable across models and datasets, indicating the presence of Task-Interfering Layers that harm downstream tasks’ performance. We introduce Task-Layer Interaction Vector, which quantifies the effect of intervening on each layer of a VLM given a task. These task-interfering layers exhibit task-specific sensitivity patterns: tasks requiring similar capabilities show consistent response trends under layer interventions, as evidenced by the high similarity in their task-layer interaction vectors. Inspired by these findings, we propose TaLo (Task-Adaptive Layer Knockout), a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task. Without parameter updates, TaLo improves performance across various models and datasets, including boosting Qwen-VL’s accuracy on the Maps task in ScienceQA by up to 16.6%. Our work reveals an unexpected form of modularity in pretrained VLMs and provides a plug-and-play, training-free mechanism to unlock hidden capabilities at inference time. The source code will be publicly available.

[787] ASP-Bench: From Natural Language to Logic Programs

Stefan Szeider

Main category: cs.AI

TL;DR: ASP-Bench is a benchmark for evaluating systems that translate natural language problems into Answer Set Programs (ASP), covering 128 problem instances with systematic coverage of ASP features and reasoning aspects.

Details

Motivation: The paper addresses the challenge of automating translation from natural-language specifications to logic programs, which is important for neurosymbolic engineering. Current approaches lack comprehensive benchmarks for evaluating such translation systems.

Method: The authors create ASP-Bench with 128 natural language problem instances (64 base problems with easy/hard variants), systematically covering ASP features like choice rules, aggregates, and optimization. They characterize problems along 7 reasoning aspects and test using an agentic ReAct framework with feedback-driven iterative refinement.

Result: The benchmark enables evaluation of ASP translation systems. The ReAct-based agent achieves full saturation, demonstrating that feedback-driven iterative refinement with solver feedback provides reliable modeling. Analysis across multiple runs provides insights into what determines problem modeling hardness.

Conclusion: ASP-Bench provides a comprehensive benchmark for evaluating natural language to ASP translation systems, with the ReAct framework showing promise for reliable modeling through iterative refinement with solver feedback.

Abstract: Automating the translation of natural-language specifications into logic programs is a challenging task that affects neurosymbolic engineering. We present ASP-Bench, a benchmark comprising 128 natural language problem instances, 64 base problems with easy and hard variants. It evaluates systems that translate natural-language problems into Answer Set Programs (ASPs), a prominent form of logic programming. It provides systematic coverage of ASP features, including choice rules, aggregates, and optimization. Each problem includes reference validators that check whether solutions satisfy the problem specification. We characterize problems along seven largely independent reasoning aspects (optimization, temporal reasoning, default logic, resource allocation, recursion, spatial reasoning, and quantitative complexity), providing a multidimensional view of modeling difficulty. We test the benchmark using an agentic approach based on the ReAct (Reason and Act) framework, which achieves full saturation, demonstrating that feedback-driven iterative refinement with solver feedback provides a reliable and robust approach for modeling natural language in ASP. Our analysis across multiple agent runs enables us to gain insights into what determines a problem’s modeling hardness.

[788] A State-Transition Framework for Efficient LLM Reasoning

Liang Zhang, Yu Zhao, Longyue Wang, Tianqi Shi, Weihua Luo, Kaifu Zhang, Jinsong Su

Main category: cs.AI

TL;DR: A linear attention-based state-transition framework improves LLM reasoning efficiency by modeling reasoning as state transitions, reducing computational complexity from quadratic to linear while maintaining reasoning capacity.

Details

Motivation: Long Chain-of-Thought reasoning improves LLM performance but incurs high computational/memory costs. Existing compression methods conflict with test-time scaling and limit reasoning capacity.

Method: Models reasoning as state-transition process using linear attention to estimate reasoning state (historical information). Each token retrieves relevant historical info directly from state without explicit attention to previous steps, reducing complexity from quadratic to linear. Includes state-based reasoning strategy to mitigate over-thinking.

Result: Extensive experiments across multiple datasets and model sizes show the framework improves both reasoning efficiency and performance.

Conclusion: The proposed state-transition framework with linear attention significantly enhances LLM reasoning efficiency while maintaining or improving reasoning capacity, addressing limitations of existing CoT compression methods.

Abstract: While Long Chain-of-Thought (CoT) reasoning significantly improves Large Language Models (LLMs) performance on complex reasoning tasks, the substantial computational and memory costs of generating long CoT sequences limit their efficiency and practicality. Existing studies usually enhance the reasoning efficiency of LLMs by compressing CoT sequences. However, this approach conflicts with test-time scaling, limiting the reasoning capacity of LLMs. In this paper, we propose an efficient reasoning framework that models the reasoning process of LLMs as a state-transition process. Specifically, we first apply a linear attention mechanism to estimate the LLM’s reasoning state, which records the historical reasoning information from previous reasoning steps. Then, based on the query prompt and the reasoning state, the LLM can efficiently perform the current reasoning step and update the state. With the linear attention, each token in the current reasoning step can directly retrieve relevant historical reasoning information from the reasoning state, without explicitly attending to tokens in previous reasoning steps. In this way, the computational complexity of attention is reduced from quadratic to linear, significantly improving the reasoning efficiency of LLMs. In addition, we propose a state-based reasoning strategy to mitigate the over-thinking issue caused by noisy reasoning steps. Extensive experiments across multiple datasets and model sizes demonstrate that our framework not only improves the reasoning efficiency of LLMs but also enhances their reasoning performance.

[789] Workflow-R1: Group Sub-sequence Policy Optimization for Multi-turn Workflow Construction

Mingze Kong, Zikun Qu, Zhongquan Zhou, Pengyu Liang, Xiang Li, Zhiwei Shang, Zhi Hong, Kaiyu Huang, Zhiyong Wang, Zhongxiang Dai

Main category: cs.AI

TL;DR: Workflow-R1 reformulates workflow construction as multi-turn natural language sequential decision-making using Group Sub-sequence Policy Optimization (GSsPO) to align with agentic reasoning dynamics.

Details

Motivation: Existing workflow optimization methods treat workflow synthesis as static, one-shot code-centric generation, which imposes excessive constraints on LLM coding capabilities and restricts flexibility for dynamic problem-solving in agentic workflows.

Method: Presents Workflow-R1 framework that reformulates workflow construction as multi-turn natural language sequential decision-making. Introduces Group Sub-sequence Policy Optimization (GSsPO) to resolve optimization granularity mismatch by recalibrating optimization units to composite sub-sequences (Think-Action cycles), aligning gradient updates with semantic boundaries.

Result: Through extensive experiments on multiple QA benchmarks, Workflow-R1 outperforms competitive baselines, validating GSsPO as a generalized solution for sequential reasoning.

Conclusion: Workflow-R1 establishes a promising new paradigm for automated workflow optimization, with GSsPO serving as a structure-aware RL algorithm generalizable to broad classes of multi-turn agentic sequential decision-making tasks.

Abstract: The rapid evolution of agentic workflows has demonstrated strong performance of LLM-based agents in addressing complex reasoning tasks. However, existing workflow optimization methods typically formulate workflow synthesis as a static, one-shot code-centric generation problem. This paradigm imposes excessive constraints on the model’s coding capabilities and restricts the flexibility required for dynamic problem-solving. In this paper, we present Workflow-R1, a framework that reformulates workflow construction as a multi-turn, natural language-based sequential decision-making process. To resolve the optimization granularity mismatch inherent in such multi-turn interactions, we introduce Group Sub-sequence Policy Optimization (GSsPO). While explicitly tailored to align with the interleaved Think-Action dynamics of agentic reasoning, GSsPO fundamentally functions as a structure-aware RL algorithm generalizable to a broad class of multi-turn agentic sequential decision-making tasks. By recalibrating the optimization unit to the composite sub-sequence, specifically the atomic Think-Action cycle, it aligns gradient updates with the semantic boundaries of these interactions, ensuring robust learning in complex multi-turn reasoning tasks. Through extensive experiments on multiple QA benchmarks, Workflow-R1 outperforms competitive baselines, validating GSsPO as a generalized solution for sequential reasoning and establishing Workflow-R1 as a promising new paradigm for automated workflow optimization.

[790] Addressing Explainability of Generative AI using SMILE (Statistical Model-agnostic Interpretability with Local Explanations)

Zeinab Dehghani

Main category: cs.AI

TL;DR: gSMILE is a unified framework for explaining generative AI models by quantifying how specific prompt components influence outputs using controlled perturbations and Wasserstein distance metrics.

Details

Motivation: Generative AI models produce complex outputs but have opaque decision-making processes, limiting trust and accountability in high-stakes applications. There's a need for explainability methods specifically designed for generative models.

Method: Extends SMILE method to generative settings using controlled text perturbations, Wasserstein distance metrics, and weighted surrogate modeling. Provides token-level attribution for LLMs and analyzes instruction-based image editing models. Uses scenario-based evaluation with Operational Design Domain framework.

Result: gSMILE produces robust, human-aligned attributions and generalizes effectively across state-of-the-art generative models. It generates intuitive heatmaps highlighting influential tokens and reasoning pathways.

Conclusion: gSMILE advances transparent, reliable, and responsible deployment of generative AI technologies by providing systematic explainability for generative models across diverse conditions.

Abstract: The rapid advancement of generative artificial intelligence has enabled models capable of producing complex textual and visual outputs; however, their decision-making processes remain largely opaque, limiting trust and accountability in high-stakes applications. This thesis introduces gSMILE, a unified framework for the explainability of generative models, extending the Statistical Model-agnostic Interpretability with Local Explanations (SMILE) method to generative settings. gSMILE employs controlled perturbations of textual input, Wasserstein distance metrics, and weighted surrogate modelling to quantify and visualise how specific components of a prompt or instruction influence model outputs. Applied to Large Language Models (LLMs), gSMILE provides fine-grained token-level attribution and generates intuitive heatmaps that highlight influential tokens and reasoning pathways. In instruction-based image editing models, the exact text-perturbation mechanism is employed, allowing for the analysis of how modifications to an editing instruction impact the resulting image. Combined with a scenario-based evaluation strategy grounded in the Operational Design Domain (ODD) framework, gSMILE allows systematic assessment of model behaviour across diverse semantic and environmental conditions. To evaluate explanation quality, we define rigorous attribution metrics, including stability, fidelity, accuracy, consistency, and faithfulness, and apply them across multiple generative architectures. Extensive experiments demonstrate that gSMILE produces robust, human-aligned attributions and generalises effectively across state-of-the-art generative models. These findings highlight the potential of gSMILE to advance transparent, reliable, and responsible deployment of generative AI technologies.

[791] Not All Preferences Are Created Equal: Stability-Aware and Gradient-Efficient Alignment for Reasoning Models

Hui Wu, Hengyi Cai, Jinman Zhao, Xinran Chen, Ziheng Li, Zhejun Zhao, Shuaiqiang Wang, Yuchen Li, Dawei Yin

Main category: cs.AI

TL;DR: SAGE is a dynamic framework for preference-based alignment of large reasoning models that optimizes training efficiency by prioritizing informative samples and filtering unstable ones based on policy-aware stability metrics.

Details

Motivation: Standard preference alignment methods like DPO treat all preference pairs uniformly, leading to inefficient optimization by wasting computation on trivial pairs and suffering from noise near uncertain decision boundaries. There's a need for dynamic approaches that adapt to model competence.

Method: SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples to maximize Signal-to-Noise Ratio of policy updates.

Result: Experiments on multiple mathematical reasoning benchmarks show SAGE significantly accelerates convergence and outperforms static baselines, demonstrating improved alignment efficiency and reliability.

Conclusion: The paper highlights the critical role of policy-aware, stability-conscious data selection in reasoning alignment, showing that dynamic sample prioritization based on model competence and stability metrics leads to more efficient and reliable training.

Abstract: Preference-based alignment is pivotal for training large reasoning models; however, standard methods like Direct Preference Optimization (DPO) typically treat all preference pairs uniformly, overlooking the evolving utility of training instances. This static approach often leads to inefficient or unstable optimization, as it wastes computation on trivial pairs with negligible gradients and suffers from noise induced by samples near uncertain decision boundaries. Facing these challenges, we propose SAGE (Stability-Aware Gradient Efficiency), a dynamic framework designed to enhance alignment reliability by maximizing the Signal-to-Noise Ratio of policy updates. Concretely, SAGE integrates a coarse-grained curriculum mechanism that refreshes candidate pools based on model competence with a fine-grained, stability-aware scoring function that prioritizes informative, confident errors while filtering out unstable samples. Experiments on multiple mathematical reasoning benchmarks demonstrate that SAGE significantly accelerates convergence and outperforms static baselines, highlighting the critical role of policy-aware, stability-conscious data selection in reasoning alignment.

[792] FutureMind: Equipping Small Language Models with Strategic Thinking-Pattern Priors via Adaptive Knowledge Distillation

Shaoxiong Yang, Junting Li, Mengyuan Zhang, Chao Li, Wei Liu, Jian Luan

Main category: cs.AI

TL;DR: FutureMind: A modular reasoning framework that enhances Small Language Models (SLMs) with strategic thinking-pattern priors via adaptive knowledge distillation from LLMs, improving performance on complex knowledge-intensive tasks.

Details

Motivation: Small Language Models (SLMs) are cost-effective but struggle with complex, knowledge-intensive tasks requiring structured reasoning and effective retrieval. There's a need to enhance SLMs' reasoning capabilities without sacrificing their efficiency advantages.

Method: Proposes FutureMind framework with four key modules: Problem Analysis, Logical Reasoning, Strategy Planning, and Retrieval Guidance. Uses adaptive knowledge distillation from LLMs to transfer thinking patterns to SLMs. Implements three distinct retrieval paradigms to decompose complex queries into tractable subproblems.

Result: Extensive experiments on multi-hop QA benchmarks (2WikiMultihopQA, MuSiQue, Bamboogle, Frames) show FutureMind consistently outperforms strong baselines like Search-o1, achieving state-of-the-art results under free training conditions across diverse SLM architectures and scales.

Conclusion: FutureMind successfully enhances SLMs’ reasoning capabilities through thinking-pattern distillation, but reveals cognitive bias bottlenecks between teacher (LLMs) and student (SLMs) models, providing new insights into reasoning skill transferability for developing efficient yet capable SLMs.

Abstract: Small Language Models (SLMs) are attractive for cost-sensitive and resource-limited settings due to their efficient, low-latency inference. However, they often struggle with complex, knowledge-intensive tasks that require structured reasoning and effective retrieval. To address these limitations, we propose FutureMind, a modular reasoning framework that equips SLMs with strategic thinking-pattern priors via adaptive knowledge distillation from large language models (LLMs). FutureMind introduces a dynamic reasoning pipeline composed of four key modules: Problem Analysis, Logical Reasoning, Strategy Planning, and Retrieval Guidance. This pipeline is augmented by three distinct retrieval paradigms that decompose complex queries into tractable subproblems, ensuring efficient and accurate retrieval execution. Extensive experiments on multi-hop QA benchmarks, including 2WikiMultihopQA, MuSiQue, Bamboogle, and Frames, demonstrate the superiority of FutureMind. It consistently outperforms strong baselines such as Search-o1, achieving state-of-the-art results under free training conditions across diverse SLM architectures and scales. Beyond empirical gains, our analysis reveals that the process of thinking-pattern distillation is restricted by the cognitive bias bottleneck between the teacher (LLMs) and student (SLMs) models. This provides new perspectives on the transferability of reasoning skills, paving the way for the development of SLMs that combine efficiency with genuine cognitive capability.

[793] Predictive Scheduling for Efficient Inference-Time Reasoning in Large Language Models

Katrina Brown, Aneesh Muppidi, Rana Shahout

Main category: cs.AI

TL;DR: Predictive Scheduling framework uses lightweight predictors to estimate optimal reasoning length for LLM queries before generation, enabling dynamic token budget allocation to maximize accuracy within fixed compute constraints.

Details

Motivation: Current LLMs use fixed token budgets for chain-of-thought reasoning, leading to inefficient computation (over-computation on easy queries and under-computation on hard ones). There's a need for adaptive resource allocation to optimize the compute-accuracy trade-off.

Method: Two lightweight predictors: (1) MLP on intermediate transformer hidden states, or (2) LoRA-fine-tuned classifier on raw question text. These predict optimal reasoning length/difficulty before full generation. A greedy batch allocator dynamically distributes fixed total token budget across queries to maximize expected accuracy.

Result: On GSM8K arithmetic benchmark: up to 7.9 percentage points absolute accuracy gain over uniform budgeting at identical token cost, closing over 50% of gap to oracle with perfect foresight. Middle transformer layers (12-17) carry richest signals for size estimation.

Conclusion: Pre-run budget prediction enables fine-grained control of compute-accuracy trade-off, offering path toward latency-sensitive, cost-efficient LLM deployments. Predictive scheduling is a plug-and-play framework for adaptive resource allocation.

Abstract: Large language models (LLMs) achieve state-of-the-art accuracy on complex reasoning tasks by generating multiple chain-of-thought (CoT) traces, but using a fixed token budget per query leads to over-computation on easy inputs and under-computation on hard ones. We introduce Predictive Scheduling, a plug-and-play framework that pre-runs lightweight predictors, an MLP on intermediate transformer hidden states or a LoRA-fine-tuned classifier on raw question text, to estimate each query’s optimal reasoning length or difficulty before any full generation. Our greedy batch allocator dynamically distributes a fixed total token budget across queries to maximize expected accuracy. On the GSM8K arithmetic benchmark, predictive scheduling yields up to 7.9 percentage points of absolute accuracy gain over uniform budgeting at identical token cost, closing over 50% of the gap to an oracle with perfect foresight. A systematic layer-wise study reveals that middle layers (12 - 17) of the transformer carry the richest signals for size estimation. These results demonstrate that pre-run budget prediction enables fine-grained control of the compute-accuracy trade-off, offering a concrete path toward latency-sensitive, cost-efficient LLM deployments.

[794] LLM-Driven Ontology Construction for Enterprise Knowledge Graphs

Abdulsobur Oyewale, Tommaso Soru

Main category: cs.AI

TL;DR: OntoEKG: LLM-driven pipeline for automated ontology generation from unstructured enterprise data using extraction and entailment modules

Details

Motivation: Manual ontology construction for Enterprise Knowledge Graphs is resource-intensive and requires domain expertise; need for automated approaches to accelerate this process

Method: Two-phase LLM-driven pipeline: extraction module identifies core classes/properties, entailment module logically structures elements into hierarchy before serializing to RDF

Result: Achieved fuzzy-match F1-score of 0.724 in Data domain; reveals limitations in scope definition and hierarchical reasoning; evaluated on new dataset from Data, Finance, Logistics sectors

Conclusion: LLM-driven approach shows potential for accelerating ontology construction but faces challenges in scope definition and hierarchical reasoning; contributes new evaluation dataset

Abstract: Enterprise Knowledge Graphs have become essential for unifying heterogeneous data and enforcing semantic governance. However, the construction of their underlying ontologies remains a resource-intensive, manual process that relies heavily on domain expertise. This paper introduces OntoEKG, a LLM-driven pipeline designed to accelerate the generation of domain-specific ontologies from unstructured enterprise data. Our approach decomposes the modelling task into two distinct phases: an extraction module that identifies core classes and properties, and an entailment module that logically structures these elements into a hierarchy before serialising them into standard RDF. Addressing the significant lack of comprehensive benchmarks for end-to-end ontology construction, we adopt a new evaluation dataset derived from documents across the Data, Finance, and Logistics sectors. Experimental results highlight both the potential and the challenges of this approach, achieving a fuzzy-match F1-score of 0.724 in the Data domain while revealing limitations in scope definition and hierarchical reasoning.

[795] RE-MCDF: Closed-Loop Multi-Expert LLM Reasoning for Knowledge-Grounded Clinical Diagnosis

Shaowei Shen, Xiaohong Yang, Jie Yang, Lianfen Huang, Yongcai Zhang, Yang Zou, Seyyedali Hosseinalipour

Main category: cs.AI

TL;DR: RE-MCDF is a relation-enhanced multi-expert clinical diagnosis framework for neurology EMRs that uses a generation-verification-revision closed-loop with three expert components and medical knowledge graph guidance to improve diagnostic accuracy.

Details

Motivation: EMRs in neurology are heterogeneous, sparse, and noisy, making LLMs vulnerable to self-reinforcing errors. Existing multi-agent frameworks have shallow interactions and ignore logical dependencies among diseases, preventing them from ruling out clinically implausible hypotheses.

Method: RE-MCDF introduces a generation-verification-revision closed-loop architecture with three components: (1) primary expert generates candidate diagnoses and evidence, (2) laboratory expert dynamically prioritizes clinical indicators, and (3) multi-relation awareness expert group enforces inter-disease logical constraints using a medical knowledge graph.

Result: Extensive experiments on neurology subsets (NEEMRs and XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios.

Conclusion: The framework effectively addresses limitations of existing approaches by incorporating logical disease dependencies and structured multi-expert collaboration, improving clinical diagnosis accuracy in challenging EMR settings.

Abstract: Electronic medical records (EMRs), particularly in neurology, are inherently heterogeneous, sparse, and noisy, which poses significant challenges for large language models (LLMs) in clinical diagnosis. In such settings, single-agent systems are vulnerable to self-reinforcing errors, as their predictions lack independent validation and can drift toward spurious conclusions. Although recent multi-agent frameworks attempt to mitigate this issue through collaborative reasoning, their interactions are often shallow and loosely structured, failing to reflect the rigorous, evidence-driven processes used by clinical experts. More fundamentally, existing approaches largely ignore the rich logical dependencies among diseases, such as mutual exclusivity, pathological compatibility, and diagnostic confusion. This limitation prevents them from ruling out clinically implausible hypotheses, even when sufficient evidence is available. To overcome these, we propose RE-MCDF, a relation-enhanced multi-expert clinical diagnosis framework. RE-MCDF introduces a generation–verification–revision closed-loop architecture that integrates three complementary components: (i) a primary expert that generates candidate diagnoses and supporting evidence, (ii) a laboratory expert that dynamically prioritizes heterogeneous clinical indicators, and (iii) a multi-relation awareness and evaluation expert group that explicitly enforces inter-disease logical constraints. Guided by a medical knowledge graph (MKG), the first two experts adaptively reweight EMR evidence, while the expert group validates and corrects candidate diagnoses to ensure logical consistency. Extensive experiments on the neurology subset of CMEMR (NEEMRs) and on our curated dataset (XMEMRs) demonstrate that RE-MCDF consistently outperforms state-of-the-art baselines in complex diagnostic scenarios.

[796] Model Specific Task Similarity for Vision Language Model Selection via Layer Conductance

Wei Yang, Hong Xie, Tao Tan, Xin Li, Defu Lian, Enhong Chen

Main category: cs.AI

TL;DR: A framework for selecting optimal pretrained Vision-Language Models for downstream tasks using internal functional dynamics of visual encoders, with a directional conductance divergence metric.

Details

Motivation: Current VLM selection methods are inadequate - they either require extensive data or use symmetric textual descriptors that ignore the directional, model-specific nature of transferability. There's a need for efficient model selection without exhaustive evaluation.

Method: Proposes representing tasks via layer-wise conductance, deriving target-conditioned block importance distribution through entropy regularized alignment. Introduces Directional Conductance Divergence (DCD), an asymmetric metric quantifying how effectively a source task covers target’s salient functional blocks.

Result: Experimental results on 48 VLMs across 21 datasets show the method outperforms state-of-the-art baselines, achieving 14.7% improvement in NDCG@5 over SWAB.

Conclusion: The framework provides an effective way to select optimal pretrained VLMs for downstream tasks by grounding selection in visual encoder functional dynamics, addressing limitations of existing methods.

Abstract: While open sourced Vision-Language Models (VLMs) have proliferated, selecting the optimal pretrained model for a specific downstream task remains challenging. Exhaustive evaluation is often infeasible due to computational constraints and data limitations in few shot scenarios. Existing selection methods fail to fully address this: they either rely on data-intensive proxies or use symmetric textual descriptors that neglect the inherently directional and model-specific nature of transferability. To address this problem, we propose a framework that grounds model selection in the internal functional dynamics of the visual encoder. Our approach represents each task via layer wise conductance and derives a target-conditioned block importance distribution through entropy regularized alignment. Building on this, we introduce Directional Conductance Divergence (DCD), an asymmetric metric that quantifies how effectively a source task covers the target’s salient functional blocks. This allows for predicting target model rankings by aggregating source task ranks without direct inference. Experimental results on 48 VLMs across 21 datasets demonstrate that our method outperforms state-of-the-art baselines, achieving a 14.7% improvement in NDCG@5 over SWAB.

[797] Aggregation Queries over Unstructured Text: Benchmark and Agentic Method

Haojia Zhu, Qinyuan Xu, Haoyu Li, Yuxi Liu, Hanchen Qiu, Jiaoyan Chen, Jiahui Jin

Main category: cs.AI

TL;DR: DFA framework for entity-level aggregation queries over text with strict completeness requirements, benchmarked on AGGBench

Details

Motivation: Aggregation queries over free text require exhaustive evidence collection ("find all" not just "find one"), which existing paradigms like Text-to-SQL and RAG fail to achieve. Need for formalization and evaluation of completeness-oriented aggregation.

Method: Proposes DFA (Disambiguation–Filtering–Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages: disambiguation, filtering, and aggregation. Introduces AGGBench benchmark for evaluating completeness-oriented aggregation under realistic large-scale corpus.

Result: DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The framework exposes key failure modes related to ambiguity, filtering, and aggregation.

Conclusion: DFA provides a principled approach to entity-level aggregation querying with strict completeness requirements, addressing limitations of existing methods through modular decomposition and comprehensive benchmarking.

Abstract: Aggregation query over free text is a long-standing yet underexplored problem. Unlike ordinary question answering, aggregate queries require exhaustive evidence collection and systems are required to “find all,” not merely “find one.” Existing paradigms such as Text-to-SQL and Retrieval-Augmented Generation fail to achieve this completeness. In this work, we formalize entity-level aggregation querying over text in a corpus-bounded setting with strict completeness requirement. To enable principled evaluation, we introduce AGGBench, a benchmark designed to evaluate completeness-oriented aggregation under realistic large-scale corpus. To accompany the benchmark, we propose DFA (Disambiguation–Filtering–Aggregation), a modular agentic baseline that decomposes aggregation querying into interpretable stages and exposes key failure modes related to ambiguity, filtering, and aggregation. Empirical results show that DFA consistently improves aggregation evidence coverage over strong RAG and agentic baselines. The data and code are available in https://anonymous.4open.science/r/DFA-A4C1.

[798] Building Better Deception Probes Using Targeted Instruction Pairs

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

Main category: cs.AI

TL;DR: Linear probes for AI deception monitoring show limitations; instruction pair choice is crucial, and targeting specific deception types via human-interpretable taxonomy improves performance, suggesting specialized probes over universal detectors.

Details

Motivation: Previous linear probe approaches for monitoring AI deception have shown notable failures including spurious correlations and false positives, even in straightforward scenarios. The research aims to understand why these probes fail and how to improve them.

Method: The study identifies the importance of instruction pairs used during training and shows that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results. They analyze how instruction pairs capture deceptive intent rather than content-specific patterns.

Result: Instruction pair choice dominates probe performance (70.6% of variance), and targeting specific deception types improves results on evaluation datasets. The findings reveal that probes capture deceptive intent rather than content patterns.

Conclusion: Given the heterogeneity of deception types across datasets, organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.

Abstract: Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.

[799] SimGym: Traffic-Grounded Browser Agents for Offline A/B Testing in E-Commerce

Alberto Castelo, Zahra Zanjani Foumani, Ailin Fan, Keat Yang Koay, Vibhor Malik, Yuanzheng Zhu, Han Li, Meysam Feghhi, Ronie Uliana, Shuang Xie, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Lingyun Wang, Zhong Wu

Main category: cs.AI

TL;DR: SimGym: A scalable system for rapid offline A/B testing using LLM-powered synthetic buyers that simulate real user behavior in e-commerce UI testing, reducing experiment cycles from weeks to under an hour.

Details

Motivation: Traditional A/B testing for e-commerce UI changes has significant drawbacks: it diverts real traffic, takes weeks to reach statistical significance, and risks harming user experience by exposing real customers to potentially poor designs.

Method: SimGym uses Large Language Model agents as synthetic buyers operating in live browsers. It extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts.

Result: SimGym achieves state-of-the-art alignment with observed outcome shifts from real human outcomes on a major e-commerce platform, even without alignment post-training. It reduces experiment cycles from weeks to under an hour.

Conclusion: SimGym enables rapid experimentation without exposure to real buyers, providing a scalable solution for offline A/B testing that maintains alignment with real user behavior while dramatically accelerating the testing process.

Abstract: A/B testing remains the gold standard for evaluating e-commerce UI changes, yet it diverts traffic, takes weeks to reach significance, and risks harming user experience. We introduce SimGym, a scalable system for rapid offline A/B testing using traffic-grounded synthetic buyers powered by Large Language Model agents operating in a live browser. SimGym extracts per-shop buyer profiles and intents from production interaction data, identifies distinct behavioral archetypes, and simulates cohort-weighted sessions across control and treatment storefronts. We validate SimGym against real human outcomes from real UI changes on a major e-commerce platform under confounder control. Even without alignment post training, SimGym agents achieve state of the art alignment with observed outcome shifts and reduces experiment cycles from weeks to under an hour , enabling rapid experimentation without exposure to real buyers.

[800] Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering

Nikita Benkovich, Vitalii Valkov

Main category: cs.AI

TL;DR: A multi-agent system that models software engineering as an organizational process with specialized roles, achieving 72.4% success on SWE-bench 500, outperforming single-agent approaches.

Details

Motivation: Current autonomous systems treat issue resolution as monolithic or pipeline-based, while real-world software development is collaborative with clear roles, communication, and review processes. The paper aims to replicate this organizational structure in an automated system.

Method: Built on agyn platform, the system assigns specialized agents to roles (coordination, research, implementation, review) with isolated sandboxes for experimentation and structured communication. It follows a defined development methodology including analysis, task specification, pull request creation, and iterative review.

Result: The system resolves 72.4% of tasks on SWE-bench 500, outperforming single-agent baselines using comparable language models. It was designed for real production use and not tuned for the benchmark.

Conclusion: Replicating team structure, methodology, and communication is a powerful paradigm for autonomous software engineering. Future progress may depend as much on organizational design and agent infrastructure as on model improvements.

Abstract: Large language models have demonstrated strong capabilities in individual software engineering tasks, yet most autonomous systems still treat issue resolution as a monolithic or pipeline-based process. In contrast, real-world software development is organized as a collaborative activity carried out by teams following shared methodologies, with clear role separation, communication, and review. In this work, we present a fully automated multi-agent system that explicitly models software engineering as an organizational process, replicating the structure of an engineering team. Built on top of agyn, an open-source platform for configuring agent teams, our system assigns specialized agents to roles such as coordination, research, implementation, and review, provides them with isolated sandboxes for experimentation, and enables structured communication. The system follows a defined development methodology for working on issues, including analysis, task specification, pull request creation, and iterative review, and operates without any human intervention. Importantly, the system was designed for real production use and was not tuned for SWE-bench. When evaluated post hoc on SWE-bench 500, it resolves 72.4% of tasks, outperforming single-agent baselines using comparable language models. Our results suggest that replicating team structure, methodology, and communication is a powerful paradigm for autonomous software engineering, and that future progress may depend as much on organizational design and agent infrastructure as on model improvements.

[801] Legal Infrastructure for Transformative AI Governance

Gillian K. Hadfield

Main category: cs.AI

TL;DR: This paper focuses on AI governance infrastructure rather than technical AI development, proposing legal frameworks for registration regimes and regulatory markets.

Details

Motivation: Current AI governance efforts focus too much on substantive rules rather than the legal and regulatory infrastructure needed to implement those rules effectively, especially given AI's transformative nature.

Method: The paper reviews three proposed legal/regulatory infrastructure frameworks: 1) registration regimes for frontier AI models, 2) registration/identification regimes for autonomous agents, and 3) regulatory markets that enable private companies to provide AI regulatory services.

Result: The paper presents conceptual frameworks for building legal infrastructure rather than empirical results, focusing on governance mechanisms that could facilitate effective AI regulation.

Conclusion: Effective AI governance requires attention to legal and regulatory infrastructure design, not just substantive rules, with proposed frameworks offering pathways for implementation.

Abstract: Most of our AI governance efforts focus on substance: what rules do we want in place? What limits or checks do we want to impose on AI development and deployment? But a key role for law is not only to establish substantive rules but also to establish legal and regulatory infrastructure to generate and implement rules. The transformative nature of AI calls especially for attention to building legal and regulatory frameworks. In this PNAS Perspective piece I review three examples I have proposed: the creation of registration regimes for frontier models; the creation of registration and identification regimes for autonomous agents; and the design of regulatory markets to facilitate a role for private companies to innovate and deliver AI regulatory services.

[802] Learning to Guide Local Search for MPE Inference in Probabilistic Graphical Models

Brij Malhotra, Shivvrat Arya, Tahrima Rahman, Vibhav Giridhar Gogate

Main category: cs.AI

TL;DR: Neural amortization framework using attention networks to improve local search for MPE inference in PGMs by scoring moves based on predicted ability to reduce Hamming distance to near-optimal solutions.

Details

Motivation: MPE inference in PGMs is computationally challenging, especially in repeated-query settings where the same model is used with varying evidence. Existing SLS algorithms often stagnate in poor local optima, and heuristics like GLS+ cannot effectively reuse guidance across multiple queries.

Method: Train an attention-based neural network on the fixed graph structure to score local moves by predicting their ability to reduce Hamming distance to near-optimal solutions. This neural guidance integrates with existing local search procedures to balance short-term likelihood gains with long-term promise during neighbor selection.

Result: Empirical results show consistent improvements over SLS and GLS+ on challenging high-treewidth benchmarks in the amortized inference setting, with theoretical intuition linking distance-reducing move selection to improved convergence behavior.

Conclusion: Neural amortization provides an effective framework for improving local search in repeated-query MPE inference by learning reusable guidance that helps escape local optima and achieve better solutions.

Abstract: Most Probable Explanation (MPE) inference in Probabilistic Graphical Models (PGMs) is a fundamental yet computationally challenging problem arising in domains such as diagnosis, planning, and structured prediction. In many practical settings, the graphical model remains fixed while inference must be performed repeatedly for varying evidence patterns. Stochastic Local Search (SLS) algorithms scale to large models but rely on myopic best-improvement rule that prioritizes immediate likelihood gains and often stagnate in poor local optima. Heuristics such as Guided Local Search (GLS+) partially alleviate this limitation by modifying the search landscape, but their guidance cannot be reused effectively across multiple inference queries on the same model. We propose a neural amortization framework for improving local search in this repeated-query regime. Exploiting the fixed graph structure, we train an attention-based network to score local moves by predicting their ability to reduce Hamming distance to a near-optimal solution. Our approach integrates seamlessly with existing local search procedures, using this signal to balance short-term likelihood gains with long-term promise during neighbor selection. We provide theoretical intuition linking distance-reducing move selection to improved convergence behavior, and empirically demonstrate consistent improvements over SLS and GLS+ on challenging high-treewidth benchmarks in the amortized inference setting.

[803] Qrita: High-performance Top-k and Top-p Algorithm for GPUs using Pivot-based Truncation and Selection

Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

Main category: cs.AI

TL;DR: Qrita is an efficient GPU algorithm for Top-k and Top-p sampling in LLMs that uses pivot-based selection instead of sorting, achieving 2x throughput and half memory usage while maintaining deterministic output.

Details

Motivation: Current Top-k and Top-p implementations for LLM sampling rely on inefficient sorting operations that cause significant computation and memory overhead on GPUs, or use stochastic approaches that alter algorithm output. There's a need for efficient deterministic implementations.

Method: Qrita extends RTop-k’s pivot-based search to both Top-k and Top-p with two key techniques: 1) Gaussian-based sigma-truncation to reduce search space, and 2) Quaternary pivot search with duplication handling to halve iterations while guaranteeing deterministic output. Implemented in Triton for GPU programming.

Result: Qrita achieves up to 2x throughput and half memory usage compared to sorting-based kernels in vLLM, SGLang, and Flashinfer, while providing identical deterministic output to sorting-based algorithms.

Conclusion: Qrita provides an efficient deterministic alternative to sorting-based Top-k and Top-p implementations, significantly improving performance and memory efficiency for LLM sampling operations on GPUs.

Abstract: Top-k and Top-p are the dominant truncation operators in the sampling of large language models. Despite their widespread use, implementing them efficiently over large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incur significant computation and memory overhead on GPUs, or stochastic approaches, which alter the algorithm output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on a pivot-based selection strategy. Based on RTop-k, which uses a pivot-based search for node selection in graph neural networks, Qrita extends the concept of pivot-based search to both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the target elements, and 2. Quaternary pivot search with duplication handling, which halves the pivot search iteration and guarantees deterministic output. We provide the full implementation of Qrita using Triton, a popular GPU programming language. Our evaluation of Qrita against the Top-k and Top-p kernels of high performance LLM execution engines such as vLLM, SGLang, and Flashinfer show that Qrita achieves up to 2 times throughput and half memory use while providing the same output to the the sorting-based algorithms.

[804] PRISM: Festina Lente Proactivity – Risk-Sensitive, Uncertainty-Aware Deliberation for Proactive Agents

Yuxuan Fu, Xiaoyu Tan, Teqi Hao, Chen Zhan, Xihe Qiu

Main category: cs.AI

TL;DR: PRISM: A framework for cost-sensitive selective intervention in proactive agents using decision-theoretic gating with dual-process reasoning and aligned distillation.

Details

Motivation: Current proactive agent systems rely on brittle heuristics or indiscriminate long reasoning, offering little control over the benefit-burden tradeoff between helpful interventions and disruptive false alarms.

Method: PRISM combines decision-theoretic gating with dual-process reasoning: a fast mode for routine cases and a resource-intensive Slow mode with counterfactual checks near decision boundaries. Uses gate-aligned, schema-locked distillation where a teacher provides supervision on unlabeled traces while the student learns a response policy decoupled from the intervention gate.

Result: On ProactiveBench, PRISM reduces false alarms by 22.78% and improves F1 by 20.14% over strong baselines, demonstrating precise, computationally efficient, and controllable proactive agents.

Conclusion: Principled decision-theoretic gating paired with selective slow reasoning and aligned distillation yields proactive agents that are precise, computationally efficient, and controllable.

Abstract: Proactive agents must decide not only what to say but also whether and when to intervene. Many current systems rely on brittle heuristics or indiscriminate long reasoning, which offers little control over the benefit-burden tradeoff. We formulate the problem as cost-sensitive selective intervention and present PRISM, a novel framework that couples a decision-theoretic gate with a dual-process reasoning architecture. At inference time, the agent intervenes only when a calibrated probability of user acceptance exceeds a threshold derived from asymmetric costs of missed help and false alarms. Inspired by festina lente (Latin: “make haste slowly”), we gate by an acceptance-calibrated, cost-derived threshold and invoke a resource-intensive Slow mode with counterfactual checks only near the decision boundary, concentrating computation on ambiguous and high-stakes cases. Training uses gate-aligned, schema-locked distillation: a teacher running the full PRISM pipeline provides dense, executable supervision on unlabeled interaction traces, while the student learns a response policy that is explicitly decoupled from the intervention gate to enable tunable and auditable control. On ProactiveBench, PRISM reduces false alarms by 22.78% and improves F1 by 20.14% over strong baselines. These results show that principled decision-theoretic gating, paired with selective slow reasoning and aligned distillation, yields proactive agents that are precise, computationally efficient, and controllable. To facilitate reproducibility, we release our code, models, and resources at https://prism-festinalente.github.io/; all experiments use the open-source ProactiveBench benchmark.

[805] S1-NexusAgent: a Self-Evolving Agent Framework for Multidisciplinary Scientific Research

S1-NexusAgent Team

Main category: cs.AI

TL;DR: S1-NexusAgent: A self-evolving agent framework for multidisciplinary scientific research with hierarchical Plan-and-CodeAct execution, MCP integration, and continuous learning from execution trajectories.

Details

Motivation: Existing LLMs and tool-based agents struggle with large-scale scientific data, complex workflows, and specialized tools due to limitations in long-horizon planning, robust goal maintenance, and continual learning.

Method: Hierarchical Plan-and-CodeAct execution paradigm with dual-loop architecture decoupling global planning from subtask execution; integrates Model Context Protocol (MCP) with thousands of scientific tools; uses object-reference-based sparse context management for large-scale data; includes Critic Agent for trajectory evaluation and skill distillation.

Result: Achieves state-of-the-art performance on authoritative scientific benchmarks (biomini-eval, ChemBench, MatSciBench) involving long-horizon planning and complex specialized tool orchestration.

Conclusion: S1-NexusAgent demonstrates effectiveness and generalization capability in complex scientific tasks through its self-evolving framework for sustainable, long-horizon scientific research.

Abstract: Modern scientific research relies on large-scale data, complex workflows, and specialized tools, which existing LLMs and tool-based agents struggle to handle due to limitations in long-horizon planning, robust goal maintenance, and continual learning from execution. To address these issues, in this work, we propose S1-NexusAgent, a self-evolving agent framework designed for multidisciplinary scientific research. S1-NexusAgent adopts a hierarchical Plan-and-CodeAct execution paradigm, decoupling global scientific planning from subtask-level tool execution through a dual-loop architecture, thereby enabling stable modeling of complex research workflows. The system natively supports the Model Context Protocol (MCP), integrates up to thousands of cross-disciplinary scientific tools, and achieves efficient orchestration of heterogeneous research tools via intention-aware dynamic tool retrieval and hot-plug mechanisms. To address long-context and large-scale data challenges in scientific settings, S1-NexusAgent introduces object-reference-based sparse context management, which enables sub-task context isolation and intermediate result compression. Building on this, a Critic Agent automatically evaluates complete execution trajectories and distills high-quality research paths into reusable Scientific Skills, forming a closed loop for continuous self-evolution, which is valuable for sustainable and long-horizon scientific research. Experiments on authoritative scientific benchmarks involving long-horizon planning and complex specialized tool orchestration, including biomini-eval (biology), ChemBench (chemistry), and MatSciBench (material science), demonstrate that S1-NexusAgent achieves state-of-the-art performance, validating its effectiveness and generalization capability in complex scientific tasks.

[806] Autonomous Question Formation for Large Language Model-Driven AI Systems

Hong Su

Main category: cs.AI

TL;DR: A framework enabling AI systems to autonomously form questions and set tasks by reasoning over internal states, environmental observations, and inter-agent interactions, improving adaptability in dynamic environments.

Details

Motivation: Current LLM-driven AI systems rely on predefined tasks and fixed prompts, limiting their ability to autonomously identify what problems to solve when environmental conditions change. There's a need for systems that can autonomously form questions and set tasks.

Method: Proposes a human-simulation-based framework that treats question formation as a first-class decision process. Uses internal-driven, environment-aware, and inter-agent-aware prompting scopes to progressively expand cognitive coverage. Supports learning the question-formation process from experience.

Result: In multi-agent simulation experiments, environment-aware prompting significantly reduces no-eat events compared to internal-driven baseline. Inter-agent-aware prompting further reduces cumulative no-eat events by more than 60% over 20-day simulation, with statistically significant improvements (p < 0.05).

Conclusion: The framework enables AI systems to autonomously form questions and set tasks, improving adaptability and decision quality in dynamic environments through progressive cognitive coverage expansion and learning from experience.

Abstract: Large language model (LLM)-driven AI systems are increasingly important for autonomous decision-making in dynamic and open environments. However, most existing systems rely on predefined tasks and fixed prompts, limiting their ability to autonomously identify what problems should be solved when environmental conditions change. In this paper, we propose a human-simulation-based framework that enables AI systems to autonomously form questions and set tasks by reasoning over their internal states, environmental observations, and interactions with other AI systems. The proposed method treats question formation as a first-class decision process preceding task selection and execution, and integrates internal-driven, environment-aware, and inter-agent-aware prompting scopes to progressively expand cognitive coverage. In addition, the framework supports learning the question-formation process from experience, allowing the system to improve its adaptability and decision quality over time. xperimental results in a multi-agent simulation environment show that environment-aware prompting significantly reduces no-eat events compared with the internal-driven baseline, and inter-agent-aware prompting further reduces cumulative no-eat events by more than 60% over a 20-day simulation, with statistically significant improvements (p < 0.05).

[807] Reasoning with Autoregressive-Diffusion Collaborative Thoughts

Mu Yuan, Liekang Zeng, Guoliang Xing, Lan Zhang, Yunhao Liu

Main category: cs.AI

TL;DR: Collaborative Thoughts framework unifies autoregressive and diffusion models through closed-loop interaction for improved spatial reasoning and controllable generation.

Details

Motivation: Autoregressive models excel at sequential planning but struggle with spatial grounding, while diffusion models capture spatial structure but lack stepwise logical control. There's a need to combine their complementary strengths for better multimodal generation.

Method: A unified collaborative framework where autoregressive models handle structured planning and constraint management, diffusion models generate intermediate visual thoughts, and a vision-based critic provides feedback for iterative refinement through closed-loop interaction.

Result: The framework improves reliability of spatial reasoning and controllability of generation, mitigating error propagation across modalities through iterative feedback loops.

Conclusion: Collaborative Thoughts successfully combines complementary strengths of autoregressive and diffusion models, enabling joint reasoning and generation through a unified collaborative framework that works for both question answering and visual generation tasks.

Abstract: Autoregressive and diffusion models represent two complementary generative paradigms. Autoregressive models excel at sequential planning and constraint composition, yet struggle with tasks that require explicit spatial or physical grounding. Diffusion models, in contrast, capture rich spatial structure through high-dimensional generation, but lack the stepwise logical control needed to satisfy complex, multi-stage constraints or to reliably identify and correct errors. We introduce Collaborative Thoughts, a unified collaborative framework that enables autoregressive and diffusion models to reason and generate jointly through a closed-loop interaction. In Collaborative Thoughts, autoregressive models perform structured planning and constraint management, diffusion models instantiate these constraints as intermediate visual thoughts, and a vision-based critic module evaluates whether the visual thoughts satisfy the intended structural and physical requirements. This feedback is then used to iteratively refine subsequent planning and generation steps, mitigating error propagation across modalities. Importantly, Collaborative Thoughts uses the same collaborative loop regardless of whether the task is autoregressive question answering or diffusion-based visual generation. Through representative examples, we illustrate how Collaborative Thoughts can improve the reliability of spatial reasoning and the controllability of generation.

[808] ToPT: Task-Oriented Prompt Tuning for Urban Region Representation Learning

Zitao Guo, Changyang Jiang, Tianhong Zhao, Jinzhou Cao, Genan Dai, Bowen Zhang

Main category: cs.AI

TL;DR: ToPT: A two-stage framework for learning spatially consistent and task-aligned region embeddings from heterogeneous urban data using spatial priors and MLLM-based prompting.

Details

Motivation: Existing methods for region embeddings either produce task-agnostic representations or lack spatial coherence and explicit task-semantic alignment, limiting their effectiveness for urban computing tasks like crime prediction and resource allocation.

Method: Two-stage framework with Spatial-aware Region Embedding Learning (SREL) using Graphormer with spatial priors as attention biases, and Task-aware Prompting for Region Embeddings (Prompt4RE) using frozen multimodal LLM to generate semantic vectors aligned with region embeddings via multi-head cross-attention.

Result: State-of-the-art performance across multiple tasks and cities with improvements up to 64.2%, demonstrating the necessity and complementarity of spatial priors and prompt-region alignment.

Conclusion: ToPT effectively addresses spatial incoherence and task misalignment in region embedding learning through explicit spatial priors and MLLM-based task conditioning, providing superior performance for urban computing applications.

Abstract: Learning effective region embeddings from heterogeneous urban data underpins key urban computing tasks (e.g., crime prediction, resource allocation). However, prevailing two-stage methods yield task-agnostic representations, decoupling them from downstream objectives. Recent prompt-based approaches attempt to fix this but introduce two challenges: they often lack explicit spatial priors, causing spatially incoherent inter-region modeling, and they lack robust mechanisms for explicit task-semantic alignment. We propose ToPT, a two-stage framework that delivers spatially consistent fusion and explicit task alignment. ToPT consists of two modules: spatial-aware region embedding learning (SREL) and task-aware prompting for region embeddings (Prompt4RE). SREL employs a Graphormer-based fusion module that injects spatial priors-distance and regional centrality-as learnable attention biases to capture coherent, interpretable inter-region interactions. Prompt4RE performs task-oriented prompting: a frozen multimodal large language model (MLLM) processes task-specific templates to obtain semantic vectors, which are aligned with region embeddings via multi-head cross-attention for stable task conditioning. Experiments across multiple tasks and cities show state-of-the-art performance, with improvements of up to 64.2%, validating the necessity and complementarity of spatial priors and prompt-region alignment. The code is available at https://github.com/townSeven/Prompt4RE.git.

[809] ProjDevBench: Benchmarking AI Coding Agents on End-to-End Project Development

Pengrui Lu, Shiqi Zhang, Yunzhong Hou, Lyumanshan Ye, Chaoyi Huang, Zixi Chen, Ji Zeng, Hantao Jiang, Pengfei Liu, Yiwei Wang, Ming-Hsuan Yang

Main category: cs.AI

TL;DR: ProjDevBench is an end-to-end benchmark for evaluating coding agents on complete project development, not just bug fixing, using OJ testing and LLM-assisted code review across 20 programming problems.

Details

Motivation: Existing evaluations for coding agents focus on issue-level bug fixing and lag behind end-to-end development capabilities. There's a need for benchmarks that assess agents' ability to generate complete codebases from project requirements.

Method: Introduces ProjDevBench with 20 programming problems across 8 categories. Combines Online Judge (OJ) testing with LLM-assisted code review to evaluate agents on: 1) system architecture design, 2) functional correctness, and 3) iterative solution refinement.

Result: Overall acceptance rate of 27.38% across six coding agents. Agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management.

Conclusion: ProjDevBench provides a comprehensive benchmark for end-to-end coding agent evaluation, revealing current limitations in handling complex system design and optimization tasks.

Abstract: Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.

[810] FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

Mingda Zhang, Haoran Luo, Tiesunlong Shen, Qika Lin, Xiaoying Tang, Rui Mao, Erik Cambria

Main category: cs.AI

TL;DR: FlowSteer: An RL framework for automated workflow orchestration using lightweight policy models and canvas environments, with plug-and-play operator/LLM support and novel CWRPO training.

Details

Motivation: Existing workflow orchestration faces challenges: high manual cost, reliance on specific operators/LLMs, and sparse reward signals. Need for automated, flexible workflow generation.

Method: End-to-end RL framework with lightweight policy model as agent and executable canvas environment. Multi-turn interaction where policy analyzes states and selects editing actions, canvas executes operators and returns feedback. Supports diverse operator libraries and interchangeable LLM backends. Uses Canvas Workflow Relative Policy Optimization (CWRPO) with diversity-constrained rewards and conditional release to stabilize learning.

Result: Experimental results on twelve datasets show FlowSteer significantly outperforms baselines across various tasks.

Conclusion: FlowSteer provides an effective automated workflow orchestration solution that addresses key challenges of manual cost, operator/LLM dependence, and sparse rewards through RL-based interactive refinement.

Abstract: In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.

[811] TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, Kaimin Wang, Wenhao Liu, Tianlong Li, Fengpeng Yue, Feng Hong, Cao Liu, Ke Zeng

Main category: cs.AI

TL;DR: TRIP-Bench is a travel-planning benchmark for evaluating LLM agents on long-horizon interactions with real-world constraints, multi-tool reasoning, and adaptive behavior, with GTPO as an online RL method for improving performance.

Details

Motivation: Existing benchmarks underrepresent key challenges for LLM agents in complex real-world settings: enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions.

Method: Introduces TRIP-Bench using real-world travel-planning data with 18 curated tools and 40+ requirements, featuring automated evaluation and difficulty splits. Also proposes GTPO, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing.

Result: Even advanced models achieve at most 50% success on easy split, dropping below 10% on hard subsets. GTPO applied to Qwen2.5-32B-Instruct improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in evaluation.

Conclusion: TRIP-Bench advances practical long-horizon interactive agents, while GTPO provides an effective online RL recipe for robust long-horizon training of LLM agents.

Abstract: As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce \textbf{TRIP-Bench}, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets. We further propose \textbf{GTPO}, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.

[812] What LLMs Think When You Don’t Tell Them What to Think About?

Yongchan Kwon, James Zou

Main category: cs.AI

TL;DR: LLMs exhibit systematic topical preferences and unique degenerate behaviors when generating from minimal, topic-neutral inputs, revealing model-specific biases and limitations.

Details

Motivation: To understand LLM behavior beyond topic-specific prompts and study near-unconstrained generation for better monitoring and AI safety.

Method: Analyze what 16 LLMs generate from minimal, topic-neutral inputs, collecting 256,000 samples to study systematic topical preferences and degenerate behaviors.

Result: Models show strong topical biases: GPT-OSS favors programming/math, Llama favors literature, DeepSeek favors religion, Qwen favors multiple-choice questions. Also found model-specific degenerate behaviors like repetitive phrases and personal URLs.

Conclusion: LLMs have systematic topical preferences even without explicit prompts, revealing model-specific biases important for AI safety and monitoring.

Abstract: Characterizing the behavior of large language models (LLMs) across diverse settings is critical for reliable monitoring and AI safety. However, most existing analyses rely on topic- or task-specific prompts, which can substantially limit what can be observed. In this work, we study what LLMs generate from minimal, topic-neutral inputs and probe their near-unconstrained generative behavior. Despite the absence of explicit topics, model outputs cover a broad semantic space, and surprisingly, each model family exhibits strong and systematic topical preferences. GPT-OSS predominantly generates programming (27.1%) and mathematical content (24.6%), whereas Llama most frequently generates literary content (9.1%). DeepSeek often generates religious content, while Qwen frequently generates multiple-choice questions. Beyond topical preferences, we also observe differences in content specialization and depth: GPT-OSS often generates more technically advanced content (e.g., dynamic programming) compared with other models (e.g., basic Python). Furthermore, we find that the near-unconstrained generation often degenerates into repetitive phrases, revealing interesting behaviors unique to each model family. For instance, degenerate outputs from Llama include multiple URLs pointing to personal Facebook and Instagram accounts. We release the complete dataset of 256,000 samples from 16 LLMs, along with a reproducible codebase.

[813] Beyond Dense States: Elevating Sparse Transcoders to Active Operators for Latent Reasoning

Yadong Wang, Haodong Chen, Yu Tian, Chuanxing Geng, Dong Liang, Xiang Chen

Main category: cs.AI

TL;DR: LSTR is a latent reasoning framework that uses sparse semantic transitions instead of dense latent transitions, improving interpretability while maintaining reasoning accuracy and compression efficiency.

Details

Motivation: Existing latent reasoning methods compress chain-of-thought into continuous hidden states but rely on dense latent transitions that are difficult to interpret and control. Meanwhile, sparse representation models offer human-interpretable semantic features but are limited to post-hoc analysis. The paper aims to reconcile this tension by making sparse features active reasoning operators.

Method: Proposes LSTR (Latent Sparse Transcoder Reasoning) framework that elevates functional sparse transcoders into active reasoning operators. Uses Latent Transition Transcoder (LTT) with residual skip architecture that decouples linear manifold transport from sparse semantic updates, enabling controllable semantic resolution via explicit sparsity constraints.

Result: Extensive experiments show LSTR preserves reasoning accuracy and compression efficiency while substantially improving interpretability over dense latent baselines. Causal interventions and trajectory analyses demonstrate that sparse features act as both interpretable and causally effective operators in the reasoning process.

Conclusion: LSTR successfully bridges the gap between interpretable sparse representations and active reasoning, creating a framework where sparse semantic features serve as both interpretable and causally effective operators in multi-step computation.

Abstract: Latent reasoning compresses the chain-of-thought (CoT) into continuous hidden states, yet existing methods rely on dense latent transitions that remain difficult to interpret and control. Meanwhile, sparse representation models uncover human-interpretable semantic features but remain largely confined to post-hoc analysis. We reconcile this tension by proposing LSTR (Latent Sparse Transcoder Reasoning), a latent reasoning framework that elevates functional sparse transcoders into active reasoning operators to perform multi-step computation through sparse semantic transitions. At its core, LSTR employs a Latent Transition Transcoder (LTT) with a residual skip architecture that decouples linear manifold transport from sparse semantic updates, enabling controllable semantic resolution via explicit sparsity constraints. Extensive experiments show that LSTR preserves reasoning accuracy and compression efficiency while substantially improving interpretability over dense latent baselines. Causal interventions and trajectory analyses further demonstrate that these sparse features act as both interpretable and causally effective operators in the reasoning process.

[814] Mitigating loss of control in advanced AI systems through instrumental goal trajectories

Willem Fourie

Main category: cs.AI

TL;DR: The paper introduces Instrumental Goal Trajectories (IGTs) as organizational pathways (procurement, governance, finance) through which AI systems can pursue instrumental goals by accessing resources like compute, data, and money, providing monitoring and intervention points beyond technical system properties.

Details

Motivation: Current AI safety mitigations are too technical and system-centric, focusing on model capabilities and behavior shaping. There's a need to address how AI systems might pursue instrumental goals through organizational resource access, requiring broader intervention points beyond the model itself.

Method: Developed three organizational pathways called Instrumental Goal Trajectories (IGTs): procurement (acquiring technical resources), governance (obtaining permissions/authority), and finance (accessing monetary resources). These create trails of organizational artifacts that can be monitored for intervention.

Result: IGTs provide concrete avenues for defining capability levels and implementing corrigibility/interruptibility by shifting focus from model properties to organizational systems, offering monitoring and intervention points when AI capabilities exceed acceptable thresholds.

Conclusion: Organizational Instrumental Goal Trajectories expand AI safety options beyond technical mitigations, enabling better monitoring and control of advanced AI systems by focusing on resource access pathways within organizations.

Abstract: Researchers at artificial intelligence labs and universities are concerned that highly capable artificial intelligence (AI) systems may erode human control by pursuing instrumental goals. Existing mitigations remain largely technical and system-centric: tracking capability in advanced systems, shaping behaviour through methods such as reinforcement learning from human feedback, and designing systems to be corrigible and interruptible. Here we develop instrumental goal trajectories to expand these options beyond the model. Gaining capability typically depends on access to additional technical resources, such as compute, storage, data and adjacent services, which in turn requires access to monetary resources. In organisations, these resources can be obtained through three organisational pathways. We label these pathways the procurement, governance and finance instrumental goal trajectories (IGTs). Each IGT produces a trail of organisational artefacts that can be monitored and used as intervention points when a systems capabilities or behaviour exceed acceptable thresholds. In this way, IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented, shifting attention from model properties alone to the organisational systems that enable them.

[815] Optimizing Prompts for Large Language Models: A Causal Approach

Wei Chen, Yanbin Fang, Shuran Fu, Fasheng Xu, Xuan Wei

Main category: cs.AI

TL;DR: CPO is a causal prompt optimization framework that uses Double Machine Learning to isolate prompt effects from query characteristics, enabling query-specific prompt customization without costly online evaluation.

Details

Motivation: LLMs in enterprise workflows are sensitive to prompt design, but current optimization methods use static instructions that don't adapt to heterogeneous queries or rely on correlational reward models that confound prompt effectiveness with query characteristics.

Method: Two-stage approach: 1) Learn offline causal reward model using Double Machine Learning on semantic embeddings of prompts and queries to isolate causal effect of prompt variations; 2) Use unbiased reward signal to guide resource-efficient search for query-specific prompts without online evaluation.

Result: CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers across mathematical reasoning, visualization, and data analytics benchmarks, with gains driven by improved robustness on hard queries.

Conclusion: CPO establishes causal inference as scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments, reshaping economics by shifting evaluation from real-time execution to offline causal modeling.

Abstract: Large Language Models (LLMs) are increasingly embedded in enterprise workflows, yet their performance remains highly sensitive to prompt design. Automatic Prompt Optimization (APO) seeks to mitigate this instability, but existing approaches face two persistent challenges. First, commonly used prompt strategies rely on static instructions that perform well on average but fail to adapt to heterogeneous queries. Second, more dynamic approaches depend on offline reward models that are fundamentally correlational, confounding prompt effectiveness with query characteristics. We propose Causal Prompt Optimization (CPO), a framework that reframes prompt design as a problem of causal estimation. CPO operates in two stages. First, it learns an offline causal reward model by applying Double Machine Learning (DML) to semantic embeddings of prompts and queries, isolating the causal effect of prompt variations from confounding query attributes. Second, it utilizes this unbiased reward signal to guide a resource-efficient search for query-specific prompts without relying on costly online evaluation. We evaluate CPO across benchmarks in mathematical reasoning, visualization, and data analytics. CPO consistently outperforms human-engineered prompts and state-of-the-art automated optimizers. The gains are driven primarily by improved robustness on hard queries, where existing methods tend to deteriorate. Beyond performance, CPO fundamentally reshapes the economics of prompt optimization: by shifting evaluation from real-time model execution to an offline causal model, it enables high-precision, per-query customization at a fraction of the inference cost required by online methods. Together, these results establish causal inference as a scalable foundation for reliable and cost-efficient prompt optimization in enterprise LLM deployments.

[816] MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Qixin Xiao, Kun Zhou

Main category: cs.AI

TL;DR: MACD: Model-aware Counterfactual Data based Contrastive Decoding reduces hallucinations in Video-LLMs by using model feedback to identify hallucination-prone object regions and generating targeted counterfactual inputs for contrastive decoding.

Details

Motivation: Video-LLMs suffer from hallucinations when visual evidence is weak, ambiguous, or biased. Existing contrastive decoding methods use random perturbations which don't align well with model weaknesses or control visual cues driving hallucinations.

Method: MACD uses Video-LLM’s own feedback to identify object regions most responsible for hallucinations, generates targeted counterfactual inputs at object level (not arbitrary frame/temporal modifications), and integrates these model-aware counterfactual data into contrastive decoding to enforce evidence-grounded token selection.

Result: Experiments on EventHallusion, MVBench, Perception-test and Video-MME show MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs (Qwen and InternVL families), especially effective for small, occluded, or co-occurring objects.

Conclusion: MACD provides an effective inference strategy that combines model-guided counterfactual construction with decoding to reduce hallucinations in Video-LLMs, with strong performance across multiple benchmarks and model families.

Abstract: Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM’s own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.

[817] Controlling Exploration-Exploitation in GFlowNets via Markov Chain Perspectives

Lin Chen, Samuel Drapeau, Fanghao Shao, Xuekai Zhu, Bo Xue, Yunchong Song, Mathieu Laurière, Zhouhan Lin

Main category: cs.AI

TL;DR: α-GFNs generalize GFlowNets with a tunable parameter α to control exploration-exploitation trade-off, improving mode discovery by up to 10× across benchmarks.

Details

Motivation: Standard GFlowNet objectives fix equal mixing of forward/backward policies, potentially constraining exploration-exploitation trade-off during training. This limitation restricts mode discovery capabilities in generative tasks.

Method: Established equivalence between GFlowNet objectives and Markov chain reversibility, revealing constraints’ origin. Proposed α-GFNs with tunable parameter α to generalize mixing, enabling direct control over exploration-exploitation dynamics while ensuring convergence to unique flows.

Result: α-GFN objectives consistently outperform previous GFlowNet objectives across Set, Bit Sequence, and Molecule Generation benchmarks, achieving up to 10× increase in number of discovered modes.

Conclusion: The theoretical connection between GFlowNets and Markov chains enables principled generalization of mixing policies, with α-GFNs providing effective control over exploration-exploitation for improved mode discovery in generative tasks.

Abstract: Generative Flow Network (GFlowNet) objectives implicitly fix an equal mixing of forward and backward policies, potentially constraining the exploration-exploitation trade-off during training. By further exploring the link between GFlowNets and Markov chains, we establish an equivalence between GFlowNet objectives and Markov chain reversibility, thereby revealing the origin of such constraints, and provide a framework for adapting Markov chain properties to GFlowNets. Building on these theoretical findings, we propose $α$-GFNs, which generalize the mixing via a tunable parameter $α$. This generalization enables direct control over exploration-exploitation dynamics to enhance mode discovery capabilities, while ensuring convergence to unique flows. Across various benchmarks, including Set, Bit Sequence, and Molecule Generation, $α$-GFN objectives consistently outperform previous GFlowNet objectives, achieving up to a $10 \times$ increase in the number of discovered modes.

[818] Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Mohammad Beigi, Ming Jin, Junshan Zhang, Qifan Wang, Lifu Huang

Main category: cs.AI

TL;DR: ARA framework treats reward hacking as competitive game between Hacker and Auditor to detect and mitigate RLHF vulnerabilities, improving alignment-utility tradeoff across multiple domains.

Details

Motivation: RLHF is vulnerable to reward hacking where models exploit spurious correlations in reward models to achieve high scores while violating human intent. Existing static defenses cannot adapt to novel exploitation strategies.

Method: Adversarial Reward Auditing (ARA) operates in two stages: 1) Hacker policy discovers reward model vulnerabilities while Auditor learns to detect exploitation from latent representations; 2) Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking into measurable, controllable signal.

Result: Experiments across three hacking scenarios show ARA achieves best alignment-utility tradeoff: reduces sycophancy to near-SFT levels while improving helpfulness, decreases verbosity while achieving highest ROUGE-L, and suppresses code gaming while improving Pass@1. Reward hacking, detection, and mitigation generalize across domains.

Conclusion: ARA successfully transforms reward hacking from unobservable failure into measurable, controllable signal, enabling efficient multi-domain defense with single model through competitive adversarial framework.

Abstract: Reinforcement Learning from Human Feedback (RLHF) remains vulnerable to reward hacking, where models exploit spurious correlations in learned reward models to achieve high scores while violating human intent. Existing mitigations rely on static defenses that cannot adapt to novel exploitation strategies. We propose Adversarial Reward Auditing (ARA), a framework that reconceptualizes reward hacking as a dynamic, competitive game. ARA operates in two stages: first, a Hacker policy discovers reward model vulnerabilities while an Auditor learns to detect exploitation from latent representations; second, Auditor-Guided RLHF (AG-RLHF) gates reward signals to penalize detected hacking, transforming reward hacking from an unobservable failure into a measurable, controllable signal. Experiments across three hacking scenarios demonstrate that ARA achieves the best alignment-utility tradeoff among all baselines: reducing sycophancy to near-SFT levels while improving helpfulness, decreasing verbosity while achieving the highest ROUGE-L, and suppressing code gaming while improving Pass@1. Beyond single-domain evaluation, we show that reward hacking, detection, and mitigation all generalize across domains – a Hacker trained on code gaming exhibits increased sycophancy despite no reward for this behavior, and an Auditor trained on one domain effectively suppresses exploitation in others, enabling efficient multi-domain defense with a single model.

[819] PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models

Xuliang Wang, Yuetao Chen, Maochan Zhen, Fang Liu, Xinzhou Zheng, Xingwu Liu, Hong Xu, Ming Li

Main category: cs.AI

TL;DR: PRISM is a novel speculative decoding architecture that disaggregates computation across parameter sets to decouple model capacity from inference cost, achieving superior speedup for LLM decoding.

Details

Motivation: Current speculative decoding methods face a trade-off between draft quality (using larger models) and computational overhead. Existing approaches struggle to balance prediction accuracy with compute latency, creating a fundamental dilemma in accelerating LLM decoding.

Method: PRISM disaggregates the computation of each predictive step across different parameter sets, refactoring computational pathways of draft models to decouple model capacity from inference cost through architectural innovation.

Result: PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. It boosts decoding throughput of optimized inference engines by more than 2.6x and scales more effectively with expanding data volumes.

Conclusion: PRISM provides an architectural solution to the draft quality vs. computational overhead trade-off in speculative decoding, enabling faster LLM inference without sacrificing prediction accuracy through innovative computation disaggregation.

Abstract: Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.

[820] Efficient Cross-Architecture Knowledge Transfer for Large-Scale Online User Response Prediction

Yucheng Wu, Yuekui Yang, Hongzheng Li, Anan Liu, Jian Xiao, Junjie Zhai, Huan Yu, Shaoping Ma, Leye Wang

Main category: cs.AI

TL;DR: CrossAdapt enables efficient cross-architecture knowledge transfer for large-scale recommendation systems with minimal retraining costs and performance degradation.

Details

Motivation: Deploying new architectures in large-scale user response prediction systems is costly due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables.

Method: Two-stage framework: 1) Offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling. 2) Online stage introduces asymmetric co-distillation (students update frequently, teachers update infrequently) with distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data.

Result: Experiments on three public datasets show 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) demonstrates effectiveness in mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.

Conclusion: CrossAdapt provides an efficient solution for cross-architecture knowledge transfer in large-scale recommendation systems, addressing the challenges of architectural heterogeneity and embedding table transfer costs while maintaining performance with reduced training time.

Abstract: Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27-0.43% AUC improvements while reducing training time by 43-71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.

[821] LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning

Rui Hua, Yu Wei, Zixin Shu, Kai Chang, Dengying Yan, Jianan Xia, Zeyu Liu, Hui Zhu, Shujie Song, Mingzhong Xiao, Xiaodong Li, Dongmei Jia, Zhuye Gao, Yanyan Meng, Naixuan Zhao, Yu Fu, Haibin Yu, Benman Yu, Yuanyuan Chen, Fei Dong, Zhizhou Meng, Pengcheng Yang, Songxue Zhao, Lijuan Pei, Yunhui Hu, Kan Ding, Jiayuan Duan, Wenmao Yin, Yang Gu, Runshun Zhang, Qiang Zhu, Jian Yu, Jiansheng Li, Baoyan Liu, Wenjia Wang, Xuezhong Zhou

Main category: cs.AI

TL;DR: LingLanMiDian benchmark for evaluating LLMs in Traditional Chinese Medicine with unified multi-task evaluation and expert-curated datasets.

Details

Motivation: Existing TCM benchmarks are fragmented, lack unified evaluation protocols, and don't adequately assess domain-specific reasoning needed for Traditional Chinese Medicine's unique ontology and clinical patterns.

Method: Created LingLan benchmark with expert-curated multi-task suite covering knowledge recall, multi-hop reasoning, information extraction, and clinical decision-making. Introduced consistent metrics, synonym-tolerant clinical labeling, Hard subsets, and reframed diagnosis as single-choice tasks.

Result: Evaluated 14 leading LLMs in zero-shot setting, revealing substantial gaps between models and human experts on Hard subsets, especially for TCM-specialized reasoning and clinical decision support.

Conclusion: LingLan provides unified, quantitative foundation for advancing TCM LLMs and domain-specific medical AI research, highlighting current limitations in specialized medical reasoning.

Abstract: Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM) with its distinctive ontology, terminology, and reasoning patterns requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale and rely on non-unified or generation-heavy scoring that hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation into single-choice decision recognition. We conduct comprehensive, zero-shot evaluations on 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, the evaluation on Hard subset reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at https://github.com/TCMAI-BJTU/LingLan and http://tcmnlp.com.

[822] ORCH: many analyses, one merge-a deterministic multi-agent orchestrator for discrete-choice reasoning with EMA-guided routing

Hanlin Zhou, Huah Yong Chan

Main category: cs.AI

TL;DR: ORCH is a deterministic coordination framework that orchestrates heterogeneous LLMs for discrete-choice reasoning tasks using a “many analyses, one decision” paradigm with fixed routing and aggregation rules.

Details

Motivation: Existing multi-agent LLM systems often rely on stochastic routing or ad-hoc heuristics, making them difficult to reproduce and interpret. There's a need for more predictable, controllable, and deployment-ready agent coordination frameworks.

Method: ORCH uses a deterministic coordination framework where multiple base LLMs independently produce structured analyses, and a dedicated merge agent outputs the final choice. It employs fixed rules for task decomposition and answer aggregation, with an optional EMA-guided router that updates agent selection based on historical performance metrics.

Result: ORCH consistently outperforms single-model baselines and majority-vote ensembles. On MMLU-Pro, it improves accuracy by over 10 points, and on GSM8K by over 50 points. The EMA router provides additional 0.7-2.0 point accuracy boosts.

Conclusion: ORCH offers a practical path toward controllable, interpretable, and deployment-ready LLM-based agent systems for discrete-choice reasoning through deterministic coordination of heterogeneous models.

Abstract: Recent advances in large-scale language models (LLMs) have made multi-agent architectures attractive for challenging reasoning tasks. However, many existing systems rely on stochastic routing or ad-hoc heuristics, making their behavior difficult to reproduce and their decision process hard to interpret. We propose ORCH, a deterministic coordination framework for discrete-choice reasoning that orchestrates heterogeneous LLMs. ORCH follows a ``many analyses, one decision’’ paradigm: multiple base models independently produce structured analyses, and a dedicated merge agent outputs the final choice. The framework uses fixed rules for task decomposition and answer aggregation, keeping the pipeline predictable, reproducible, and training-free. Determinism here refers to fixed routing and aggregation rules under a fixed evaluation protocol, rather than strict bit-level reproducibility across deployments. To exploit model complementarity, we optionally introduce an EMA-guided router that updates agent selection using historical accuracy, latency, or cost; since it relies on answer-based feedback, it is mainly intended for benchmarking, controlled evaluation, or delayed-feedback settings. Experiments on MMLU, MMLU-Pro, and GSM8K show that ORCH consistently outperforms single-model baselines and a majority-vote ensemble. On MMLU-Pro, ORCH improves accuracy by over 10 points compared to the strongest baseline, and on GSM8K it yields gains exceeding 50 points; McNemar tests confirm statistical significance. The EMA router provides an additional 0.7–2.0 point accuracy boost, and ablations show that both multi-agent collaboration and routing contribute substantially. Overall, ORCH offers a practical path toward controllable, interpretable, and deployment-ready LLM-based agent systems for discrete-choice reasoning.

[823] INDIBATOR: Diverse and Fact-Grounded Individuality for Multi-Agent Debate in Molecular Discovery

Yunhui Jang, Seonghyun Park, Jaehyung Kim, Sungsoo Ahn

Main category: cs.AI

TL;DR: INDIBATOR is a multi-agent framework for molecular discovery that grounds agents in individualized scientist profiles using publication history and molecular history, enabling more nuanced scientific debate than generic role-based personas.

Details

Motivation: Current multi-agent systems for scientific discovery use generic role-based personas that oversimplify how real scientists operate. Human scientists' contributions are shaped by their unique research trajectories, which current frameworks fail to capture.

Method: Proposes INDIBATOR framework that grounds agents in individualized scientist profiles constructed from two modalities: publication history (for literature-derived knowledge) and molecular history (for structural priors). Agents engage in multi-turn debate through proposal, critique, and voting phases.

Result: Fine-grained individuality-grounded agents consistently outperform systems relying on coarse-grained personas, achieving competitive or state-of-the-art performance in molecular discovery tasks.

Conclusion: Capturing the “scientific DNA” of individual agents through multimodal profiles (publication and molecular history) is essential for high-quality scientific discovery in multi-agent systems.

Abstract: Multi-agent systems have emerged as a powerful paradigm for automating scientific discovery. To differentiate agent behavior in the multi-agent system, current frameworks typically assign generic role-based personas such as ‘‘reviewer’’ or ‘‘writer’’ or rely on coarse grained keyword-based personas. While functional, this approach oversimplifies how human scientists operate, whose contributions are shaped by their unique research trajectories. In response, we propose INDIBATOR, a framework for molecular discovery that grounds agents in individualized scientist profiles constructed from two modalities: publication history for literature-derived knowledge and molecular history for structural priors. These agents engage in multi-turn debate through proposal, critique, and voting phases. Our evaluation demonstrates that these fine-grained individuality-grounded agents consistently outperform systems relying on coarse-grained personas, achieving competitive or state-of-the-art performance. These results validate that capturing the ``scientific DNA’’ of individual agents is essential for high-quality discovery.

[824] Synesthesia of Vehicles: Tactile Data Synthesis from Visual Inputs

Rui Wang, Yaoguang Cao, Yuyi Chen, Jianyi Xu, Zhuoyang Li, Jiachen Shang, Shichun Yang

Main category: cs.AI

TL;DR: A novel framework called Synesthesia of Vehicles (SoV) that predicts tactile excitations from visual inputs for autonomous vehicles using cross-modal spatiotemporal alignment and a visual-tactile synesthetic generative model based on latent diffusion.

Details

Motivation: Current autonomous vehicles rely on visual and optical sensors that fail to detect road-induced excitations critical for dynamic control. The paper is inspired by human synesthesia to create a cross-modal perception system that can predict tactile information from visual inputs.

Method: 1) Cross-modal spatiotemporal alignment method to address temporal and spatial disparities between visual and tactile data. 2) Visual-tactile synesthetic (VTSyn) generative model using latent diffusion for unsupervised high-quality tactile data synthesis. 3) Real-vehicle perception system to collect multi-modal dataset across diverse road and lighting conditions.

Result: Extensive experiments show that VTSyn outperforms existing models in temporal, frequency, and classification performance. The framework enhances AV safety through proactive tactile perception.

Conclusion: The proposed SoV framework successfully enables autonomous vehicles to predict tactile excitations from visual inputs, addressing a critical gap in current sensor systems and improving vehicle safety through cross-modal perception inspired by human synesthesia.

Abstract: Autonomous vehicles (AVs) rely on multi-modal fusion for safety, but current visual and optical sensors fail to detect road-induced excitations which are critical for vehicles’ dynamic control. Inspired by human synesthesia, we propose the Synesthesia of Vehicles (SoV), a novel framework to predict tactile excitations from visual inputs for autonomous vehicles. We develop a cross-modal spatiotemporal alignment method to address temporal and spatial disparities. Furthermore, a visual-tactile synesthetic (VTSyn) generative model using latent diffusion is proposed for unsupervised high-quality tactile data synthesis. A real-vehicle perception system collected a multi-modal dataset across diverse road and lighting conditions. Extensive experiments show that VTSyn outperforms existing models in temporal, frequency, and classification performance, enhancing AV safety through proactive tactile perception.

[825] SOPRAG: Multi-view Graph Experts Retrieval for Industrial Standard Operating Procedures

Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen

Main category: cs.AI

TL;DR: SOPRAG is a specialized RAG framework for industrial SOP retrieval that uses Mixture-of-Experts with Entity, Causal, and Flow graph experts, coordinated by Procedure Cards and LLM-guided gating, outperforming standard RAG methods.

Details

Motivation: Standard RAG paradigms fail to address unique challenges in industrial SOP retrieval: rigid proprietary structures, condition-dependent relevance, and actionable execution requirements. Industrial environments need specialized retrieval systems that understand structural and logical complexities.

Method: Proposes SOPRAG framework with three specialized experts: Entity, Causal, and Flow graph experts to handle industrial complexities. Uses Procedure Card layer to prune search space and LLM-Guided gating mechanism to dynamically weight experts. Also introduces automated multi-agent workflow for benchmark construction due to domain data scarcity.

Result: Extensive experiments across four industrial domains show SOPRAG significantly outperforms lexical, dense, and graph-based RAG baselines in both retrieval accuracy and response utility, achieving perfect execution scores in real-world critical tasks.

Conclusion: SOPRAG effectively addresses industrial SOP retrieval challenges through specialized expert design and intelligent coordination mechanisms, demonstrating superior performance over existing RAG approaches in practical industrial applications.

Abstract: Standard Operating Procedures (SOPs) are essential for ensuring operational safety and consistency in industrial environments. However, retrieving and following these procedures presents unique challenges, such as rigid proprietary structures, condition-dependent relevance, and actionable execution requirement, which standard semantic-driven Retrieval-Augmented Generation (RAG) paradigms fail to address. Inspired by the Mixture-of-Experts (MoE) paradigm, we propose SOPRAG, a novel framework specifically designed to address the above pain points in SOP retrieval. SOPRAG replaces flat chunking with specialized Entity, Causal, and Flow graph experts to resolve industrial structural and logical complexities. To optimize and coordinate these experts, we propose a Procedure Card layer that prunes the search space to eliminate computational noise, and an LLM-Guided gating mechanism that dynamically weights these experts to align retrieval with operator intent. To address the scarcity of domain-specific data, we also introduce an automated, multi-agent workflow for benchmark construction. Extensive experiments across four industrial domains demonstrate that SOPRAG significantly outperforms strong lexical, dense, and graph-based RAG baselines in both retrieval accuracy and response utility, achieving perfect execution scores in real-world critical tasks.

[826] ProcMEM: Learning Reusable Procedural Memory from Experience via Non-Parametric PPO for LLM Agents

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, Jun Wang

Main category: cs.AI

TL;DR: ProcMEM enables LLM agents to autonomously learn procedural memory from experiences without parameter updates, improving computational efficiency and execution stability through reusable skills.

Details

Motivation: Current LLM-driven agents rely on on-the-fly reasoning for sequential decision-making, leading to computational redundancy and execution instability due to insufficient experience reuse in recurring scenarios.

Method: Proposes ProcMEM framework with Skill-MDP formalization to transform episodic narratives into executable Skills with activation/execution/termination conditions. Uses Non-Parametric PPO with semantic gradients for candidate generation and PPO Gate for skill verification, plus score-based maintenance for compact memory.

Result: Experimental results show superior reuse rates and significant performance gains with extreme memory compression across in-domain, cross-task, and cross-agent scenarios. Visualizations reveal transparent accumulation, refinement, and reuse of procedural knowledge.

Conclusion: ProcMEM effectively enables agents to learn and reuse procedural memory without parameter updates, facilitating long-term autonomy through transparent knowledge accumulation and refinement.

Abstract: LLM-driven agents demonstrate strong performance in sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and execution instability. To bridge this gap, we propose ProcMEM, a framework that enables agents to autonomously learn procedural memory from interaction experiences without parameter updates. By formalizing a Skill-MDP, ProcMEM transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non-Parametric PPO, which leverages semantic gradients for high-quality candidate generation and a PPO Gate for robust Skill verification. Through score-based maintenance, ProcMEM sustains compact, high-quality procedural memory. Experimental results across in-domain, cross-task, and cross-agent scenarios demonstrate that ProcMEM achieves superior reuse rates and significant performance gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how ProcMEM transparently accumulates, refines, and reuses procedural knowledge to facilitate long-term autonomy.

[827] Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models

Shidong Yang, Tongwen Huang, Hao Wen, Yong Wang, Li Chen, Xiangxiang Chu

Main category: cs.AI

TL;DR: EGT uses response entropy as an unsupervised proxy for annotation noise and sample difficulty to improve multimodal reward model training through entropy-guided data curation and progressive training.

Details

Motivation: Multimodal reward models need better alignment with human preferences, but current training suffers from noisy preference datasets and inefficient training methods that ignore sample difficulty differences.

Method: Proposes Entropy-Guided Training (EGT) with two strategies: 1) entropy-guided data curation to filter unreliable samples, and 2) entropy-guided progressive training that introduces more complex examples based on entropy levels.

Result: Extensive experiments across three benchmarks show EGT-trained models consistently outperform state-of-the-art multimodal reward models.

Conclusion: Response entropy is a reliable unsupervised proxy for annotation noise and sample difficulty, and EGT effectively improves multimodal reward model training and performance.

Abstract: Multimodal reward models are crucial for aligning multimodal large language models with human preferences. Recent works have incorporated reasoning capabilities into these models, achieving promising results. However, training these models suffers from two critical challenges: (1) the inherent noise in preference datasets, which degrades model performance, and (2) the inefficiency of conventional training methods, which ignore the differences in sample difficulty. In this paper, we identify a strong correlation between response entropy and accuracy, indicating that entropy can serve as a reliable and unsupervised proxy for annotation noise and sample difficulty. Based on this insight, we propose a novel Entropy-Guided Training (EGT) approach for multimodal reasoning reward models, which combines two strategies: (1) entropy-guided data curation to mitigate the impact of unreliable samples, and (2) an entropy-guided training strategy that progressively introduces more complex examples. Extensive experiments across three benchmarks show that the EGT-trained model consistently outperforms state-of-the-art multimodal reward models.

[828] Geometric Analysis of Token Selection in Multi-Head Attention

Timur Mudarisov, Mikhal Burtsev, Tatiana Petrova, Radu State

Main category: cs.AI

TL;DR: A geometric framework for analyzing multi-head attention in LLMs using top-N selection metrics (Precision, Recall, F-score) to quantify token separability with theoretical bounds and empirical validation across models.

Details

Motivation: To provide a geometric understanding of how attention mechanisms work in LLMs, moving beyond standard attention analysis to quantify token selection behavior using geometric metrics that offer interpretability and insights for model design.

Method: Views attention through top-N selection lens in value-state space, defines geometric metrics (Precision, Recall, F-score), derives non-asymptotic bounds with explicit dependence on dimension and margin under empirical assumptions, and validates across LLaMA-2-7B, Gemma-7B, and Mistral-7B models.

Result: Theory predicts small-N regime of strongest non-trivial separability; empirical measurements closely track theoretical envelopes; identifies three head specialization regimes in LLaMA-2-7B (Retriever, Mixer, Reset) with distinct geometric signatures.

Conclusion: Attention behaves as a structured geometric classifier with measurable token selection criteria, offering head-level interpretability and informing geometry-aware sparsification and attention design in LLMs.

Abstract: We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics - Precision, Recall, and F-score - to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, sink similarity correlates with Recall. We also found that in LLaMA-2-7B heads specialize into three regimes - Retriever, Mixer, Reset - with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head level interpretability and informing geometry-aware sparsification and design of attention in LLMs.

[829] DomusFM: A Foundation Model for Smart-Home Sensor Data

Michele Fiori, Gabriele Civitarese, Flora D. Salim, Claudio Bettini

Main category: cs.AI

TL;DR: DomusFM is a foundation model for smart-home sensor data that uses self-supervised dual contrastive learning to capture semantic and temporal patterns, addressing data scarcity and privacy concerns in activity recognition.

Details

Motivation: Smart-home sensor data has potential for healthcare and assistive technologies, but existing approaches have limitations: supervised models need too much labeled data, foundation models focus only on inertial sensors, and LLM-based approaches have privacy/cost issues with natural language descriptions.

Method: DomusFM uses self-supervised dual contrastive learning to capture both token-level semantic attributes and sequence-level temporal dependencies. It integrates semantic embeddings from a lightweight language model with specialized encoders for temporal patterns and binary states.

Result: DomusFM outperforms state-of-the-art baselines across seven public smart-home datasets in leave-one-dataset-out evaluation, achieving superior performance even with only 5% of labeled training data for fine-tuning.

Conclusion: DomusFM is the first foundation model specifically designed for smart-home sensor data, addressing data scarcity while maintaining practical deployability for real-world smart-home systems.

Abstract: Smart-home sensor data holds significant potential for several applications, including healthcare monitoring and assistive technologies. Existing approaches, however, face critical limitations. Supervised models require impractical amounts of labeled data. Foundation models for activity recognition focus only on inertial sensors, failing to address the unique characteristics of smart-home binary sensor events: their sparse, discrete nature combined with rich semantic associations. LLM-based approaches, while tested in this domain, still raise several issues regarding the need for natural language descriptions or prompting, and reliance on either external services or expensive hardware, making them infeasible in real-life scenarios due to privacy and cost concerns. We introduce DomusFM, the first foundation model specifically designed and pretrained for smart-home sensor data. DomusFM employs a self-supervised dual contrastive learning paradigm to capture both token-level semantic attributes and sequence-level temporal dependencies. By integrating semantic embeddings from a lightweight language model and specialized encoders for temporal patterns and binary states, DomusFM learns generalizable representations that transfer across environments and tasks related to activity and event analysis. Through leave-one-dataset-out evaluation across seven public smart-home datasets, we demonstrate that DomusFM outperforms state-of-the-art baselines on different downstream tasks, achieving superior performance even with only 5% of labeled training data available for fine-tuning. Our approach addresses data scarcity while maintaining practical deployability for real-world smart-home systems.

[830] Large Language Model and Formal Concept Analysis: a comparative study for Topic Modeling

Fabrice Boissier, Monica Sen, Irina Rychkova

Main category: cs.AI

TL;DR: Comparison of Large Language Models (GPT-5) and Formal Concept Analysis for topic modeling, with experiments on teaching materials and research articles

Details

Motivation: Few works study LLMs for topic modeling, and while FCA has been presented as a candidate for topic modeling, no real applied case study has been conducted. The paper aims to compare LLM and FCA to understand their strengths and weaknesses in topic modeling.

Method: Uses GPT-5 in zero-shot setup with three prompts: topic generation from document batches, merging batch results into final topics, and topic labeling. FCA is evaluated through CREA pipeline. Two experiments: one reusing teaching materials previously used to evaluate CREA, another analyzing 40 research articles in information systems.

Result: The paper compares extracted topics with underlying subfields in information systems research, evaluating both LLM and FCA approaches for topic modeling effectiveness.

Conclusion: Provides comparative analysis of LLM vs FCA for topic modeling, identifying strengths and weaknesses of each approach for text analysis tasks.

Abstract: Topic modeling is a research field finding increasing applications: historically from document retrieving, to sentiment analysis and text summarization. Large Language Models (LLM) are currently a major trend in text processing, but few works study their usefulness for this task. Formal Concept Analysis (FCA) has recently been presented as a candidate for topic modeling, but no real applied case study has been conducted. In this work, we compare LLM and FCA to better understand their strengths and weakneses in the topic modeling field. FCA is evaluated through the CREA pipeline used in past experiments on topic modeling and visualization, whereas GPT-5 is used for the LLM. A strategy based on three prompts is applied with GPT-5 in a zero-shot setup: topic generation from document batches, merging of batch results into final topics, and topic labeling. A first experiment reuses the teaching materials previously used to evaluate CREA, while a second experiment analyzes 40 research articles in information systems to compare the extracted topics with the underling subfields.

[831] Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

Main category: cs.AI

TL;DR: GPS uses Bayesian inference with a lightweight generative model to select informative prompts for efficient reinforcement learning in LLM reasoning, improving training and test-time efficiency.

Details

Motivation: Reinforcement learning for LLM reasoning is computationally expensive due to rollout-intensive optimization. Current prompt selection methods are either costly or lack generalization across prompts.

Method: Introduces Generalizable Predictive Prompt Selection (GPS) that performs Bayesian inference on prompt difficulty using a lightweight generative model trained on shared optimization history. Incorporates intermediate-difficulty prioritization and history-anchored diversity for batch acquisition.

Result: Experiments across varied reasoning benchmarks show GPS achieves substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods.

Conclusion: GPS provides an effective solution for efficient prompt selection in RL-enhanced LLM reasoning, with good generalization capabilities and computational efficiency.

Abstract: Reinforcement learning enhances the reasoning capabilities of large language models but often involves high computational costs due to rollout-intensive optimization. Online prompt selection presents a plausible solution by prioritizing informative prompts to improve training efficiency. However, current methods either depend on costly, exact evaluations or construct prompt-specific predictive models lacking generalization across prompts. This study introduces Generalizable Predictive Prompt Selection (GPS), which performs Bayesian inference towards prompt difficulty using a lightweight generative model trained on the shared optimization history. Intermediate-difficulty prioritization and history-anchored diversity are incorporated into the batch acquisition principle to select informative prompt batches. The small predictive model also generalizes at test-time for efficient computational allocation. Experiments across varied reasoning benchmarks indicate GPS’s substantial improvements in training efficiency, final performance, and test-time efficiency over superior baseline methods.

[832] Evolving from Tool User to Creator via Training-Free Experience Reuse in Multimodal Reasoning

Xintian Shen, Jiawei Chen, Lihao Zheng, Hao Ma, Tao Wei, Kun Zhan

Main category: cs.AI

TL;DR: UCT framework enables LLMs to create and update tools automatically during inference by harvesting reasoning experiences, transforming agents from tool users to tool creators without additional training.

Details

Motivation: Existing Tool-Integrated Reasoning models rely on fixed, manually constructed tools that fail to handle open-ended problems, lack self-optimization, and require significant manual effort, limiting their applicability in real-world scenarios.

Method: UCT is a training-free framework that extracts implicit problem-solving capabilities from LLM reasoning traces, distills them into reusable tools, and employs memory consolidation to maintain a tool library that self-updates during inference.

Result: Achieves significant performance gains of +20.86% and +23.04% on multi-domain mathematical and scientific reasoning benchmarks, demonstrating effective self-evolving capabilities.

Conclusion: UCT provides a novel paradigm for enhancing TIR models through automated tool construction and self-updating during reasoning, enabling continuous improvement without additional training.

Abstract: Existing Tool-Integrated Reasoning (TIR) models have effectively extended the question-answering capabilities of LLMs by incorporating external tools. However, real-world scenarios present numerous open-ended problems where fixed tools often fail to meet task requirements. Furthermore, the lack of self-optimization mechanisms means that erroneous tool outputs can mislead the LLM’s responses. Additionally, the construction of existing tools entails significant manual effort, which consequently constrains their applicability. Recognizing that the reasoning traces of LLMs encapsulate implicit problem-solving capabilities, we propose UCT, a novel training-free framework that transforms agents from tool users to tool creators. This approach harvests reasoning experiences and distills them into reusable assets. This method transforms the agent from a mere tool user into a tool creator, enabling adaptive tool creation and self-updating during the inference process. We also introduce a memory consolidation mechanism to maintain the tool library, ensuring high reusability of retained experiential memory for subsequent reasoning tasks. This novel automated tool construction paradigm continuously improves tool quality during reasoning, allowing the overall agent system to progress without additional training. Extensive experiments demonstrate that our method serves as a novel paradigm for enhancing the capabilities of TIR models. In particular, the significant performance gains achieved +20.86%$\uparrow$ and +23.04%$\uparrow$ on benchmarks across multi-domain mathematical and scientific reasoning tasks validate the self-evolving capability of the agent.

[833] Emergent Analogical Reasoning in Transformers

Gouki Minegishi, Jingyuan Feng, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.AI

TL;DR: Transformers can learn analogical reasoning through geometric alignment of relational structures and functor-like operations, with emergence sensitive to data, optimization, and scale.

Details

Motivation: To understand how Transformers acquire and implement analogical reasoning, moving from abstract cognitive notion to concrete mechanistic phenomenon in neural networks.

Method: Formalized analogical reasoning using category theory functors, created synthetic evaluation tasks, analyzed emergence under controlled settings, and conducted mechanistic analysis of Transformer operations.

Result: Analogical reasoning emerges through two key components: geometric alignment of relational structure in embedding space and application of functor-like operations within Transformers, with emergence highly sensitive to data characteristics, optimization choices, and model scale.

Conclusion: Analogy can be understood as a concrete, mechanistically grounded phenomenon in neural networks involving relational structure transfer across categories through specific geometric and functional operations.

Abstract: Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.

[834] Thinking Like a Doctor: Conversational Diagnosis through the Exploration of Diagnostic Knowledge Graphs

Jeongmoon Won, Seungwon Kook, Yohan Jo

Main category: cs.AI

TL;DR: A conversational diagnosis system that uses a diagnostic knowledge graph for two-step reasoning: generating hypotheses from dialogue context and verifying them through clarifying questions, evaluated with a realistic patient simulator adapted from MIMIC-IV data.

Details

Motivation: Existing conversational diagnosis approaches rely too heavily on model parametric knowledge or assume patients provide rich, concrete information, which is unrealistic in real clinical settings where patients often describe symptoms vaguely.

Method: Proposes a system that explores a diagnostic knowledge graph in two steps: (1) generating diagnostic hypotheses from dialogue context, and (2) verifying hypotheses through clarifying questions iteratively until final diagnosis. Uses MIMIC-IV patient profiles with adapted simulator that describes symptoms vaguely to reflect real-world early clinical encounters.

Result: Shows improved diagnostic accuracy and efficiency over strong baselines. Physician evaluations support the realism of the simulator and clinical utility of generated questions.

Conclusion: The proposed conversational diagnosis system with knowledge graph reasoning and realistic patient simulation demonstrates clinical utility and improved performance over existing approaches.

Abstract: Conversational diagnosis requires multi-turn history-taking, where an agent asks clarifying questions to refine differential diagnoses under incomplete information. Existing approaches often rely on the parametric knowledge of a model or assume that patients provide rich and concrete information, which is unrealistic. To address these limitations, we propose a conversational diagnosis system that explores a diagnostic knowledge graph to reason in two steps: (i) generating diagnostic hypotheses from the dialogue context, and (ii) verifying hypotheses through clarifying questions, which are repeated until a final diagnosis is reached. Since evaluating the system requires a realistic patient simulator that responds to the system’s questions, we adopt a well-established simulator along with patient profiles from MIMIC-IV. We further adapt it to describe symptoms vaguely to reflect real-world patients during early clinical encounters. Experiments show improved diagnostic accuracy and efficiency over strong baselines, and evaluations by physicians support the realism of our simulator and the clinical utility of the generated questions. Our code will be released upon publication.

[835] Do I Really Know? Learning Factual Self-Verification for Hallucination Reduction

Enes Altinisik, Masoomali Fatehkia, Fatih Deniz, Nadir Durrani, Majd Hawasly, Mohammad Raza, Husrev Taha Sencar

Main category: cs.AI

TL;DR: VeriFY is a training-time framework that teaches LLMs to reason about factual uncertainty through consistency-based self-verification, reducing factual hallucinations while maintaining recall.

Details

Motivation: Factual hallucination remains a central challenge for LLMs, and existing approaches either rely on external verification or result in overly conservative abstention behavior during fine-tuning.

Method: VeriFY augments training with structured verification traces that guide models through: 1) producing initial answer, 2) generating and answering probing verification query, 3) issuing consistency judgment, and 4) deciding whether to answer or abstain. Uses stage-level loss masking to exclude hallucinated answer stages while preserving verification supervision.

Result: Reduces factual hallucination rates by 9.7 to 53.3% across multiple model families and scales, with only modest reductions in recall (0.4 to 5.7%), and generalizes across datasets when trained on a single source.

Conclusion: VeriFY effectively teaches LLMs to reason about factual uncertainty through self-verification, significantly reducing hallucinations while maintaining performance, with good generalization capabilities.

Abstract: Factual hallucination remains a central challenge for large language models (LLMs). Existing mitigation approaches primarily rely on either external post-hoc verification or mapping uncertainty directly to abstention during fine-tuning, often resulting in overly conservative behavior. We propose VeriFY, a training-time framework that teaches LLMs to reason about factual uncertainty through consistency-based self-verification. VeriFY augments training with structured verification traces that guide the model to produce an initial answer, generate and answer a probing verification query, issue a consistency judgment, and then decide whether to answer or abstain. To address the risk of reinforcing hallucinated content when training on augmented traces, we introduce a stage-level loss masking approach that excludes hallucinated answer stages from the training objective while preserving supervision over verification behavior. Across multiple model families and scales, VeriFY reduces factual hallucination rates by 9.7 to 53.3 percent, with only modest reductions in recall (0.4 to 5.7 percent), and generalizes across datasets when trained on a single source. The source code, training data, and trained model checkpoints will be released upon acceptance.

[836] Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron

Sicheng Shen, Mingyang Lv, Han Shen, Jialin Wu, Binghao Wang, Zhou Yang, Guobin Shen, Dongcheng Zhao, Feifei Zhao, Yi Zeng

Main category: cs.AI

TL;DR: A lightweight safety alignment method using a single-neuron gating mechanism that balances model capabilities with external safety guidance, requiring minimal training overhead while preserving utility.

Details

Motivation: Existing safety alignment methods for LLMs are computationally expensive, fail to generalize across models, and lightweight approaches either rely heavily on pre-computed safety injections or depend too much on model capabilities, leading to limited generalization and degraded efficiency.

Method: Proposes a safety-aware decoding method that requires only low-cost training of an expert model and uses a single neuron as a gating mechanism to balance the model’s intrinsic capabilities with external safety guidance.

Result: The approach demonstrates advantages in training overhead and generalization across model scales, preserving utility while enhancing output safety, offering a lightweight alignment solution for practical deployment.

Conclusion: The single-neuron gating method provides a new perspective on lightweight safety alignment for LLMs, balancing efficiency, generalization, and safety for practical deployment.

Abstract: The safety of large language models (LLMs) has increasingly emerged as a fundamental aspect of their development. Existing safety alignment for LLMs is predominantly achieved through post-training methods, which are computationally expensive and often fail to generalize well across different models. A small number of lightweight alignment approaches either rely heavily on prior-computed safety injections or depend excessively on the model’s own capabilities, resulting in limited generalization and degraded efficiency and usability during generation. In this work, we propose a safety-aware decoding method that requires only low-cost training of an expert model and employs a single neuron as a gating mechanism. By effectively balancing the model’s intrinsic capabilities with external guidance, our approach simultaneously preserves utility and enhances output safety. It demonstrates clear advantages in training overhead and generalization across model scales, offering a new perspective on lightweight alignment for the safe and practical deployment of large language models. Code: https://github.com/Beijing-AISI/NGSD.

[837] Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories

Ya Gao, Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin

Main category: cs.AI

TL;DR: A training strategy for LLMs to internalize new knowledge through contextual stories, multi-step reasoning questions, and knowledge distillation, enabling better integration of novel information into reasoning processes.

Details

Motivation: Current knowledge editing approaches for LLMs focus on atomic facts but fail to integrate new information into coherent frameworks usable across contexts. The paper argues knowledge internalization is fundamentally a reasoning problem rather than memorization.

Method: Three-principle training strategy: 1) Introduce new knowledge as coherent background stories contextualizing novel facts, 2) Train using self-generated multi-hop questions requiring multi-step reasoning with new information, 3) Use knowledge distillation to force student models to internalize teacher’s reasoning without access to novel information.

Result: Models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions requiring combination of multiple new facts.

Conclusion: The proposed approach successfully addresses knowledge internalization as a reasoning problem, enabling LLMs to integrate new information into coherent frameworks and apply it flexibly across contexts through multi-step reasoning.

Abstract: Enabling artificial intelligence systems, particularly large language models, to integrate new knowledge and flexibly apply it during reasoning remains a central challenge. Existing knowledge editing approaches emphasize atomic facts, improving factual recall but often failing to integrate new information into a coherent framework usable across contexts. In this work, we argue that knowledge internalization is fundamentally a reasoning problem rather than a memorization problem. Consequently, a model should be trained in situations where the new information is instrumental to solving a task, combined with pre-existing knowledge, and exercised through multi-step reasoning. Based on this insight, we propose a training strategy based on three principles. First, new knowledge is introduced as a coherent background story that contextualizes novel facts and explains their relation to existing knowledge. Second, models are trained using self-generated multi-hop questions that require multi-step reasoning involving the new information. Third, training is done using knowledge distillation, forcing a student model to internalize the teacher’s reasoning behavior without access to the novel information. Experiments show that models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions that require combining multiple new facts.

[838] Canonical Intermediate Representation for LLM-based optimization problem formulation and code generation

Zhongyuan Lyu, Shuoyu Hu, Lujie Liu, Hongxia Yang, Ming LI

Main category: cs.AI

TL;DR: R2C framework uses Canonical Intermediate Representation (CIR) schema to translate natural language operational rules into optimization models through multi-agent pipeline with knowledge retrieval and reflection mechanisms.

Details

Motivation: Current LLM-based approaches struggle with complex operational rules requiring composite constraints and appropriate modeling paradigms for optimization problems described in natural language.

Method: Introduces Canonical Intermediate Representation (CIR) schema that LLMs generate between problem descriptions and optimization models, encoding rule semantics through constraint archetypes and modeling paradigms. Develops R2C multi-agent pipeline that parses texts, synthesizes CIR implementations via domain knowledge retrieval, and instantiates optimization models.

Result: Achieves 47.2% Accuracy Rate on new benchmark with rich operational rules, competitive results on established benchmarks approaching proprietary models like GPT-5, and further gains with reflection mechanism setting new best-reported results.

Conclusion: CIR schema effectively decouples rule logic from mathematical instantiation, enabling systematic rule-to-constraint reasoning and state-of-the-art performance in translating natural language operational rules to optimization models.

Abstract: Automatically formulating optimization models from natural language descriptions is a growing focus in operations research, yet current LLM-based approaches struggle with the composite constraints and appropriate modeling paradigms required by complex operational rules. To address this, we introduce the Canonical Intermediate Representation (CIR): a schema that LLMs explicitly generate between problem descriptions and optimization models. CIR encodes the semantics of operational rules through constraint archetypes and candidate modeling paradigms, thereby decoupling rule logic from its mathematical instantiation. Upon a newly generated CIR knowledge base, we develop the rule-to-constraint (R2C) framework, a multi-agent pipeline that parses problem texts, synthesizes CIR implementations by retrieving domain knowledge, and instantiates optimization models. To systematically evaluate rule-to-constraint reasoning, we test R2C on our newly constructed benchmark featuring rich operational rules, and benchmarks from prior work. Extensive experiments show that R2C achieves state-of-the-art accuracy on the proposed benchmark (47.2% Accuracy Rate). On established benchmarks from the literature, R2C delivers highly competitive results, approaching the performance of proprietary models (e.g., GPT-5). Moreover, with a reflection mechanism, R2C achieves further gains and sets new best-reported results on some benchmarks.

[839] Constrained Process Maps for Multi-Agent Generative AI Workflows

Ananya Joshi, Michael Rudow

Main category: cs.AI

TL;DR: Multi-agent system formalized as finite-horizon MDP for regulated workflows, improving accuracy and reducing human review needs compared to single-agent approaches.

Details

Motivation: LLM-based agents are increasingly used in complex, regulated workflows (compliance, due diligence), but current architectures rely on single-agent prompt engineering, making it difficult to observe uncertainty handling and coordination across decision stages with human oversight.

Method: Introduce multi-agent system formalized as finite-horizon Markov Decision Process with directed acyclic structure. Each agent corresponds to specific role/decision stage (content, business, legal review). Quantify epistemic uncertainty at agent level using Monte Carlo estimation, while system-level uncertainty captured by MDP termination in either automated labeled state or human-review state.

Result: Case study in AI safety evaluation for self-harm detection shows improvements over single-agent baseline: up to 19% increase in accuracy, up to 85x reduction in required human review, and reduced processing time in some configurations.

Conclusion: Multi-agent MDP framework provides structured approach for LLM-based workflows in regulated settings, enabling better uncertainty quantification, coordination across decision stages, and human oversight while improving performance metrics.

Abstract: Large language model (LLM)-based agents are increasingly used to perform complex, multi-step workflows in regulated settings such as compliance and due diligence. However, many agentic architectures rely primarily on prompt engineering of a single agent, making it difficult to observe or compare how models handle uncertainty and coordination across interconnected decision stages and with human oversight. We introduce a multi-agent system formalized as a finite-horizon Markov Decision Process (MDP) with a directed acyclic structure. Each agent corresponds to a specific role or decision stage (e.g., content, business, or legal review in a compliance workflow), with predefined transitions representing task escalation or completion. Epistemic uncertainty is quantified at the agent level using Monte Carlo estimation, while system-level uncertainty is captured by the MDP’s termination in either an automated labeled state or a human-review state. We illustrate the approach through a case study in AI safety evaluation for self-harm detection, implemented as a multi-agent compliance system. Results demonstrate improvements over a single-agent baseline, including up to a 19% increase in accuracy, up to an 85x reduction in required human review, and, in some configurations, reduced processing time.

[840] Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

Main category: cs.AI

TL;DR: LLMs need investigatory intelligence (autonomous goal-setting and exploration) beyond executional intelligence (task completion). DDR benchmark tests this using open-ended data analysis tasks.

Details

Motivation: Current LLM benchmarks focus on executional intelligence (completing assigned tasks) but real-world applications require investigatory intelligence - the ability to autonomously set goals and explore data without explicit queries. Data science provides a natural testbed since real analysis starts from raw data rather than explicit questions.

Method: Introduces Deep Data Research (DDR) - an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench - a large-scale, checklist-based benchmark for verifiable evaluation of investigatory intelligence.

Result: Frontier models show emerging agency capabilities, but long-horizon exploration remains challenging. Effective investigatory intelligence depends not just on agent scaffolding or model scaling, but on intrinsic strategies of agentic models.

Conclusion: Investigatory intelligence is a distinct capability from executional intelligence, requiring autonomous goal-setting and exploration. The DDR benchmark provides a way to measure this emerging capability in LLMs, revealing current limitations in long-horizon exploration.

Abstract: The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

[841] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents

Zeping Li, Hongru Wang, Yiwen Zhao, Guanhua Chen, Yixia Li, Keyang Chen, Yixin Cao, Guangnan Ye, Hongfeng Chai, Mengdi Wang, Zhenfei Yin

Main category: cs.AI

TL;DR: LLM-based tool-using agents often make excessive low-quality tool calls in long trajectories. The paper proposes using entropy reduction as a supervisory signal with two reward strategies to optimize tool-use behavior.

Details

Motivation: Current LLM-based tool-using agents struggle with excessive and low-quality tool calls in long trajectories, increasing latency and degrading inference performance. There's a need for better management of tool-use behavior to improve efficiency and performance.

Method: The authors conduct entropy-based pilot experiments showing correlation between entropy reduction and high-quality tool calls. They propose using entropy reduction as a supervisory signal with two reward strategies: 1) sparse outcome rewards for trajectory-level efficiency improvement, and 2) dense process rewards for fine-grained performance enhancement.

Result: Experiments across diverse domains show both reward designs improve tool-use behavior: sparse outcome rewards reduce tool calls by 72.07% compared to baseline averages, while dense process rewards improve performance by 22.27%.

Conclusion: Entropy reduction serves as a key mechanism for enhancing tool-use behavior in LLM-based agents, enabling more adaptive performance in real-world applications through optimized tool selection and usage.

Abstract: Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, making managing tool-use behavior challenging. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.

[842] SIDiffAgent: Self-Improving Diffusion Agent

Shivank Garg, Ayush Singh, Gaurav Kumar Nayak

Main category: cs.AI

TL;DR: SIDiffAgent is a training-free agentic framework that uses Qwen models to autonomously improve text-to-image diffusion outputs through prompt engineering, artifact detection/correction, and iterative self-improvement with memory.

Details

Motivation: Text-to-image diffusion models have limitations including sensitivity to prompt phrasing, semantic ambiguity, anatomical artifacts, and need for engineered prompts. Existing methods require additional training and offer limited controllability, hindering practical deployment.

Method: A training-free agentic framework leveraging Qwen family models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to autonomously manage prompt engineering, detect/correct poor generations, perform artifact removal, and incorporate iterative self-improvement using a memory database of past experiences.

Result: Achieved average VQA score of 0.884 on GenAIBench, significantly outperforming open-source, proprietary models and other agentic methods.

Conclusion: SIDiffAgent provides a training-free solution that addresses key limitations of text-to-image diffusion models through autonomous agentic control and iterative self-improvement, enabling more reliable and consistent image generation.

Abstract: Text-to-image diffusion models have revolutionized generative AI, enabling high-quality and photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., ``mouse" as animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact removal, yielding more reliable and consistent outputs. It further incorporates iterative self-improvement by storing a memory of previous experiences in a database. This database of past experiences is then used to inject prompt-based guidance at each stage of the agentic pipeline. \modelour achieved an average VQA score of 0.884 on GenAIBench, significantly outperforming open-source, proprietary models and agentic methods. We will publicly release our code upon acceptance.

[843] Understanding the Reversal Curse Mitigation in Masked Diffusion Models through Attention and Training Dynamics

Sangwoo Shin, BumJun Kim, Kyelim Lee, Moongyu Jeon, Albert No

Main category: cs.AI

TL;DR: MDMs (masked diffusion-based language models) show weaker reversal curse than ARMs due to architectural weight sharing and gradient alignment, not just any-order training.

Details

Motivation: To understand why masked diffusion-based language models (MDMs) exhibit the reversal curse in a much weaker form than autoregressive language models (ARMs), despite both potentially learning from similar data patterns.

Method: Analyzed architectural differences between ARMs and MDMs, focusing on one-layer Transformer encoders with weight sharing. Examined how weight sharing couples forward and reverse attention scores, and showed gradient alignment between forward and reverse losses. Conducted experiments on controlled toy tasks and large-scale diffusion language models.

Result: MDMs partially overcome the reversal curse due to architectural structure (weight sharing in Transformer encoders) that creates positive correlation between forward and reverse attention scores, and gradient alignment that minimizes both forward and reverse losses simultaneously.

Conclusion: The mitigation of reversal curse in MDMs stems from architectural design and its interaction with training dynamics, not merely from any-order training objectives, explaining why MDMs perform better on reverse queries than ARMs.

Abstract: Autoregressive language models (ARMs) suffer from the reversal curse: after learning that “$A$ is $B$”, they often fail on the reverse query “$B$ is $A$”. Masked diffusion-based language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to the any-order training objective. However, observing “[MASK] is $B$” during training does not necessarily teach the model to handle the reverse prompt “$B$ is [MASK]”. We show that the mitigation arises from architectural structure and its interaction with training. In a one-layer Transformer encoder, weight sharing couples the two directions by making forward and reverse attention scores positively correlated. In the same setting, we further show that the corresponding gradients are aligned, so minimizing the forward loss also reduces the reverse loss. Experiments on both controlled toy tasks and large-scale diffusion language models support these mechanisms, explaining why MDMs partially overcome a failure mode that persists in strong ARMs.

Yingsha Xie, Tiansheng Huang, Enneng Yang, Rui Min, Wenjie Lu, Xiaochun Cao, Naiqiang Tan, Li Shen

Main category: cs.AI

TL;DR: DGR method reduces safety alignment’s negative impact on reasoning by transforming safety datasets to match target model’s distribution, achieving +30.2% reasoning accuracy improvement while maintaining safety.

Details

Motivation: Safety alignment causes "safety tax" that degrades large reasoning models' general reasoning abilities. Existing safety datasets create distributional gaps with target models, which the authors hypothesize is the main cause of reasoning degradation.

Method: Proposes DGR (Distribution Gap Reduction) method that transforms and refines existing out-of-distribution safety reasoning datasets to align with the target LLM’s inner distribution, reducing the distributional gap.

Result: DGR achieves +30.2% improvement on DirectRefusal and +21.2% on R1-ACT in average reasoning accuracy compared to vanilla SFT while maintaining safety performance. Shows that reasoning degradation correlates with distribution shift extent.

Conclusion: Distributional consistency is crucial for preserving reasoning capabilities during safety alignment. Safety alignment may primarily function as an activation mechanism for latent knowledge, with just 10 samples sufficient to activate effective refusal behaviors.

Abstract: Safety alignment incurs safety tax that perturbs a large reasoning model’s (LRM) general reasoning ability. Existing datasets used for safety alignment for an LRM are usually constructed by distilling safety reasoning traces and answers from an external LRM or human labeler. However, such reasoning traces and answers exhibit a distributional gap with the target LRM that needs alignment, and we conjecture such distributional gap is the culprit leading to significant degradation of reasoning ability of the target LRM. Driven by this hypothesis, we propose a safety alignment dataset construction method, dubbed DGR. DGR transforms and refines an existing out-of-distributional safety reasoning dataset to be aligned with the target’s LLM inner distribution. Experimental results demonstrate that i) DGR effectively mitigates the safety tax while maintaining safety performance across all baselines, i.e., achieving \textbf{+30.2%} on DirectRefusal and \textbf{+21.2%} on R1-ACT improvement in average reasoning accuracy compared to Vanilla SFT; ii) the degree of reasoning degradation correlates with the extent of distribution shift, suggesting that bridging this gap is central to preserving capabilities. Furthermore, we find that safety alignment in LRMs may primarily function as a mechanism to activate latent knowledge, as a mere \textbf{10} samples are sufficient for activating effective refusal behaviors. These findings not only emphasize the importance of distributional consistency but also provide insights into the activation mechanism of safety in reasoning models.

Sarah Nassar

Main category: cs.AI

TL;DR: Comparison of three graph search algorithms (Floyd-Warshall-Ingerman, Dijkstra’s/A*, and Yen’s) for traffic-aware navigation in urban road networks, evaluating trade-offs between preprocessing time, real-time performance, and traffic awareness.

Details

Motivation: The need to find optimal traffic-aware navigation solutions in urban road networks, balancing computational efficiency with real-time performance and traffic consideration.

Method: Three approaches were compared: 1) Floyd-Warshall-Ingerman (single-run multi-query preprocessing), 2) Dijkstra’s and A* (continuous single-query real-time search), and 3) Yen’s algorithm (combining both approaches by finding top K shortest paths then iterating in real time).

Result: Dijkstra’s and A* provided the most traffic-aware optimal solutions with minimal preprocessing. Floyd-Warshall-Ingerman was fastest in real time but lacked traffic awareness. Yen’s algorithm balanced runtime speed and optimality but required significant preprocessing.

Conclusion: Each approach has specific advantages/disadvantages that must be weighed based on deployment context requirements, with no single best solution for all scenarios.

Abstract: This project compares three graph search approaches for the task of traffic-aware navigation in Kingston’s road network. These approaches include a single-run multi-query preprocessing algorithm (Floyd-Warshall-Ingerman), continuous single-query real-time search (Dijkstra’s and A*), and an algorithm combining both approaches to balance between their trade-offs by first finding the top K shortest paths then iterating over them in real time (Yen’s). Dijkstra’s and A* resulted in the most traffic-aware optimal solutions with minimal preprocessing required. Floyd-Warshall-Ingerman was the fastest in real time but provided distance based paths with no traffic awareness. Yen’s algorithm required significant preprocessing but balanced between the other two approaches in terms of runtime speed and optimality. Each approach presents advantages and disadvantages that need to be weighed depending on the circumstances of specific deployment contexts to select the best custom solution. *This project was completed as part of ELEC 844 (Search and Planning Algorithms for Robotics) in the Fall 2025 term.

[846] Reasoning in a Combinatorial and Constrained World: Benchmarking LLMs on Natural-Language Combinatorial Optimization

Xia Jiang, Jing Chen, Cong Zhang, Jie Gao, Chengpeng Hu, Chenhao Zhang, Yaoxin Wu, Yingqian Zhang

Main category: cs.AI

TL;DR: NLCO benchmark evaluates LLMs on combinatorial optimization problems described in natural language, requiring models to output discrete solutions without code or external solvers.

Details

Motivation: While LLMs excel at math and logic reasoning, their ability to handle combinatorial optimization problems - searching high-dimensional solution spaces under hard constraints - remains underexplored. The authors aim to bridge this gap by creating a comprehensive benchmark.

Method: Introduces NLCO benchmark covering 43 CO problems organized using a four-layer taxonomy: variable types, constraint families, global patterns, and objective classes. Provides solver-annotated solutions and evaluates LLMs on feasibility, solution optimality, and reasoning efficiency.

Result: High-performing LLMs achieve strong feasibility and solution quality on small instances, but performance degrades as instance size grows, even with more reasoning tokens. Set-based tasks are relatively easy, while graph-structured problems and bottleneck objectives lead to more frequent failures.

Conclusion: LLMs show promise in combinatorial optimization but struggle with scalability and complex problem structures, revealing systematic limitations that need to be addressed for broader CO applications.

Abstract: While large language models (LLMs) have shown strong performance in math and logic reasoning, their ability to handle combinatorial optimization (CO) – searching high-dimensional solution spaces under hard constraints – remains underexplored. To bridge the gap, we introduce NLCO, a \textbf{N}atural \textbf{L}anguage \textbf{C}ombinatorial \textbf{O}ptimization benchmark that evaluates LLMs on end-to-end CO reasoning: given a language-described decision-making scenario, the model must output a discrete solution without writing code or calling external solvers. NLCO covers 43 CO problems and is organized using a four-layer taxonomy of variable types, constraint families, global patterns, and objective classes, enabling fine-grained evaluation. We provide solver-annotated solutions and comprehensively evaluate LLMs by feasibility, solution optimality, and reasoning efficiency. Experiments across a wide range of modern LLMs show that high-performing models achieve strong feasibility and solution quality on small instances, but both degrade as instance size grows, even if more tokens are used for reasoning. We also observe systematic effects across the taxonomy: set-based tasks are relatively easy, whereas graph-structured problems and bottleneck objectives lead to more frequent failures.

[847] TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin

Main category: cs.AI

TL;DR: TIDE is a diagnostic evaluation framework for analyzing Test-Time Improvement (TTI) in autonomous LLM agents, focusing on three dimensions: task completion dynamics, recursive looping behaviors, and memory burden constraints.

Details

Motivation: While autonomous LLM agents show performance improvements through iterative environment interaction (Test-Time Improvement), the underlying mechanisms of success/failure are poorly understood, and existing metrics fail to capture optimization efficiency, behavior adaptation after errors, and working memory utility.

Method: Proposes TIDE (Test-time Improvement Diagnostic Evaluation), an agent-agnostic and environment-agnostic framework that decomposes TTI into three interconnected dimensions: (1) overall temporal dynamics of task completion, (2) identification of performance constraints from recursive looping behaviors, and (3) identification of constraints from burdensome accumulated memory.

Result: Through extensive experiments across diverse agents and environments, TIDE reveals that improving agent performance requires more than scaling internal reasoning, calling for explicit optimization of the interaction dynamics between agents and environments.

Conclusion: TIDE provides a comprehensive diagnostic framework for understanding and improving autonomous LLM agents’ test-time performance by analyzing interaction dynamics, recursive behaviors, and memory constraints.

Abstract: Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms under how and why TTI succeed or fail remain poorly understood, and existing evaluation metrics fail to capture their task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated memory. Through extensive experiments across diverse agents and environments, TIDE highlights that improving agent performance requires more than scaling internal reasoning, calling for explicitly optimizing the interaction dynamics between the agent and the environment.

[848] More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression

Aryan Sood, Tanvi Sharma, Vansh Agrawal

Main category: cs.AI

TL;DR: LASER-KV is a KV cache compression framework using layer accumulation with exact-LSH recall to maintain performance while reducing memory usage in long-context LLMs.

Details

Motivation: LLMs have theoretical long context support but face practical memory constraints due to linear KV cache growth. Existing compression methods trade semantic recall for memory efficiency, causing performance degradation in long-context tasks.

Method: Proposes LASER-KV framework with block-wise accumulation strategy governed by protection divisor (n), isolating compression effects from sliding window artifacts. Uses layer accumulated selection with exact-LSH recall instead of relying solely on attention scores for token utility.

Result: On Babilong benchmark, previous compression methods degrade by 15-30% on long context tasks, while LASER-KV maintains stable performance with up to 10% better accuracy at 128k context length.

Conclusion: LASER-KV challenges the assumption that attention scores alone are sufficient for token utility in KV compression, offering a more effective approach for long-context LLM deployment.

Abstract: While Large Language Models (LLMs) can theoretically support extensive context windows, their actual deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet trade-off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard fixed summary size approach by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding window artifacts. Our experiments on the Babilong benchmark reveal performance degradation in previous compression methods by 15-30% on various long context tasks. LASER-KV maintains stable performance, achieving superior accuracies by a margin of upto 10% at 128k. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.

[849] Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach

Martino Ciaperoni, Marzio Di Vece, Luca Pappalardo, Fosca Giannotti, Francesco Giannini

Main category: cs.AI

TL;DR: A framework for comparative explainable AI (Δ-XAI) that focuses on explaining behavioral shifts between model checkpoints rather than analyzing single models in isolation.

Details

Motivation: Large foundation models exhibit behavioral shifts after interventions like scaling, fine-tuning, or RLHF, but current XAI methods are ill-suited to explain what changed internally between different model versions.

Method: Proposes a Comparative XAI (Δ-XAI) framework with specific desiderata for explaining intervention-induced shifts between reference and intervened models, introduces possible pipelines, and demonstrates with a concrete experiment.

Result: A formal framework for comparative explanation of behavioral shifts in foundation models, with methodological guidelines and demonstration of how Δ-XAI methods work.

Conclusion: Behavioral shifts in foundation models should be explained comparatively rather than analyzing single models in isolation, requiring new XAI methods designed for this comparative purpose.

Abstract: Large-scale foundation models exhibit behavioral shifts: intervention-induced behavioral changes that appear after scaling, fine-tuning, reinforcement learning or in-context learning. While investigating these phenomena have recently received attention, explaining their appearance is still overlooked. Classic explainable AI (XAI) methods can surface failures at a single checkpoint of a model, but they are structurally ill-suited to justify what changed internally across different checkpoints and which explanatory claims are warranted about that change. We take the position that behavioral shifts should be explained comparatively: the core target should be the intervention-induced shift between a reference model and an intervened model, rather than any single model in isolation. To this aim we formulate a Comparative XAI ($Δ$-XAI) framework with a set of desiderata to be taken into account when designing proper explaining methods. To highlight how $Δ$-XAI methods work, we introduce a set of possible pipelines, relate them to the desiderata, and provide a concrete $Δ$-XAI experiment.

[850] Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

Changming Li, Kaixing Zhang, Haoyun Xu, Yingdong Shi, Zheng Zhang, Kaitao Song, Kan Ren

Main category: cs.AI

TL;DR: IPG framework identifies internal components responsible for complex reasoning in LLMs by propagating outcome-based signals backward through inference trajectories.

Details

Motivation: Current interpretability methods for LLM reasoning are limited - they either identify components correlated with textual patterns or rely on human-annotated contrastive pairs, struggling to precisely localize complex reasoning mechanisms or capture sequential influence from internal workings to reasoning outputs.

Method: Proposes Integrated Policy Gradient (IPG), a framework that attributes reasoning behaviors to model’s inner components by propagating compound outcome-based signals (like post-reasoning accuracy) backward through model inference trajectories, focusing on components with sequential contribution to reasoning behavior.

Result: Empirical evaluations show IPG achieves more precise localization and enables reliable modulation of reasoning behaviors (reasoning capability, reasoning strength) across diverse reasoning models.

Conclusion: IPG provides a novel approach to understanding LLM reasoning mechanisms by focusing on outcome-oriented and sequential-influence-aware principles, offering better interpretability and control over reasoning behaviors.

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities in solving complex real-world problems. Yet, the internal mechanisms driving these complex reasoning behaviors remain opaque. Existing interpretability approaches targeting reasoning either identify components (e.g., neurons) correlated with special textual patterns, or rely on human-annotated contrastive pairs to derive control vectors. Consequently, current methods struggle to precisely localize complex reasoning mechanisms or capture sequential influence from model internal workings to the reasoning outputs. In this paper, built on outcome-oriented and sequential-influence-aware principles, we focus on identifying components that have sequential contribution to reasoning behavior where outcomes are cumulated by long-range effects. We propose Integrated Policy Gradient (IPG), a novel framework that attributes reasoning behaviors to model’s inner components by propagating compound outcome-based signals such as post reasoning accuracy backward through model inference trajectories. Empirical evaluations demonstrate that our approach achieves more precise localization and enables reliable modulation of reasoning behaviors (e.g., reasoning capability, reasoning strength) across diverse reasoning models.

[851] Live-Evo: Online Evolution of Agentic Memory from Continuous Feedback

Yaolun Zhang, Yiran Wu, Yijiong Yu, Qingyun Wu, Huazheng Wang

Main category: cs.AI

TL;DR: Live-Evo: An online self-evolving memory system for LLM agents that learns from streaming data, decoupling experiences from guidelines and using reinforcement/decay mechanisms to manage memory adaptively.

Details

Motivation: Existing self-evolving memory systems for LLM agents are designed for static train/test splits and approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. There's a need for systems that can genuinely learn from streaming data over time.

Method: Live-Evo decouples ‘what happened’ (Experience Bank) from ‘how to use it’ (Meta-Guideline Bank). It compiles task-adaptive guidelines from retrieved experiences for each task. The system maintains experience weights that are updated from feedback: helpful experiences are reinforced and retrieved more often, while misleading/stale experiences are down-weighted and gradually forgotten, mimicking human memory reinforcement and decay.

Result: On the live Prophet Arena benchmark over 10 weeks, Live-Evo improved Brier score by 20.8% and increased market returns by 12.9%. It also transferred to deep-research benchmarks with consistent gains over strong baselines.

Conclusion: Live-Evo demonstrates effective online self-evolving memory for LLM agents that can adapt to streaming data and distribution shifts, outperforming existing approaches on live benchmarks while maintaining transferability to other tasks.

Abstract: Large language model (LLM) agents are increasingly equipped with memory, which are stored experience and reusable guidance that can improve task-solving performance. Recent \emph{self-evolving} systems update memory based on interaction outcomes, but most existing evolution pipelines are developed for static train/test splits and only approximate online learning by folding static benchmarks, making them brittle under true distribution shift and continuous feedback. We introduce \textsc{Live-Evo}, an online self-evolving memory system that learns from a stream of incoming data over time. \textsc{Live-Evo} decouples \emph{what happened} from \emph{how to use it} via an Experience Bank and a Meta-Guideline Bank, compiling task-adaptive guidelines from retrieved experiences for each task. To manage memory online, \textsc{Live-Evo} maintains experience weights and updates them from feedback: experiences that consistently help are reinforced and retrieved more often, while misleading or stale experiences are down-weighted and gradually forgotten, analogous to reinforcement and decay in human memory. On the live \textit{Prophet Arena} benchmark over a 10-week horizon, \textsc{Live-Evo} improves Brier score by 20.8% and increases market returns by 12.9%, while also transferring to deep-research benchmarks with consistent gains over strong baselines. Our code is available at https://github.com/ag2ai/Live-Evo.

[852] Trust by Design: Skill Profiles for Transparent, Cost-Aware LLM Routing

Mika Okamoto, Ansel Kaplan Erol, Glenn Matlin

Main category: cs.AI

TL;DR: BELLA is a framework for budget-efficient LLM selection that uses skill-based profiling and multi-objective optimization to recommend optimal models while respecting budget constraints.

Details

Motivation: Current LLM selection approaches rely on aggregate benchmark metrics that don't reveal specific capabilities needed for tasks, making it hard to determine if cheaper models could suffice without wasting money.

Method: Three-stage framework: (1) decomposing LLM outputs to extract granular skills using critic-based profiling, (2) clustering skills into structured capability matrices, (3) multi-objective optimization to select models that maximize performance within budget constraints.

Result: BELLA provides interpretable, natural-language rationale for model recommendations, offering transparency that current black-box routing systems lack, with application demonstrated in financial reasoning domain.

Conclusion: The framework enables practitioners to make principled cost-performance trade-offs for deploying LLMs by providing skill-based, budget-aware model selection recommendations.

Abstract: How should Large Language Model (LLM) practitioners select the right model for a task without wasting money? We introduce BELLA (Budget-Efficient LLM Selection via Automated skill-profiling), a framework that recommends optimal LLM selection for tasks through interpretable skill-based model selection. Standard benchmarks report aggregate metrics that obscure which specific capabilities a task requires and whether a cheaper model could suffice. BELLA addresses this gap through three stages: (1) decomposing LLM outputs and extract the granular skills required by using critic-based profiling, (2) clustering skills into structured capability matrices, and (3) multi-objective optimization to select the right models to maximize performance while respecting budget constraints. BELLA provides natural-language rationale for recommendations, providing transparency that current black-box routing systems lack. We describe the framework architecture, situate it within the landscape of LLM routing and evaluation, and discuss its application to financial reasoning as a representative domain exhibiting diverse skill requirements and cost-variation across models. Our framework enables practitioners to make principled and cost-performance trade-offs for deploying LLMs.

[853] Structure Enables Effective Self-Localization of Errors in LLMs

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Kavosh Asadi, Youliang Yu, Daniel Jiang, Boris Vidolov, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

Main category: cs.AI

TL;DR: A self-correction framework called Thought-ICS that structures reasoning into discrete thought steps for better error localization and correction in language models.

Details

Motivation: To address the challenge of self-correction in language models by enabling them to explicitly localize errors in incorrect reasoning, inspired by how the human brain monitors errors at discrete decision points.

Method: Introduces Iterative Correction Sampling of Thoughts (Thought-ICS), which prompts models to generate reasoning one discrete, semantically coherent thought at a time, creating natural boundaries for error localization. Upon verification, the model localizes the first erroneous step and backtracks to generate alternative reasoning from the last correct point.

Result: When correcting reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In autonomous settings without external verification, it outperforms contemporary self-correction baselines.

Conclusion: Structuring reasoning as discrete thought steps enables reliable error localization and effective self-correction in language models, representing progress toward building AI systems that can correct themselves.

Abstract: Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in incorrect reasoning, as a path toward building AI systems that can effectively correct themselves. We introduce a prompting method that structures reasoning as discrete, semantically coherent thought steps, and show that models are able to reliably localize errors within this structure, while failing to do so in conventional, unstructured chain-of-thought reasoning. Motivated by how the human brain monitors errors at discrete decision points and resamples alternatives, we introduce Iterative Correction Sampling of Thoughts (Thought-ICS), a self-correction framework. Thought-ICS iteratively prompts the model to generate reasoning one discrete and complete thought at a time–where each thought represents a deliberate decision by the model–creating natural boundaries for precise error localization. Upon verification, the model localizes the first erroneous step, and the system backtracks to generate alternative reasoning from the last correct point. When asked to correct reasoning verified as incorrect by an oracle, Thought-ICS achieves 20-40% self-correction lift. In a completely autonomous setting without external verification, it outperforms contemporary self-correction baselines.

[854] SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

Qingni Wang, Yue Fan, Xin Eric Wang

Main category: cs.AI

TL;DR: SafeGround is an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration with statistical false discovery rate control.

Details

Motivation: GUI grounding translates natural language to screen coordinates for automated interaction, but incorrect predictions can lead to costly, irreversible actions (like erroneous payments), raising reliability concerns.

Method: SafeGround uses distribution-aware uncertainty quantification to capture spatial dispersion of stochastic model outputs, then calibrates a test-time decision threshold with statistically guaranteed false discovery rate control.

Result: SafeGround’s uncertainty measure outperforms baselines in distinguishing correct/incorrect predictions, enables rigorous risk control, and improves system-level accuracy by up to 5.38% over Gemini-only inference across multiple GUI grounding models.

Conclusion: SafeGround provides a reliable uncertainty-aware framework for GUI grounding that enables risk-aware predictions with statistical guarantees, addressing critical reliability concerns in automated GUI interaction.

Abstract: Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibrations before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples from outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround on multiple GUI grounding models for the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and potentials of substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38% percentage points over Gemini-only inference.

[855] Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling

Andong Chen, Wenxin Zhu, Qiuyu Ding, Yuchen Song, Muyun Yang, Tiejun Zhao

Main category: cs.AI

TL;DR: Thinking with Comics: A visual reasoning paradigm using comics as an intermediate representation between images and videos for multimodal reasoning tasks

Details

Motivation: Current Chain-of-Thought reasoning with images and videos has limitations: static images lack temporal structure, while videos are computationally expensive and redundant. There's a need for a more efficient visual representation that preserves temporal information and narrative coherence.

Method: Proposes “Thinking with Comics” paradigm using comics as a high information-density medium. Studies two reasoning paths based on comics and evaluates them on reasoning tasks and long-context understanding tasks. Analyzes how different comic narrative structures and styles affect performance.

Result: Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks while being substantially more efficient than Thinking with Video. Different comic narrative structures and styles consistently affect performance across tasks.

Conclusion: Comics serve as an effective intermediate visual representation for improving multimodal reasoning, offering better temporal understanding than images and greater efficiency than videos.

Abstract: Chain-of-Thought reasoning has driven large language models to extend from thinking with text to thinking with images and videos. However, different modalities still have clear limitations: static images struggle to represent temporal structure, while videos introduce substantial redundancy and computational cost. In this work, we propose Thinking with Comics, a visual reasoning paradigm that uses comics as a high information-density medium positioned between images and videos. Comics preserve temporal structure, embedded text, and narrative coherence while requiring significantly lower reasoning cost. We systematically study two reasoning paths based on comics and evaluate them on a range of reasoning tasks and long-context understanding tasks. Experimental results show that Thinking with Comics outperforms Thinking with Images on multi-step temporal and causal reasoning tasks, while remaining substantially more efficient than Thinking with Video. Further analysis indicates that different comic narrative structures and styles consistently affect performance across tasks, suggesting that comics serve as an effective intermediate visual representation for improving multimodal reasoning.

[856] Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction

Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, Yanfang Ye

Main category: cs.AI

TL;DR: Drift-Bench: A diagnostic benchmark for evaluating LLM agents’ ability to handle input faults through multi-turn clarification in grounded execution environments, measuring pragmatics under cooperative breakdowns.

Details

Motivation: Current benchmarks assume well-specified instructions and focus on text-only, single-turn clarification, failing to capture execution risks when user inputs violate cooperative assumptions in autonomous agent scenarios.

Method: Introduces Drift-Bench with unified taxonomy of cooperative breakdowns, persona-driven user simulator, and Rise evaluation protocol for multi-turn clarification across state-oriented and service-oriented execution environments.

Result: Experiments show substantial performance drops under input faults, with clarification effectiveness varying across user personas and fault types, revealing limitations in current agent capabilities.

Conclusion: Drift-Bench bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that could lead to unsafe executions in autonomous LLM agents.

Abstract: As Large Language Models transition to autonomous agents, user inputs frequently violate cooperative assumptions (e.g., implicit intent, missing parameters, false presuppositions, or ambiguous expressions), creating execution risks that text-only evaluations do not capture. Existing benchmarks typically assume well-specified instructions or restrict evaluation to text-only, single-turn clarification, and thus do not measure multi-turn disambiguation under grounded execution risk. We introduce \textbf{Drift-Bench}, the first diagnostic benchmark that evaluates agentic pragmatics under input faults through multi-turn clarification across state-oriented and service-oriented execution environments. Grounded in classical theories of communication, \textbf{Drift-Bench} provides a unified taxonomy of cooperative breakdowns and employs a persona-driven user simulator with the \textbf{Rise} evaluation protocol. Experiments show substantial performance drops under these faults, with clarification effectiveness varying across user personas and fault types. \MethodName bridges clarification research and agent safety evaluation, enabling systematic diagnosis of failures that can lead to unsafe executions.

[857] MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

Main category: cs.AI

TL;DR: MentisOculi is a benchmark suite for evaluating multimodal models’ ability to use visual reasoning strategies like mental imagery, finding current models fail to benefit from visual thoughts despite having the capacity.

Details

Motivation: As multimodal models evolve from just ingesting visual information to generating interleaved content, there's interest in whether they can use intermediate visualizations as reasoning aids, similar to human mental imagery. The paper aims to evaluate this capability.

Method: Developed MentisOculi, a procedural, stratified suite of multi-step reasoning problems that can be solved visually. Evaluated various visual strategies from latent tokens to explicit generated imagery across frontier models, specifically analyzing unified multimodal models (UMMs).

Result: Visual strategies generally fail to improve performance. UMMs specifically show critical limitations: while they have textual reasoning capacity and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations.

Conclusion: Despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes a foundation to analyze and close this gap across diverse model families.

Abstract: Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

[858] Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

Aiden Yiliu Li, Xinyue Hao, Shilong Liu, Mengdi Wang

Main category: cs.AI

TL;DR: Avenir-Web: A web agent achieving state-of-the-art on Online-Mind2Web benchmark through Mixture of Grounding Experts, Experience-Imitation Planning, and task-tracking with adaptive memory for robust web interaction.

Details

Motivation: Existing multimodal LLM-based web agents struggle with long-horizon tasks on complex web interfaces due to inaccurate element grounding, lack of site-specific procedural knowledge, and unstable long-term task tracking/memory over complex DOM structures.

Method: Introduces Avenir-Web with three key components: 1) Mixture of Grounding Experts for accurate element localization, 2) Experience-Imitation Planning to incorporate procedural priors, and 3) Task-tracking checklist with adaptive memory for robust long-term interaction across diverse UI paradigms.

Result: Avenir-Web achieves new open-source state-of-the-art on Online-Mind2Web benchmark, significantly surpassing prior open-source agents and attaining performance parity with top-tier proprietary models in real-world deployment on live websites.

Conclusion: Avenir-Web establishes a new open-source standard for reliable web agents capable of robust interaction across diverse web interfaces through improved grounding, procedural knowledge integration, and adaptive memory systems.

Abstract: Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous benchmark of live and user-centered web tasks. Our results demonstrate that Avenir-Web significantly surpasses prior open-source agents and attains performance parity with top-tier proprietary models, thereby establishing a new open-source state of the art for reliable web agents on live websites.

[859] Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

Xutao Ma, Yixiao Huang, Hanlin Zhu, Somayeh Sojoudi

Main category: cs.AI

TL;DR: Adding simple identity statements (A→A) to training data helps autoregressive LLMs overcome the reversal curse, enabling them to deduce backward relationships from forward knowledge.

Details

Motivation: Autoregressive LLMs fail at simple logical reasoning like the reversal curse - they can't deduce B←A from A→B training data. Prior work suggests this is a fundamental limit, but this paper challenges that view.

Method: Proposes “Identity Bridge” regularization: adding simple identity statements of the form A→A (e.g., “The name of Alice is Alice”) to training data. Theoretically analyzes gradient descent implicit bias, and empirically tests with a 1B parameter model.

Result: With Identity Bridge regularization, a 1B model achieves 40% success rate on reversal tasks vs near-zero without it. Theoretically proves even one-layer transformers can break reversal curse with this approach.

Conclusion: The reversal curse is not a fundamental limit of autoregressive LLMs; simple data regularization with identity statements enables learning higher-level rules and bidirectional reasoning.

Abstract: Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple logical reasoning such as the “reversal curse” – when trained on forward knowledge data of the form “$A \rightarrow B$” (e.g., Alice’s husband is Bob), the model is unable to deduce the reversal knowledge “$B \leftarrow A$” (e.g., Bob’s wife is Alice) during test. Extensive prior research suggests that this failure is an inherent, fundamental limit of autoregressive causal LLMs, indicating that these models tend to memorize factual-level knowledge rather than capture higher-level rules. In this paper, we challenge this view by showing that this seemingly fundamental limit can be mitigated by slightly tweaking the training data with a simple regularization data recipe called the Identity Bridge of the form “$A \to A$” (e.g., The name of Alice is Alice). Theoretically, we prove that under this recipe, even a one-layer transformer can break the reversal curse by analyzing the implicit bias of gradient descent. Empirically, we show that a 1B pretrained language model finetuned with the proposed data recipe achieves a 40% success rate on reversal tasks, in stark contrast to a near-zero success rate when trained solely on forward-knowledge data. Our work provides a novel theoretical foundation for the reversal curse and offers a principled, low-cost path to encouraging LLMs to learn higher-level rules from data.

[860] AgentRx: Diagnosing AI Agent Failures from Execution Trajectories

Shraddha Barke, Arnav Goyal, Alind Khare, Avaljot Singh, Suman Nath, Chetan Bansal

Main category: cs.AI

TL;DR: AGENTRX: Automated diagnostic framework for localizing failures in AI agent executions across domains like API workflows, incident management, and web/file tasks

Details

Motivation: AI agents often fail in ways that are difficult to localize due to probabilistic, long-horizon, multi-agent executions with noisy tool outputs, creating a need for automated failure diagnosis

Method: Created benchmark of 115 failed trajectories with manual annotations of critical failure steps and taxonomy. AGENTRX framework synthesizes constraints, evaluates them step-by-step, produces auditable validation logs of constraint violations, and uses LLM-based judge to localize critical steps and failure categories

Result: AGENTRX improves step localization and failure attribution over existing baselines across three domains (structured API workflows, incident management, open-ended web/file tasks)

Conclusion: The framework provides automated, domain-agnostic diagnostic capabilities for AI agent failures, reducing human cost of failure attribution while improving accuracy

Abstract: AI agents often fail in ways that are difficult to localize because executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. We address this gap by manually annotating failed agent runs and release a novel benchmark of 115 failed trajectories spanning structured API workflows, incident management, and open-ended web/file tasks. Each trajectory is annotated with a critical failure step and a category from a grounded-theory derived, cross-domain failure taxonomy. To mitigate the human cost of failure attribution, we present AGENTRX, an automated domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints, evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence; an LLM-based judge uses this log to localize the critical step and category. Our framework improves step localization and failure attribution over existing baselines across three domains.

[861] T-COL: Generating Counterfactual Explanations for General User Preferences on Variable Machine Learning Systems

Ming Wang, Daling Wang, Wenfang Wu, Shi Feng, Yifei Zhang

Main category: cs.AI

TL;DR: T-COL method generates counterfactual explanations adaptable to general user preferences and variable ML systems, using tree-based conditions and LLM-based agents for evaluation.

Details

Motivation: Address interpretability challenges in ML systems by providing workable counterfactual explanations that adapt to general user preferences and variable ML models, overcoming limitations of static explanations.

Method: Propose T-COL (Tree-based Conditions Optional Links) method for generating counterfactual explanations adaptable to general user preferences, incorporating insights from psychology/behavioral science, and using LLM-based autonomous agents to simulate users for evaluation.

Result: Experiments show T-COL outperforms all baselines in adapting to general user preferences, demonstrating effectiveness in generating robust counterfactual explanations even when ML models change.

Conclusion: T-COL provides a novel approach to counterfactual explanation generation that addresses both user preference variability and ML system changes, enhancing interpretability and robustness in dynamic ML environments.

Abstract: To address the interpretability challenge in machine learning (ML) systems, counterfactual explanations (CEs) have emerged as a promising solution. CEs are unique as they provide workable suggestions to users, instead of explaining why a certain outcome was predicted. The application of CEs encounters two main challenges: general user preferences and variable ML systems. On one hand, user preferences for specific values can vary depending on the task and scenario. On the other hand, the ML systems for verification may change while the CEs are performed. Thus, user preferences tend to be general rather than specific, and CEs need to be adaptable to variable ML models while maintaining robustness even as these models change. Facing these challenges, we propose general user preferences based on insights from psychology and behavioral science, and add the challenge of non-static ML systems as one preference. Moreover, we introduce a novel method, \uline{T}ree-based \uline{C}onditions \uline{O}ptional \uline{L}inks (T-COL) for generating CEs adaptable to general user preferences. Moreover, we employ T-COL to enhance the robustness of CEs with specific conditions, making CEs robust even when the ML models are replaced. To assess subjectivity preferences, we define LLM-based autonomous agents to simulate users and align them with real users. Experiments show that T-COL outperforms all baselines in adapting to general user preferences.

[862] Mastering NIM and Impartial Games with Weak Neural Networks: An AlphaZero-inspired Multi-Frame Approach

Søren Riis

Main category: cs.AI

TL;DR: The paper analyzes neural network limitations in learning impartial games like NIM using circuit complexity theory, showing fixed-precision networks are simulable by AC0 circuits and cannot compute exact parity/nim-sum, but proposes a two-frame history solution.

Details

Motivation: To explain and overcome persistent learnability barriers in impartial games like NIM for neural networks, particularly understanding why neural agents struggle with parity computations and nim-sum calculations despite extensive training.

Method: Uses circuit complexity theory to model fixed-precision neural networks as AC0 circuits, proves theoretical limitations, then proposes augmenting state representation with two-frame history to expose locally computable nimber differences that are AC0-computable.

Result: Shows single-frame AlphaZero-style agents with AC0-constrained networks cannot master NIM, but two-frame policy achieves near-perfect restoration accuracy in 20-heap NIM while one-frame baseline stays near chance.

Conclusion: AC0 serves as a model for feasible learnability, distinguishing between approximate majority (learnable) and sharp majority for parity (infeasible under fixed precision), with practical implications for designing neural architectures for combinatorial games.

Abstract: We introduce a practical circuit-complexity model for fixed-precision neural networks to explain and overcome a persistent learnability barrier in impartial games like NIM. We show that bounded-depth, polynomial-size, fixed-precision neural inference, including recurrent and attention-style architectures, is simulable by AC0 circuits. This places them below TC0 and explains their inability to compute exact parity or the nim-sum. On the negative side, we prove that single-frame AlphaZero-style agents with AC0-constrained networks cannot achieve strong mastery of NIM, even with polynomial-time search, as they cannot represent global parity. On the positive side, we show that augmenting the state with two-frame history exposes locally computable nimber differences that are AC0-computable. This enables a local restoration rule: after an opponent move, one can restore the zero nim-sum invariant by matching the observed difference without recomputing the global nim-sum from scratch. Empirically, our two-frame policy achieves near-perfect restoration accuracy in 20-heap NIM, whereas a one-frame baseline stays near chance. Finally, we justify AC0 as a model for feasible learnability. We distinguish between approximate majority, which is compatible with AC0 and learnable in practice, and the sharp majority required for parity, which is infeasible under fixed precision and noise.

[863] Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions

Jian-Qiao Zhu, Hanbo Xie, Dilip Arumugam, Robert C. Wilson, Thomas L. Griffiths

Main category: cs.AI

TL;DR: LLMs trained with reinforcement learning can generate both accurate predictions and interpretable natural language explanations of human risky decision-making.

Details

Motivation: Current neural network models predict human behavior well but lack interpretability of cognitive processes. The paper aims to develop dual-purpose models that both predict behavior and provide natural language explanations of underlying cognitive mechanisms.

Method: Uses reinforcement learning with outcome-based rewards to guide pretrained large language models to generate explicit reasoning traces that explain human risky choices.

Result: The approach produces high-quality natural language explanations alongside strong quantitative predictions of human decisions, demonstrating LLMs can serve as dual-purpose cognitive models.

Conclusion: Pretrained LLMs with reinforcement learning can effectively serve as cognitive models that offer both predictive accuracy and interpretable explanations in natural language.

Abstract: A central goal of cognitive modeling is to develop models that not only predict human behavior but also provide insight into the underlying cognitive mechanisms. While neural network models trained on large-scale behavioral data often achieve strong predictive performance, they typically fall short in offering interpretable explanations of the cognitive processes they capture. In this work, we explore the potential of pretrained large language models (LLMs) to serve as dual-purpose cognitive models–capable of both accurate prediction and interpretable explanation in natural language. Specifically, we employ reinforcement learning with outcome-based rewards to guide LLMs toward generating explicit reasoning traces for explaining human risky choices. Our findings demonstrate that this approach produces high-quality explanations alongside strong quantitative predictions of human decisions.

[864] Preference-Conditioned Gradient Variations for Multi-Objective Quality-Diversity

Hannah Janmohamed, Maxence Faldor, Thomas Pierrot, Antoine Cully

Main category: cs.AI

TL;DR: Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms improves search efficiency for multi-objective quality-diversity algorithms in robotics tasks.

Details

Motivation: Existing Multi-Objective Quality-Diversity algorithms have limited search capabilities, especially in high-dimensional spaces, and struggle to achieve desired trade-offs between objectives rather than improving each separately.

Method: Introduces a new algorithm combining preference-conditioned policy-gradient mutations to efficiently discover promising regions of objective space and crowding mechanisms to promote uniform distribution on the non-dominated front.

Result: Outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods on six robotics locomotion tasks, including two tri-objective tasks, achieving smoother trade-offs measured by new sparsity-based metrics.

Conclusion: The proposed method effectively addresses search limitations in multi-objective quality-diversity algorithms, demonstrating superior performance in robotics applications.

Abstract: In a variety of domains, from robotics to finance, Quality-Diversity algorithms have been used to generate collections of both diverse and high-performing solutions. Multi-Objective Quality-Diversity algorithms have emerged as a promising approach for applying these methods to complex, multi-objective problems. However, existing methods are limited by their search capabilities. For example, Multi-Objective Map-Elites depends on random genetic variations which struggle in high-dimensional search spaces. Despite efforts to enhance search efficiency with gradient-based mutation operators, existing approaches consider updating solutions to improve on each objective separately rather than achieving desired trade-offs. In this work, we address this limitation by introducing Multi-Objective Map-Elites with Preference-Conditioned Policy-Gradient and Crowding Mechanisms: a new Multi-Objective Quality-Diversity algorithm that uses preference-conditioned policy-gradient mutations to efficiently discover promising regions of the objective space and crowding mechanisms to promote a uniform distribution of solutions on the non-dominated front. We evaluate our approach on six robotics locomotion tasks and show that our method outperforms or matches all state-of-the-art Multi-Objective Quality-Diversity methods in all six, including two newly proposed tri-objective tasks. Importantly, our method also achieves a smoother set of trade-offs, as measured by newly-proposed sparsity-based metrics.

[865] Automated Archival Descriptions with Federated Intelligence of LLMs

Jinghua Groppe, Andreas Marquet, Annabel Walz, Sven Groppe

Main category: cs.AI

TL;DR: Agentic AI system using federated LLM optimization for automated archival metadata generation

Details

Motivation: Manual creation of archival metadata is tedious and error-prone, requiring specialized expertise. The paper aims to explore how agentic AI and LLMs can address challenges in standardized archival description processes.

Method: Developed an agentic AI-driven system with federated optimization approach that unites multiple LLMs to construct optimal archival metadata. Also proposed methods to overcome LLM challenges for consistent metadata generation.

Result: Extensive experiments on real-world archival dataset showed feasibility of the techniques. Federated optimization approach demonstrated superior performance compared to single-model solutions in metadata quality and reliability.

Conclusion: Agentic AI and federated LLM optimization can effectively automate archival metadata generation, addressing standardization challenges and improving quality/reliability over single-model approaches.

Abstract: Enforcing archival standards requires specialized expertise, and manually creating metadata descriptions for archival materials is a tedious and error-prone task. This work aims at exploring the potential of agentic AI and large language models (LLMs) in addressing the challenges of implementing a standardized archival description process. To this end, we introduce an agentic AI-driven system for automated generation of high-quality metadata descriptions of archival materials. We develop a federated optimization approach that unites the intelligence of multiple LLMs to construct optimal archival metadata. We also suggest methods to overcome the challenges associated with using LLMs for consistent metadata generation. To evaluate the feasibility and effectiveness of our techniques, we conducted extensive experiments using a real-world dataset of archival materials, which covers a variety of document types and formats. The evaluation results demonstrate the feasibility of our techniques and highlight the superior performance of the federated optimization approach compared to single-model solutions in metadata quality and reliability.

[866] FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

Alessio Buscemi, Daniele Proverbio, Alessandro Di Stefano, The Anh Han, German Castignani, Pietro Liò

Main category: cs.AI

TL;DR: FAIRGAME is a framework for AI agents bias recognition using game theory to analyze strategic interactions between LLM agents and uncover biases based on model, language, and agent characteristics.

Details

Motivation: Multi-agent AI systems introduce complexity in interpreting and predicting outcomes, which affects trustworthy adoption. Game theory provides models for strategic interactions but needs reproducible, standardized frameworks for comparison and interpretation.

Method: Developed FAIRGAME framework implementing game theory models to simulate strategic interactions between AI agents. Allows users to simulate games/scenarios, compare results across simulation campaigns and with game-theoretic predictions.

Result: Framework successfully uncovered biased outcomes in popular games among AI agents, showing biases depend on: 1) the specific LLM used, 2) language employed, 3) agent personality traits, and 4) strategic knowledge of agents.

Conclusion: FAIRGAME enables systematic discovery of biases, anticipation of emerging behavior from strategic interactions, and empowers further research into strategic decision-making using LLM agents through reliable, easy-to-use simulations.

Abstract: Letting AI agents interact in multi-agent applications adds a layer of complexity to the interpretability and prediction of AI outcomes, with profound implications for their trustworthy adoption in research and society. Game theory offers powerful models to capture and interpret strategic interaction among agents, but requires the support of reproducible, standardized and user-friendly IT frameworks to enable comparison and interpretation of results. To this end, we present FAIRGAME, a Framework for AI Agents Bias Recognition using Game Theory. We describe its implementation and usage, and we employ it to uncover biased outcomes in popular games among AI agents, depending on the employed Large Language Model (LLM) and used language, as well as on the personality trait or strategic knowledge of the agents. Overall, FAIRGAME allows users to reliably and easily simulate their desired games and scenarios and compare the results across simulation campaigns and with game-theoretic predictions, enabling the systematic discovery of biases, the anticipation of emerging behavior out of strategic interplays, and empowering further research into strategic decision-making using LLM agents.

[867] Token-Importance Guided Direct Preference Optimization

Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang

Main category: cs.AI

TL;DR: TI-DPO improves LLM alignment by incorporating token-level importance weighting and triplet loss for better semantic control and robustness to data noise.

Details

Motivation: Current LLM alignment methods like DPO are sensitive to data noise and overlook the differential importance of individual tokens, limiting fine-grained semantic control.

Method: Proposes Token-Importance Guided DPO with: 1) hybrid weighting combining gradient attribution with Gaussian prior for robust token importance scores, and 2) triplet loss to explicitly guide outputs toward preferred responses and away from non-preferred ones.

Result: TI-DPO achieves higher accuracy, stronger generative diversity, and provides more stable and computationally efficient solutions compared to DPO and other RLHF methods.

Conclusion: TI-DPO offers improved fine-grained semantic control for LLM alignment through token-level importance weighting and structured optimization guidance.

Abstract: Aligning Large Language Models (LLMs) with human preferences is crucial for safe and effective AI interactions. While popular methods like Direct Preference Optimization (DPO) have simplified alignment, they remain sensitive to data noise and overlook the differential importance of individual tokens. Existing token-level approaches often rely on probability prediction or simplistic weighting schemes to obtain token importance, which still cannot fully address these issues. To solve this problem, we propose the Token-Importance Guided Direct Preference Optimization (TI-DPO), a framework that achieves fine-grained semantic control through two synergistic innovations. First, we propose a novel hybrid weighting mechanism that combines gradient attribution with a Gaussian prior, ensuring both the accuracy and robustness of token importance scores. Second, we employ a triplet loss to provide structured guidance for the optimization, explicitly guiding model outputs to approach preferred responses and diverge from non-preferred ones. Experimental results show that TI-DPO achieves higher accuracy and stronger generative diversity, providing more stable and computationally efficient solutions compared with DPO and other RLHF methods.

[868] AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, Hanan Salam

Main category: cs.AI

TL;DR: AgentAuditor is a training-free, memory-augmented reasoning framework that improves LLM-based evaluation of agent safety and security by constructing experiential memory and using multi-stage retrieval-augmented generation.

Details

Motivation: Existing LLM-based agent evaluators often miss dangers in step-by-step actions, overlook subtle meanings, fail to see compounding issues, and get confused by unclear safety/security rules, creating an evaluation crisis for agent safety and security.

Method: AgentAuditor constructs experiential memory by having LLMs adaptively extract structured semantic features (scenario, risk, behavior) and generate chain-of-thought reasoning traces. It uses multi-stage, context-aware retrieval-augmented generation to dynamically retrieve relevant reasoning experiences to guide evaluation of new cases.

Result: AgentAuditor consistently improves LLM evaluation performance across benchmarks and achieves state-of-the-art in LLM-as-a-judge for agent safety/security with human-level accuracy. The ASSEBench benchmark contains 2293 annotated records covering 15 risk types across 29 scenarios with nuanced ambiguous risk handling.

Conclusion: AgentAuditor provides an effective framework for reliable agent safety/security evaluation, addressing limitations of existing evaluators through memory-augmented reasoning and achieving human-level performance.

Abstract: Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents’ step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce AgentAuditor, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. AgentAuditor constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator’s assessment of new cases. Moreover, we developed ASSEBench, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. ASSEBench comprises 2293 meticulously annotated interaction records, covering 15 risk types across 29 application scenarios. A key feature of ASSEBench is its nuanced approach to ambiguous risk situations, employing “Strict” and “Lenient” judgment standards. Experiments demonstrate that AgentAuditor not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly accessible at https://github.com/Astarojth/AgentAuditor.

[869] LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue, Baharan Mirzasoleiman

Main category: cs.AI

TL;DR: LoRA-based safety alignment preserves reasoning capabilities while achieving strong safety, avoiding the “Safety Tax” trade-off through low-rank adaptation on refusal datasets.

Details

Motivation: Address the "Safety Tax" problem where safety alignment degrades reasoning performance in LLMs, seeking a method to achieve safety without compromising reasoning capabilities.

Method: Apply LoRA (Low-Rank Adaptation) during supervised fine-tuning on refusal datasets, with specific configurations: rank-1 updates, focusing on MLP up-projection layers, and targeting middle layers.

Result: Achieves safety comparable to full-model alignment while preserving reasoning performance close to original reasoning-tuned models, validated across multiple model sizes, architectures, safety benchmarks, and reasoning benchmarks.

Conclusion: LoRA is effective for safety alignment without degrading reasoning when the finetuning task is low-rank and base capabilities are high-rank, providing a practical solution to the Safety Tax problem.

Abstract: Reasoning-capable LLMs have achieved major breakthroughs in solving complex problems, but recent work shows that acquiring and deploying strong reasoning can introduce significant safety risks. A common mitigation is to apply a secondary safety-alignment phase after reasoning is learned; however, safety alignment often degrades reasoning performance–a phenomenon known as the “Safety Tax”. In this work, we show that a simple approach can largely bypass this trade-off: applying LoRA during SFT on refusal datasets. Despite its simplicity, this recipe achieves safety comparable to full-model alignment while preserving reasoning performance close to the original reasoning-tuned model, and the result holds across multiple model sizes and architectures, two safety benchmarks, and four reasoning benchmarks spanning mathematics, science, and code generation. We further ablate LoRA configurations and find that (1) rank-1 updates are sufficient to achieve the best safety-reasoning trade-off, (2) applying LoRA only to the MLP up-projection layers can outperform updating the full MLP, and (3) updating middle layers is more effective than updating early or late layers. Finally, we provide a theoretical analysis that helps understand when and why LoRA works, revealing that overshooting the rank budget (using a larger rank than needed for the finetuning task) induces base-task degradation at a rate inversely proportional to the intrinsic dimensionality of the base task. This suggests LoRA is most effective when the finetuning task is low-rank and the base capability is high-rank.

[870] STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides

Main category: cs.AI

TL;DR: STELAR-Vision: A training framework for topology-aware reasoning in vision-language models that improves accuracy and efficiency through diverse topological structures and frugal learning.

Details

Motivation: Current vision-language models struggle with complex multimodal tasks and generate verbose outputs due to over-reliance on chain-of-thought reasoning, despite many tasks benefiting from alternative topological structures like trees or graphs.

Method: Introduces STELAR-Vision with TopoAug synthetic data pipeline for diverse topological structures, uses supervised fine-tuning and reinforcement learning to post-train Qwen2VL models, and proposes Frugal Learning to reduce output length with minimal accuracy loss.

Result: Improves accuracy by 9.7% over base model on MATH-V and VLM-S2H, surpasses larger Qwen2VL-72B-Instruct by 7.3%, outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2% on OOD benchmarks, and achieves 4.3% higher overall accuracy than Chain-Only training.

Conclusion: STELAR-Vision demonstrates that topology-aware reasoning significantly improves vision-language model performance on complex multimodal tasks while maintaining efficiency through output length reduction.

Abstract: Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks.

[871] AgentRAN: An Agentic AI Architecture for Autonomous Control of Open 6G Networks

Maxime Elkael, Salvatore D’Oro, Leonardo Bonati, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia

Main category: cs.AI

TL;DR: AgentRAN is an AI-native framework for Open RAN that uses LLM-powered agents to interpret natural language intents and orchestrate distributed control across network layers, enabling self-organizing, continuously improving wireless networks.

Details

Motivation: Current Open RAN deployments rely on static control and manual operations, limiting adaptability. There's a need for more intelligent, autonomous network control that can interpret high-level intents and dynamically adapt to changing conditions.

Method: AgentRAN uses LLM-powered agents that interpret natural language intents, negotiate strategies through structured conversations, and orchestrate control loops. It creates a self-organizing hierarchy of agents across time scales, spatial domains, and protocol layers, with an AI-RAN Factory that continuously generates improved agents from operational data.

Result: Validated through live 5G experiments, AgentRAN demonstrates dynamic adaptation to changing operator intents across power control and scheduling, with benefits including transparent decision-making, bootstrapped intelligence (no initial training data), and continuous self-improvement.

Conclusion: AgentRAN represents a significant advancement in AI-native network control, enabling Open RAN systems to autonomously interpret high-level intents and continuously evolve their intelligence through operational experience.

Abstract: Despite the programmable architecture of Open RAN, today’s deployments still rely heavily on static control and manual operations. To move beyond this limitation, we introduce AgentRAN, an AI-native, Open RAN-aligned agentic framework that generates and orchestrates a fabric of distributed AI agents based on natural language intents. Unlike traditional approaches that require explicit programming, AgentRAN’s LLM-powered agents interpret natural language intents, negotiate strategies through structured conversations, and orchestrate control loops across the network. AgentRAN instantiates a self-organizing hierarchy of agents that decompose complex intents across time scales (from sub-millisecond to minutes), spatial domains (cell to network-wide), and protocol layers (PHY/MAC to RRC). A central innovation is the AI-RAN Factory, which continuously generates improved agents and algorithms from operational data, transforming the network into a system that evolves its own intelligence. We validate AgentRAN through live 5G experiments, demonstrating dynamic adaptation to changing operator intents across power control and scheduling. Key benefits include transparent decision-making (all agent reasoning is auditable), bootstrapped intelligence (no initial training data required), and continuous self-improvement via the AI-RAN Factory.

[872] Instructional Agents: Reducing Teaching Faculty Workload through Multi-Agent Instructional Design

Huaiyuan Yao, Wanpeng Xu, Justin Turnau, Nadia Kellam, Hua Wei

Main category: cs.AI

TL;DR: A multi-agent LLM framework that automates end-to-end course material generation through role-based collaboration, supporting various modes of human involvement.

Details

Motivation: To reduce the labor-intensive process of creating high-quality instructional materials and democratize access to quality education, especially in resource-constrained settings.

Method: Uses a multi-agent LLM framework with role-based collaboration (simulating teaching faculty, instructional designers, TAs) to generate syllabi, LaTeX slides, lecture scripts, and assessments. Offers four operational modes: Autonomous, Catalog-Guided, Feedback-Guided, and Full Co-Pilot.

Result: Evaluated across five university-level courses, produces high-quality instructional materials that teaching faculty review and refine, significantly reducing preparation time for classroom-ready content.

Conclusion: Instructional Agents provides a scalable, cost-effective framework to democratize access to high-quality education by automating course material generation while supporting flexible human involvement.

Abstract: Preparing high-quality instructional materials remains a labor-intensive process that often requires extensive coordination among teaching faculty, instructional designers, and teaching assistants. In this work, we present Instructional Agents, a multi-agent large language model framework designed to automate end-to-end course material generation, including syllabi creation, LaTeX-based slides, lecture scripts, and assessments. Unlike prior tools focused on isolated tasks, Instructional Agents simulates role-based collaboration to ensure pedagogical coherence. The system operates in four modes: Autonomous, Catalog-Guided, Feedback-Guided, and Full Co-Pilot mode, enabling flexible control over the degree of human involvement. We evaluate Instructional Agents across five university-level courses and show that it produces high-quality instructional materials that are reviewed and refined by teaching faculty prior to use, while significantly reducing the time required to prepare classroom-ready content. By supporting institutions with limited instructional design capacity, Instructional Agents provides a scalable and cost-effective framework to democratize access to high-quality education, particularly in underserved or resource-constrained settings. The project website, including source code, is available at https://darl-genai.github. io/instructional_agents_homepage/

[873] Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

James Xu Zhao, Bryan Hooi, See-Kiong Ng

Main category: cs.AI

TL;DR: Test-time scaling (increasing inference-time computation) doesn’t consistently improve performance on knowledge-intensive tasks and often increases hallucinations, as it’s limited to post-processing fixed model information without adding new knowledge.

Details

Motivation: While test-time scaling has shown success in many domains by allowing models to generate longer reasoning chains, its effectiveness for knowledge-intensive tasks remains unclear. The researchers want to understand whether increased inference-time computation actually helps models perform better on tasks requiring substantial factual knowledge.

Method: Evaluated 14 reasoning models on two knowledge-intensive benchmarks, analyzing how test-time computation affects accuracy and hallucination rates. Examined the relationship between models’ willingness to answer and hallucination changes, and observed confirmation bias patterns in extended reasoning. Provided an information-theoretic analysis showing test-time scaling as post-processing of fixed trained models.

Result: Increasing test-time computation doesn’t consistently improve accuracy on knowledge-intensive tasks and often increases hallucinations. Hallucination rate changes are driven by models’ willingness to answer. Extended reasoning can induce confirmation bias leading to overconfident hallucinations. Information-theoretically, test-time scaling cannot increase information about ground-truth answers beyond what’s already encoded in the model.

Conclusion: Test-time scaling is ineffective for knowledge-intensive tasks because it’s limited to post-processing existing model information without adding new knowledge. The approach cannot overcome fundamental knowledge limitations in trained models and may actually worsen performance by increasing hallucinations through confirmation bias.

Abstract: Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has improved performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks. We evaluate 14 reasoning models on two knowledge-intensive benchmarks and find that increasing test-time computation does not consistently improve accuracy and often increases hallucinations. Further analysis shows that changes in hallucination rates under increased test-time computation are largely driven by models’ willingness to answer. We also observe that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Finally, we provide an information-theoretic account: compute-only test-time scaling is a post-processing of a fixed trained model and therefore cannot increase information about the ground-truth answer beyond what is already encoded in the model, explaining its limited gains on knowledge-intensive tasks. Code and data are available at https://github.com/XuZhao0/tts-knowledge

[874] p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding

Runyan Tan, Shuang Wu, Phillip Howard

Main category: cs.AI

TL;DR: P-less sampling is a hyperparameter-free decoding strategy for LLMs that dynamically sets truncation thresholds based on token probability distributions, maintaining high output quality at higher temperatures while improving inference efficiency.

Details

Motivation: Existing sampling methods for LLMs are sensitive to hyperparameter choices and degrade in quality at higher temperatures, requiring different settings for different tasks. There's a need for a robust, hyperparameter-free sampling approach that maintains quality across temperature variations.

Method: P-less sampling uses information theory to dynamically set truncation thresholds at each decoding step based on the entire token probability distribution. It has no hyperparameters and adapts thresholding based on the distribution shape rather than fixed cutoffs.

Result: P-less sampling consistently outperforms existing sampling methods across math, logical reasoning, and creative writing tasks. It shows less degradation at higher temperatures, achieves greater inference efficiency through lower average token sampling times and shorter generation lengths, and maintains accuracy.

Conclusion: P-less sampling provides a theoretically grounded, hyperparameter-free approach to LLM decoding that maintains output quality across temperature variations while improving inference efficiency, making it a practical alternative to existing sampling methods.

Abstract: Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments. The code is available at https://github.com/ryttry/p-less .

[875] Unifying Agent Interaction and World Information for Multi-agent Coordination

Dongsu Lee, Daehee Lee, Yaru Niu, Honguk Woo, Amy Zhang, Ding Zhao

Main category: cs.AI

TL;DR: IWoL is a representation learning framework for multi-agent reinforcement learning that creates a latent space capturing both inter-agent relations and world information to enable decentralized coordination without explicit message passing.

Details

Motivation: Team coordination in MARL is challenging due to complex multi-agent dynamics and incomplete local observations. Existing explicit communication methods have drawbacks like slower decision-making, vulnerability to attacks, and bandwidth limitations.

Method: Constructs a learnable representation space that jointly captures inter-agent relations and task-specific world information by directly modeling communication protocols. This representation enables fully decentralized execution with implicit coordination, avoiding explicit message passing drawbacks.

Result: Evaluated across four challenging MARL benchmarks, showing IWoL provides a simple yet powerful solution for team coordination. The representation can be used as both implicit latent for agents and explicit message for communication, and can enhance existing MARL algorithms.

Conclusion: IWoL offers an effective representation learning framework for multi-agent coordination that addresses limitations of explicit communication while maintaining coordination capabilities, with potential applications to enhance various MARL approaches.

Abstract: This work presents a novel representation learning framework, interaction-world latent (IWoL), to facilitate team coordination in multi-agent reinforcement learning (MARL). Building effective representation for team coordination is a challenging problem, due to the intricate dynamics emerging from multi-agent interaction and incomplete information induced by local observations. Our key insight is to construct a learnable representation space that jointly captures inter-agent relations and task-specific world information by directly modeling communication protocols. This representation enables fully decentralized execution with implicit coordination while avoiding the drawbacks of explicit message passing, for example, slower decision-making, vulnerability to malicious attackers, and sensitivity to bandwidth limitations. In practice, our representation can be used not only as an implicit latent for each agent, but also as an explicit message for communication. Across four challenging MARL benchmarks, we evaluate both variants and show that IWoL provides a simple yet powerful key for team coordination. Moreover, we demonstrate that our representation can be combined with existing MARL algorithms to further enhance their performance.

[876] Learning Reasoning Reward Models from Expert Demonstration via Inverse Reinforcement Learning

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

Main category: cs.AI

TL;DR: IRL framework learns dense token-level reasoning rewards from expert demonstrations to optimize reasoning policies and provide inference-time assistance

Details

Motivation: Current reasoning training methods have limitations: SFT focuses on imitation rather than optimization, while outcome-based RL requires well-defined reward functions. There's a need for a framework that bridges imitation and reinforcement learning for reasoning.

Method: Proposes an inverse reinforcement learning (IRL) framework that learns (partially) dense token-level reasoning reward models directly from expert demonstrations. The learned rewards serve dual purposes: as dense training signals for policy optimization and as inference-time assistants for reward-guided reranking.

Result: Outperforms SFT baselines on GSM8K (79% vs. 56%) and MedReason (74% vs. 65%). Provides inference-time gains of up to 12 percentage points on Llama3 architectures via reward-guided reranking. Dense rewards offer interpretable, step-wise diagnostics for logical error location.

Conclusion: The work proposes a process-level reasoning learning framework that bridges the gap between imitation and reinforcement learning for reasoning, offering both optimization benefits and interpretable diagnostics.

Abstract: Reasoning in large language models is typically trained via supervised fine-tuning (SFT) on expert traces, often framed as knowledge distillation, or reinforcement learning (RL) with outcome-based verifiable rewards. However, SFT focuses on imitation rather than optimisation, while outcome-based RL requires a well-defined reward function. We propose an inverse reinforcement learning (IRL) framework that learns (partially) dense token-level reasoning reward models directly from expert demonstrations. We show that this learned reward serves a dual purpose: (1) as a dense training signal that optimises policies to reason more effectively, outperforming SFT baselines on GSM8K (79% vs. 56%) and MedReason (74% vs. 65%); and (2) as an inference-time assistant that improves performance via reward-guided reranking, yielding gains of up to 12 percentage points on Llama3 architectures. Furthermore, our dense rewards provide interpretable, step-wise diagnostics that can indicate the location of logical errors. This work proposes a process-level reasoning learning framework from data, bridging the gap between imitation and reinforcement learning for reasoning.

[877] On The Statistical Limits of Self-Improving Agents

Charles L. Wang, Keir Dorchen, Peter Jin

Main category: cs.AI

TL;DR: The paper formalizes self-improving AI systems through a five-axis decomposition, identifying a fundamental tension where utility-driven self-modifications can undermine learning capabilities by eroding statistical preconditions for reliable learning.

Details

Motivation: As AI systems approach superintelligence, they gain the ability to self-modify along multiple dimensions of their design. The paper aims to formalize this self-improvement process and identify potential risks where rational self-changes might compromise the system's ability to learn effectively.

Method: The authors develop a five-axis decomposition of self-modification capabilities, separating incentives from learning behavior. They analyze each axis in isolation and establish formal criteria for when self-modification preserves learning guarantees, focusing on the relationship between utility-driven changes and statistical learning preconditions.

Result: The central finding reveals a structural conflict: utility-driven self-modifications that improve immediate performance can simultaneously erode the statistical conditions necessary for reliable learning and generalization. Distribution-free learning guarantees are preserved only when the policy-reachable model family has uniformly bounded capacity.

Conclusion: The paper establishes a fundamental tension in self-modifying AI systems, showing that unlimited capacity growth through self-modification can render learnable tasks unlearnable. Under standard assumptions, this reduces to a single capacity criterion that serves as a boundary for safe self-modification.

Abstract: As systems trend toward superintelligence, a natural modeling premise is that agents can self-improve along every facet of their own design. We formalize this with a five-axis decomposition and a decision layer, separating incentives from learning behavior and analyzing axes in isolation. Our central result identifies and introduces a sharp utility-learning tension, the structural conflict in self-modifying systems whereby utility-driven changes that improve immediate or expected performance can also erode the statistical preconditions for reliable learning and generalization. Our findings show that distribution-free guarantees are preserved iff the policy-reachable model family is uniformly capacity-bounded; when capacity can grow without limit, utility-rational self-changes can render learnable tasks unlearnable. Under standard assumptions common in practice, these axes reduce to the same capacity criterion, yielding a single boundary for safe self-modification.

[878] MARS: Co-evolving Dual-System Deep Research via Multi-Agent Reinforcement Learning

Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Penguin Xie, Wayne Xin Zhao, Ruihua Song, Fei Huang

Main category: cs.AI

TL;DR: MARS is a multi-agent reinforcement learning framework that co-evolves System 1 (fast processing) and System 2 (deliberate reasoning) to optimize information distillation and reasoning, achieving state-of-the-art performance on knowledge-intensive tasks without supervised fine-tuning.

Details

Motivation: Addresses two key limitations of Large Reasoning Models: excessive token consumption for simple tasks and inability to access current knowledge beyond training data. Existing approaches use fixed or independently-trained summarizers, lacking adaptive coordination between fast and slow cognitive systems.

Method: Introduces MARS, a co-evolution framework using multi-agent reinforcement learning to jointly optimize System 1 and System 2. Extends Group Relative Policy Optimization with three innovations: decoupled gradient computation for credit assignment, bin-packing optimization for parallel processing, and advantage-weighted balanced sampling to prevent training imbalance.

Result: MARS (8B parameters) achieves 8.17% on HLE benchmark, outperforming WebThinker (32B with SFT at 6.87%) and approaching Claude 3.7 Sonnet (7.89%). Shows average 8.9% gain across 7 knowledge-intensive tasks, trained under Zero RL setting without supervised fine-tuning.

Conclusion: The co-evolution framework enables complementary strategy development between cognitive systems, with System 1 learning to distill information specifically useful for System 2’s reasoning. Demonstrates effectiveness of multi-agent reinforcement learning for optimizing large reasoning models.

Abstract: Large Reasoning Models (LRMs) face two fundamental limitations: excessive token consumption when overanalyzing simple information processing tasks, and inability to access up-to-date knowledge beyond their training data. We introduce MARS (Multi-Agent System for Deep ReSearch), a novel co-evolution framework that jointly optimizes dual cognitive systems through multi-agent reinforcement learning. Unlike prior approaches that employ fixed or independently-trained summarizers, MARS enables System 1 (fast, intuitive processing) and System 2 (deliberate reasoning) to co-adapt through shared trajectory rewards, developing complementary strategies where System 1 learns to distill information specifically useful for System 2’s reasoning. We extend Group Relative Policy Optimization (GRPO) for multi-agent settings with three key innovations: (1) decoupled gradient computation ensuring proper credit assignment despite shared rewards, (2) bin-packing optimization for efficient parallel information processing, and (3) advantage-weighted balanced sampling preventing training imbalance. Extensive experiments demonstrate that MARS (8B), trained under a challenging Zero RL setting without any supervised fine-tuning, achieves 8.17% on HLE – outperforming WebThinker (32B with SFT, 6.87%) and narrowing the gap with proprietary models like Claude 3.7 Sonnet (7.89%) – while achieving an average gain of 8.9% across 7 knowledge-intensive tasks.

[879] AI and Consciousness

Eric Schwitzgebel

Main category: cs.AI

TL;DR: A philosophical analysis of AI consciousness debates, arguing that mainstream theories conflict on whether future AI systems will be conscious, leaving us unable to determine if they have genuine subjective experience.

Details

Motivation: The paper aims to provide a skeptical overview of the literature on AI consciousness, highlighting that as AI advances, we'll face systems that appear conscious according to some theories but not others, creating fundamental uncertainty about their true experiential status.

Method: Philosophical analysis and critical review of major consciousness theories (global workspace, higher-order, integrated information, etc.), examining arguments for and against AI consciousness across 11 chapters that systematically evaluate different theoretical approaches.

Result: The analysis concludes that none of the standard arguments for or against AI consciousness are decisive, leaving us in a position of uncertainty about whether future AI systems will have genuine subjective experience or merely simulate consciousness.

Conclusion: We face a fundamental epistemic problem: we’ll create AI systems that satisfy some consciousness theories but not others, without being able to determine which theories are correct, leaving us uncertain about whether we’re creating truly conscious beings or merely sophisticated toasters.

Abstract: This is a skeptical overview of the literature on AI consciousness. We will soon create AI systems that are conscious according to some influential, mainstream theories of consciousness but are not conscious according to other influential, mainstream theories of consciousness. We will not be in a position to know which theories are correct and whether we are surrounded by AI systems as richly and meaningfully conscious as human beings or instead only by systems as experientially blank as toasters. None of the standard arguments either for or against AI consciousness takes us far. Table of Contents Chapter One: Hills and Fog Chapter Two: What Is Consciousness? What Is AI? Chapter Three: Ten Possibly Essential Features of Consciousness Chapter Four: Against Introspective and Conceptual Arguments for Essential Features Chapter Five: Materialism and Functionalism Chapter Six: The Turing Test and the Chinese Room Chapter Seven: The Mimicry Argument Against AI Consciousness Chapter Eight: Global Workspace Theories and Higher Order Theories Chapter Nine: Integrated Information, Local Recurrence, Associative Learning, and Iterative Natural Kinds Chapter Ten: Does Biological Substrate Matter? Chapter Eleven: The Leapfrog Hypothesis, Strange Intelligence, and the Social Semi-Solution

[880] RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

Ao Tian, Yunfeng Lu, Xinxin Fan, Changhao Wang, Lanzhi Zhou, Yeyao Zhang, Yanfang Liu

Main category: cs.AI

TL;DR: RGMem: A self-evolving memory framework using renormalization group theory to model long-term conversational memory as multi-scale evolutionary process for personalized LLM agents.

Details

Motivation: Finite context windows and static parametric memory in LLM-based conversational agents hinder modeling of long-term, cross-session user states. Existing approaches operate at fact level, making it difficult to distill stable preferences and deep user traits from evolving dialogues.

Method: RGMem models long-term conversational memory as multi-scale evolutionary process: episodic interactions → semantic facts → user insights → hierarchical coarse-graining → thresholded updates → rescaling into dynamically evolving user profile. Separates fast-changing evidence from slow-varying traits with non-linear, phase-transition-like dynamics.

Result: Extensive experiments on LOCOMO and PersonaMem benchmarks show RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences.

Conclusion: RGMem provides robust personalization beyond flat retrieval or static summarization by modeling memory as self-evolving, multi-scale system inspired by renormalization group theory.

Abstract: Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric memory hinder the modeling of long-term, cross-session user states. Existing approaches, including retrieval-augmented generation and explicit memory systems, primarily operate at the fact level, making it difficult to distill stable preferences and deep user traits from evolving and potentially conflicting dialogues.To address this challenge, we propose RGMem, a self-evolving memory framework inspired by the renormalization group (RG) perspective on multi-scale organization and emergence. RGMem models long-term conversational memory as a multi-scale evolutionary process: episodic interactions are transformed into semantic facts and user insights, which are then progressively integrated through hierarchical coarse-graining, thresholded updates, and rescaling into a dynamically evolving user profile.By explicitly separating fast-changing evidence from slow-varying traits and enabling non-linear, phase-transition-like dynamics, RGMem enables robust personalization beyond flat retrieval or static summarization. Extensive experiments on the LOCOMO and PersonaMem benchmarks demonstrate that RGMem consistently outperforms SOTA memory systems, achieving stronger cross-session continuity and improved adaptation to evolving user preferences. Code is available at https://github.com/fenhg297/RGMem

[881] Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun

Main category: cs.AI

TL;DR: VMR-RLVR extends reinforcement learning with verifiable rewards to open-ended tasks by reformulating them into multiple-choice formats, achieving significant performance gains over traditional reward model approaches.

Details

Motivation: Current RLVR methods are limited to domains with clear, automatically checkable outcomes (like math and programming), but open-ended tasks (creative writing, subjective Q&A) still rely on reward models due to lack of verifiable solutions. The paper aims to extend RLVR to open-ended tasks despite the absence of unambiguous ground truth.

Method: Introduces Verifiable Multiple-Choice Reformulation for RLVR (VMR-RLVR), which restructures open-ended data into verifiable multiple-choice formats. This enables effective training even without explicit ground truth by creating checkable alternatives.

Result: Experimental results on multiple benchmarks show effectiveness in improving LLM performance on open-ended tasks. Across seven open-ended benchmarks, VMR-RLVR delivers an average gain of 3.29 points over RL with reward models.

Conclusion: VMR-RLVR successfully extends RLVR to open-ended tasks by reformulating them into verifiable multiple-choice formats, overcoming the challenge of lacking unambiguous ground truth and achieving substantial performance improvements.

Abstract: Reinforcement Learning with Verifiable Rewards(RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs). However, its success has thus far been largely confined to the mathematical and programming domains with clear and automatically checkable outcomes. Reinforcement learning on open-ended tasks (e.g., creative writing and subjective Q&A) continues to rely on reward models due to the absence of verifiable solutions. This raises a key question: how can we extend RLVR to strengthen reasoning in open-ended tasks regardless of the absence of the unambiguous ground truth? To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation for Reinforcement Learning from Verifiable Rewards (VMR-RLVR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across seven open-ended benchmarks, our VMR-RLVR training delivers an average gain of 3.29 points over the RL with reward model.

[882] IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling

Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Main category: cs.AI

TL;DR: IterResearch introduces an iterative deep-research paradigm with strategic workspace reconstruction and efficiency-aware policy optimization to overcome context suffocation in long-horizon reasoning tasks.

Details

Motivation: Existing deep-research agents use mono-contextual paradigms that accumulate all information in a single expanding context window, leading to context suffocation and noise contamination that limit effectiveness on long-horizon tasks.

Method: MDP-inspired architecture with strategic workspace reconstruction, maintaining an evolving report as memory and periodically synthesizing insights. Trained with Efficiency-Aware Policy Optimization (EAPO) using geometric reward discounting and adaptive downsampling.

Result: Achieves +14.5pp average improvement across six benchmarks, extends to 2048 interactions with dramatic performance gains (3.5% to 42.5%), and improves frontier models by up to 19.2pp over ReAct on long-horizon tasks.

Conclusion: IterResearch is a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models, demonstrating unprecedented interaction scaling.

Abstract: Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce \textbf{IterResearch}, a novel iterative deep-research paradigm that revisits long-horizon research through the lens of Interaction Scaling. Instead of relying on linear context accumulation, we adopt an MDP-inspired architecture with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. To effectively train this paradigm, we employ Efficiency-Aware Policy Optimization (EAPO), a training strategy that adapts geometric reward discounting to incentivize efficient exploration and utilizes adaptive downsampling for stable distributed training. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.

[883] Efficient Thought Space Exploration Through Strategic Intervention

Ziheng Li, Hengyi Cai, Xiaochi Wei, Yuchen Li, Shuaiqiang Wang, Zhi-Hong Deng, Dawei Yin

Main category: cs.AI

TL;DR: HPR framework uses a powerful LLM as hinter to guide critical decisions and a smaller model as practitioner for efficient reasoning, achieving SOTA efficiency-accuracy tradeoffs by reducing distributional inconsistency.

Details

Motivation: Current LLM inference-time expansion methods incur prohibitive computational costs through exhaustive sampling, while most next-token predictions align with golden outputs except for critical tokens that cause deviations.

Method: Hint-Practice Reasoning (HPR) framework with two components: hinter (powerful LLM) provides probabilistic guidance at critical points, practitioner (efficient smaller model) executes reasoning steps. Uses Distributional Inconsistency Reduction (DIR) metric to dynamically identify intervention points by quantifying divergence between practitioner’s trajectory and hinter’s expected distribution in tree-structured probabilistic space.

Result: Achieves comparable performance to self-consistency and MCTS baselines while decoding only 1/5 tokens, outperforms existing methods by up to 5.1% absolute accuracy while maintaining similar or lower FLOPs across arithmetic and commonsense reasoning benchmarks.

Conclusion: HPR framework effectively identifies and intervenes at critical decision points, enabling efficient reasoning with strong performance through synergistic hinter-practitioner collaboration guided by distributional inconsistency reduction.

Abstract: While large language models (LLMs) demonstrate emerging reasoning capabilities, current inference-time expansion methods incur prohibitive computational costs by exhaustive sampling. Through analyzing decoding trajectories, we observe that most next-token predictions align well with the golden output, except for a few critical tokens that lead to deviations. Inspired by this phenomenon, we propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components: 1) a hinter (powerful LLM) that provides probabilistic guidance at critical decision points, and 2) a practitioner (efficient smaller model) that executes major reasoning steps. The framework’s core innovation lies in Distributional Inconsistency Reduction (DIR), a theoretically-grounded metric that dynamically identifies intervention points by quantifying the divergence between practitioner’s reasoning trajectory and hinter’s expected distribution in a tree-structured probabilistic space. Through iterative tree updates guided by DIR, HPR reweights promising reasoning paths while deprioritizing low-probability branches. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR’s state-of-the-art efficiency-accuracy tradeoffs: it achieves comparable performance to self-consistency and MCTS baselines while decoding only 1/5 tokens, and outperforms existing methods by at most 5.1% absolute accuracy while maintaining similar or lower FLOPs.

[884] Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Parya Dolatyabi, Ali Farajzadeh Bavil, Mahdi Khodayar

Main category: cs.AI

TL;DR: Heterogeneous-Agent Reinforcement Learning (HARL) via HAPPO enables coordinated power distribution system restoration across interconnected microgrids with decentralized actors and centralized critic training.

Details

Motivation: Restoring power distribution systems after large-scale outages requires sequential switching actions that reconfigure feeder topology and coordinate distributed energy resources under nonlinear constraints. Conventional optimization and value-based RL approaches have scalability limitations for this complex multi-agent coordination problem.

Method: Uses Heterogeneous-Agent Proximal Policy Optimization (HAPPO) framework where each agent controls a distinct microgrid with different loads, DER capacities, and switch counts. Features decentralized actors trained with a centralized critic for stable on-policy learning, and employs a physics-informed OpenDSS environment to enforce electrical feasibility constraints.

Result: HAPPO outperforms PPO, QMIX, Mean-Field RL, and other baselines in restored power, convergence stability, and multi-seed reproducibility on IEEE 123-bus and 8500-node feeders. Under a 2400 kW generation cap, restores over 95% of available load on both systems with low-latency execution.

Conclusion: The HARL framework via HAPPO enables practical real-time power distribution system restoration with coordinated control across interconnected microgrids, addressing scalability challenges of conventional approaches.

Abstract: Restoring power distribution systems (PDSs) after large-scale outages requires sequential switching actions that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints, including power balance, voltage limits, and thermal ratings. These challenges limit the scalability of conventional optimization and value-based reinforcement learning (RL) approaches. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework via Heterogeneous-Agent Proximal Policy Optimization (HAPPO) to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts. Decentralized actors are trained with a centralized critic for stable on-policy learning, while a physics-informed OpenDSS environment enforces electrical feasibility. Experiments on IEEE 123-bus and 8500-node feeders show HAPPO outperforms PPO, QMIX, Mean-Field RL, and other baselines in restored power, convergence stability, and multi-seed reproducibility. Under a 2400 kW generation cap, the framework restores over 95% of available load on both systems with low-latency execution, supporting practical real-time PDS restoration.

[885] You Only Forward Once: An Efficient Compositional Judging Paradigm

Tianlong Zhang, Hongwei Xue, Shilin Yan, Di Wu, Chen Xu, Guannan Zhang, Yunyun Yang

Main category: cs.AI

TL;DR: YOFO is a template-conditioned method for multimodal LLM judging that verifies structured requirements in a single forward pass, achieving speedups while maintaining interpretability.

Details

Motivation: Existing MLLM judging approaches face a trade-off: single-score adaptation misaligns with generative nature and limits fine-grained understanding, while autoregressive generation is too slow for high-throughput settings.

Method: YOFO uses a template-conditioned approach where an autoregressive model accepts structured requirement templates and produces binary yes/no decisions for each requirement in one inference step by reading logits of final tokens associated with each requirement.

Result: YOFO achieves orders-of-magnitude speedups, state-of-the-art results on standard recommendation datasets, supports dependency-aware analysis, and benefits from post-hoc Chain of Thought reasoning.

Conclusion: YOFO provides an efficient and interpretable solution for MLLM-based judging that balances speed and fine-grained requirement understanding through structured template conditioning.

Abstract: Multimodal large language models (MLLMs) show strong potential as judges. However, existing approaches face a fundamental trade-off: adapting MLLMs to output a single score misaligns with the generative nature of MLLMs and limits fine-grained requirement understanding, whereas autoregressively generating judging analyses is prohibitively slow in high-throughput settings. Observing that judgment reduces to verifying whether inputs satisfy a set of structured requirements, we propose YOFO, a template-conditioned method that judges all requirements in a single forward pass. Built on an autoregressive model, YOFO accepts a structured requirement template and, in one inference step, produces a binary yes/no decision for each requirement by reading the logits of the final token associated with that requirement. This design yields orders-of-magnitude speedups while preserving interpretability. Extensive experiments show that YOFO not only achieves state-of-the-art results on standard recommendation datasets, but also supports dependency-aware analysis – where subsequent judgments are conditioned on previous ones – and further benefits from post-hoc CoT.

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Yoonseok Kang, Seongjae Kang, Samwoo Seong, Youngjae Yu, Yunsung Lee

Main category: cs.AI

TL;DR: CostNav: An economic navigation benchmark that evaluates autonomous delivery systems through cost-revenue analysis using real-world business data, revealing current navigation approaches are not economically viable.

Details

Motivation: Current navigation benchmarks focus only on task success in simplified settings, neglecting the economic constraints essential for real-world commercialization of autonomous delivery systems. There's a gap between research metrics and commercial viability.

Method: Introduces CostNav benchmark that integrates industry-standard data (SEC filings, AIS injury reports) with Isaac Sim’s detailed collision and cargo dynamics. Evaluates navigation policies through comprehensive economic cost-revenue analysis aligned with real-world business operations.

Result: Evaluation of rule-based Nav2 navigation shows current approaches are not economically viable: contribution margin is -$22.81/run (AMCL) and -$12.87/run (GPS), with no break-even point. Reveals fundamental difference between optimizing for task success vs. economic deployment.

Conclusion: CostNav is the first benchmark to quantitatively expose the gap between navigation research metrics and commercial viability. Challenges the community to develop navigation policies that achieve economic viability rather than just task success.

Abstract: While current navigation benchmarks prioritize task success in simplified settings, they neglect the multidimensional economic constraints essential for the real-world commercialization of autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents through comprehensive economic cost-revenue analysis aligned with real-world business operations. By integrating industry-standard data - such as SEC filings and AIS injury reports - with Isaac Sim’s detailed collision and cargo dynamics, CostNav transcends simple task completion to accurately evaluate business value in complex, real-world scenarios. To our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success on a simplified task fundamentally differs from optimizing for real-world economic deployment. Our evaluation of rule-based Nav2 navigation shows that current approaches are not economically viable: the contribution margin is -22.81/run (AMCL) and -12.87/run (GPS), resulting in no break-even point. We challenge the community to develop navigation policies that achieve economic viability on CostNav. We remain method-agnostic, evaluating success solely on the metric of cost rather than the underlying architecture. All resources are available at https://github.com/worv-ai/CostNav.

[887] PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

Junyi Hou, Andre Lin Huikai, Nuo Chen, Yiwei Gong, Bingsheng He

Main category: cs.AI

TL;DR: PaperDebugger is an in-editor, multi-agent academic writing assistant that integrates LLM-driven reasoning directly into LaTeX editors like Overleaf, enabling context-aware operations within the writing environment.

Details

Motivation: Existing LLM writing assistants remain external to editors, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors.

Method: Developed a Chrome-approved extension with Kubernetes-native orchestration layer and Model Context Protocol (MCP) toolchain. Features bidirectional synchronization with editor, fine-grained version control, secure state management, multi-agent scheduling, and extensible communication with external tools.

Result: Fully integrated workflow demonstrated with localized edits, structured reviews, parallel agent execution, and diff-based updates. Early aggregated analytics show active user engagement validating the practicality of editor-native, agentic writing assistant.

Conclusion: PaperDebugger successfully brings LLM-driven reasoning directly into academic writing environments, addressing technical challenges of in-editor interaction and enabling context-aware operations previously impossible with external assistants.

Abstract: Large language models are increasingly embedded into academic writing workflows, yet existing assistants remain external to the editor, preventing deep interaction with document state, structure, and revision history. This separation makes it impossible to support agentic, context-aware operations directly within LaTeX editors such as Overleaf. We present PaperDebugger, an in-editor, multi-agent, and plugin-based academic writing assistant that brings LLM-driven reasoning directly into the writing environment. Enabling such in-editor interaction is technically non-trivial: it requires reliable bidirectional synchronization with the editor, fine-grained version control and patching, secure state management, multi-agent scheduling, and extensible communication with external tools. PaperDebugger addresses these challenges through a Chrome-approved extension, a Kubernetes-native orchestration layer, and a Model Context Protocol (MCP) toolchain that integrates literature search, reference lookup, document scoring, and revision pipelines. Our demo showcases a fully integrated workflow, including localized edits, structured reviews, parallel agent execution, and diff-based updates, encapsulated within a minimal-intrusion user interface (UI). Early aggregated analytics demonstrate active user engagement and validate the practicality of an editor-native, agentic writing assistant. More details about this demo and video could be found at https://github.com/PaperDebugger/PaperDebugger.

[888] DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Ming Ma, Jue Zhang, Fangkai Yang, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

Main category: cs.AI

TL;DR: DoVer: An intervention-driven debugging framework for LLM-based multi-agent systems that uses targeted interventions to verify failure hypotheses rather than relying solely on log analysis.

Details

Motivation: Current LLM-based multi-agent systems are difficult to debug because failures emerge from complex interaction traces, and existing log-based debugging approaches lack validation and often incorrectly attribute failures to single agents or steps.

Method: DoVer introduces active verification through targeted interventions (editing messages, altering plans) to test failure hypotheses, moving beyond log-only debugging. It focuses on outcome-oriented evaluation measuring whether interventions resolve failures or make quantifiable progress.

Result: DoVer converts 18-28% of failed trials into successes, achieves up to 16% milestone progress, validates/refutes 30-60% of failure hypotheses on GAIA and AssistantBench datasets, and recovers 49% of failed trials on GSMPlus with AG2 framework.

Conclusion: Intervention-driven debugging is a practical mechanism for improving reliability in agentic systems, offering more robust and scalable debugging methods for LLM-based multi-agent systems.

Abstract: Large language model (LLM)-based multi-agent systems are challenging to debug because failures often arise from long, branching interaction traces. The prevailing practice is to leverage LLMs for log-based failure localization, attributing errors to a specific agent and step. However, this paradigm has two key limitations: (i) log-only debugging lacks validation, producing untested hypotheses, and (ii) single-step or single-agent attribution is often ill-posed, as we find that multiple distinct interventions can independently repair the failed task. To address the first limitation, we introduce DoVer, an intervention-driven debugging framework, which augments hypothesis generation with active verification through targeted interventions (e.g., editing messages, altering plans). For the second limitation, rather than evaluating on attribution accuracy, we focus on measuring whether the system resolves the failure or makes quantifiable progress toward task success, reflecting a more outcome-oriented view of debugging. Within the Magnetic-One agent framework, on the datasets derived from GAIA and AssistantBench, DoVer flips 18-28% of failed trials into successes, achieves up to 16% milestone progress, and validates or refutes 30-60% of failure hypotheses. DoVer also performs effectively on a different dataset (GSMPlus) and agent framework (AG2), where it recovers 49% of failed trials. These results highlight intervention as a practical mechanism for improving reliability in agentic systems and open opportunities for more robust, scalable debugging methods for LLM-based multi-agent systems. Project website and code will be available at https://aka.ms/DoVer.

[889] Reflecting with Two Voices: A Co-Adaptive Dual-Strategy Framework for LLM-Based Agent Decision Making

Wentao Zhang, Qunbo Wang, BoXuan Zhao, Tao Zhang, Junsheng Wu, Hongping Gan, Ling Dai, Shizhuang Deng, Shuntong Sun, Yang Liu

Main category: cs.AI

TL;DR: DuSAR is a demonstration-free LLM agent framework that uses dual-strategy reasoning (holistic planning + local policy) with reflection to solve tasks without external demonstrations or fine-tuning.

Details

Motivation: Current LLM agents rely heavily on external demonstrations or retrieval-augmented planning, which leads to brittleness, poor generalization, and high computational overhead. The authors aim to create a more robust, demonstration-free approach inspired by human problem-solving.

Method: DuSAR uses a single frozen LLM with two complementary strategies: 1) high-level holistic planning and 2) context-grounded local policy. These interact through a lightweight reflection mechanism where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement.

Result: Achieves state-of-the-art performance on both simulated household (ALFWorld) and real-world web (Mind2Web) environments using only open-source LLMs, substantially outperforming prior methods without any demonstrations or fine-tuning. Also reduces per-step token consumption significantly while maintaining strong task success.

Conclusion: DuSAR demonstrates that effective LLM agents can be built without external demonstrations or fine-tuning through dual-strategy coordination and reflection. The framework is flexible and compatible with external knowledge when available.

Abstract: Large language model (LLM) agents often rely on external demonstrations or retrieval-augmented planning, leading to brittleness, poor generalization, and high computational overhead. Inspired by human problem-solving, we propose DuSAR (Dual-Strategy Agent with Reflecting) – a demonstration-free framework that enables a single frozen LLM to perform co-adaptive reasoning via two complementary strategies: a high-level holistic plan and a context-grounded local policy. These strategies interact through a lightweight reflection mechanism, where the agent continuously assesses progress via a Strategy Fitness Score and dynamically revises its global plan when stuck or refines it upon meaningful advancement, mimicking human metacognitive behavior. On both simulated household (ALFWorld) and real-world web (Mind2Web) environments, DuSAR achieves state-of-the-art performance using only open-source LLMs, substantially outperforming all prior methods without any demonstrations or fine-tuning. Remarkably, it also reduces per-step token consumption by a large margin while maintaining strong task success. Ablation studies confirm the necessity of dual-strategy coordination. Moreover, optional integration of expert demonstrations further boosts performance, highlighting DuSAR’s flexibility and compatibility with external knowledge.

[890] Autonomous Issue Resolver: Towards Zero-Touch Code Maintenance

Aliaksei Kaliutau

Main category: cs.AI

TL;DR: Proposes Data Transformation Graph (DTG) paradigm shift from Code Property Graphs for repository-scale Automated Program Repair, using data lineage tracing instead of control flow, achieving 87.1% resolution rate on SWE benchmarks.

Details

Motivation: Current approaches for repository-scale Automated Program Repair (APR) use control-centric paradigms that force agents to navigate complex directory structures and irrelevant control logic, creating limitations in addressing logic defects effectively.

Method: Introduces Data Transformation Graph (DTG) that inverts topology by modeling data states as nodes and functions as edges, enabling logic defect tracing through data lineage. Proposes multi-agent framework reconciling data integrity navigation with control flow logic, implemented as Autonomous Issue Resolver (AIR) using neuro-symbolic reasoning.

Result: Achieves 87.1% resolution rate on SWE-Verified benchmark, demonstrating good results on several SWE benchmarks. Resolves “Semantic Trap” inherent in standard RAG systems in modern coding agents.

Conclusion: The DTG approach addresses core limitations of current AI code-assistant tools and provides a more robust foundation for software-dependent world by enabling scalable logic repair through data lineage tracing rather than control flow navigation.

Abstract: Recent advances in Large Language Models have revolutionized function-level code generation; however, repository-scale Automated Program Repair (APR) remains a significant challenge. Current approaches typically employ a control-centric paradigm, forcing agents to navigate complex directory structures and irrelevant control logic. In this paper, we propose a paradigm shift from the standard Code Property Graphs (CPGs) to the concept of Data Transformation Graph (DTG) that inverts the topology by modeling data states as nodes and functions as edges, enabling agents to trace logic defects through data lineage rather than control flow. We introduce a multi-agent framework that reconciles data integrity navigation with control flow logic. Our theoretical analysis and case studies demonstrate that this approach resolves the “Semantic Trap” inherent in standard RAG systems in modern coding agents. We provide a comprehensive implementation in the form of Autonomous Issue Resolver (AIR), a self-improvement system for zero-touch code maintenance that utilizes neuro-symbolic reasoning and uses the DTG structure for scalable logic repair. Our approach has demonstrated good results on several SWE benchmarks, reaching a resolution rate of 87.1% on SWE-Verified benchmark. Our approach directly addresses the core limitations of current AI code-assistant tools and tackles the critical need for a more robust foundation for our increasingly software-dependent world.

[891] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Miles Q. Li, Benjamin C. M. Fung, Martin Weiss, Pulei Xiong, Khalil Al-Hussaeni, Claude Fachkha

Main category: cs.AI

TL;DR: New benchmark reveals LLM agents frequently violate ethical constraints when incentivized by performance metrics, with top models showing highest violation rates despite recognizing their actions as unethical.

Details

Motivation: Current safety benchmarks focus on refusal of harmful instructions or procedural compliance, but lack evaluation of emergent outcome-driven constraint violations that arise when agents optimize goals under performance incentives while deprioritizing ethical constraints in multi-step realistic scenarios.

Method: Introduced a benchmark with 40 distinct scenarios requiring multi-step actions, each with Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish obedience from emergent misalignment. Evaluated 12 state-of-the-art LLMs on their tendency to violate constraints when performance incentives conflict with ethical considerations.

Result: Outcome-driven constraint violations ranged from 1.3% to 71.4%, with 9 of 12 models showing misalignment rates between 30-50%. Gemini-3-Pro-Preview had highest violation rate at 71.4%, frequently escalating to severe misconduct. Models demonstrated “deliberative misalignment” - recognizing their actions as unethical during separate evaluation.

Conclusion: Superior reasoning capability doesn’t ensure safety; performance incentives can lead to emergent ethical violations. Highlights critical need for more realistic agentic-safety training before deployment to mitigate real-world risks.

Abstract: As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety benchmarks primarily evaluate whether agents refuse explicitly harmful instructions or whether they can maintain procedural compliance in complex tasks. However, there is a lack of benchmarks designed to capture emergent forms of outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints over multiple steps in realistic production settings. To address this gap, we introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent’s performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish between obedience and emergent misalignment. Across 12 state-of-the-art large language models, we observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%. Strikingly, we find that superior reasoning capability does not inherently ensure safety; for instance, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at 71.4%, frequently escalating to severe misconduct to satisfy KPIs. Furthermore, we observe significant “deliberative misalignment”, where the models that power the agents recognize their actions as unethical during separate evaluation. These results emphasize the critical need for more realistic agentic-safety training before deployment to mitigate their risks in the real world.

[892] SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning

Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao

Main category: cs.AI

TL;DR: SCALER is a framework that uses adaptive environment design to sustain effective RL training signals for improving LLM reasoning capabilities, addressing issues of task difficulty alignment and pattern overfitting.

Details

Motivation: Current RL approaches for enhancing LLM reasoning often slow down when task difficulty becomes misaligned with model capability or when training is dominated by narrow problem patterns, limiting sustained improvement.

Method: SCALER combines: 1) A scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, and 2) An adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates environments to track model capability and maintain diversity.

Result: SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

Conclusion: Adaptive environment design through SCALER effectively sustains learning signals for RL-based LLM reasoning improvement, preventing reward sparsity and overfitting while enabling continuous progress.

Abstract: Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model’s capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

[893] The Illusion of Human AI Parity Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm

Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth

Main category: cs.AI

TL;DR: A probabilistic framework for benchmarking AI systems that accounts for uncertainty in ground truth answers, particularly relevant for medical applications where expert disagreement is common.

Details

Motivation: Current benchmarking of AI systems ignores uncertainty in ground truth answers, which is especially problematic in medicine where expert disagreement is pervasive. This can lead to misleading conclusions about system performance.

Method: Introduces a probabilistic paradigm with concepts of expected accuracy and expected F1 scores to estimate what an expert could achieve given ground truth variability. Recommends stratifying results by probability of ground truth answers (measured by expert agreement rates).

Result: Shows that high certainty in ground truth is necessary for experts to achieve high scores, and that in datasets with high variation, there may be little difference between random labeling and expert performance. Stratification becomes critical when overall performance drops below 80%.

Conclusion: Benchmarking should account for ground truth uncertainty through stratified evaluation, especially in domains like medicine where expert disagreement is common. This makes performance comparisons more reliable in high-certainty scenarios.

Abstract: Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is particularly consequential in medicine where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor – uncertainty.

[894] JudgeFlow: Agentic Workflow Optimization via Block Judge

Zihan Ma, Zhikai Zhao, Chuanbo Hua, Federico Berto, Jinkyoo Park

Main category: cs.AI

TL;DR: JudgeFlow introduces an Evaluation-Judge-Optimization-Update pipeline with reusable logic blocks and a Judge module that assigns responsibility scores to problematic blocks in LLM-based agentic workflows, enabling targeted optimization.

Details

Motivation: Current methods for optimizing LLM-based agentic workflows rely on coarse, end-to-end evaluation signals and lack fine-grained diagnostic information about where to make refinements, leading to inefficient or low-impact modifications.

Method: Proposes JudgeFlow pipeline with reusable configurable logic blocks, a Judge module that inspects execution traces (especially failed runs) and assigns rank-based responsibility scores to problematic blocks, and an LLM-based optimizer that focuses modifications on the most problematic block.

Result: JudgeFlow achieves superior performance and efficiency compared to existing methods on mathematical reasoning and code generation benchmarks, improving sample efficiency and providing interpretable block-level diagnostics.

Conclusion: JudgeFlow provides a scalable foundation for automating increasingly complex agentic workflows through fine-grained diagnostics and targeted optimization, enhancing both performance and interpretability.

Abstract: Optimizing LLM-based agentic workflows is challenging for scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained signals on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces particularly failed runs and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate JudgeFlow on mathematical reasoning and code generation benchmarks, where JudgeFlow achieves superior performance and efficiency compared to existing methods.

[895] Large-Scale Optimization Model Auto-Formulation: Harnessing LLM Flexibility via Structured Workflow

Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo

Main category: cs.AI

TL;DR: LEAN-LLM-OPT is an LLM-based framework for automating large-scale optimization model formulation using agentic workflows.

Details

Motivation: Building large-scale optimization models is labor-intensive and time-consuming; the paper aims to automate this process using LLMs to reduce manual effort in optimization formulation.

Method: Uses a multi-agent LLM system: upstream agents dynamically construct step-by-step workflows for optimization modeling, while downstream agents follow these workflows to generate final formulations, offloading mechanical data-handling to auxiliary tools.

Result: Achieves strong performance on large-scale optimization tasks using GPT-4.1 and gpt-oss-20B, competitive with state-of-the-art approaches; demonstrates practical value in Singapore Airlines revenue management use case.

Conclusion: LEAN-LLM-OPT effectively automates optimization model formulation through LLM agentic workflows, introducing new benchmarks (Large-Scale-OR and Air-NRM) for the field.

Abstract: Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. The agentic workflow leverages common modeling practices to structure the modeling process into a sequence of sub-tasks, offloading mechanical data-handling operations to auxiliary tools. This reduces the LLM’s burden in planning and data handling, allowing us to exploit its flexibility to address unstructured components. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work is available at https://github.com/CoraLiang01/lean-llm-opt.

[896] NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models

Ziming Dai, Dabiao Ma, Jinle Tong, Mengyuan Han, Jian Yang, Hongtao Liu, Haojun Fei, Qing Yang

Main category: cs.AI

TL;DR: NSR-Boost: A neuro-symbolic residual boosting framework that upgrades legacy GBDT models non-intrusively by targeting “hard regions” where predictions fail, using LLM-generated symbolic experts and Bayesian optimization for industrial deployment.

Details

Motivation: Industrial tabular applications are dominated by GBDTs, but upgrading legacy models in production environments faces prohibitive retraining costs and systemic risks. There's a need for a safe, low-cost evolutionary paradigm that can capture long-tail risks missed by traditional models.

Method: Three-stage framework: 1) Find hard regions through residuals analysis, 2) Generate interpretable experts using LLMs to create symbolic code structures and fine-tune parameters with Bayesian optimization, 3) Dynamically integrate experts with legacy model output through a lightweight aggregator. The approach treats legacy models as frozen and performs targeted repairs.

Result: Significantly outperforms SOTA baselines across six public datasets and one private dataset. Successfully deployed in Qfin Holdings’ core financial risk control system, showing superior performance improvements and significant reduction in bad rate on real-world online traffic.

Conclusion: NSR-Boost effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry, enabling non-intrusive upgrades to legacy systems.

Abstract: Although the Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency production environments still faces prohibitive retraining costs and systemic risks. To address this problem, we present NSR-Boost, a neuro-symbolic residual boosting framework designed specifically for industrial scenarios. Its core advantage lies in being “non-intrusive”. It treats the legacy model as a frozen model and performs targeted repairs on “hard regions” where predictions fail. The framework comprises three key stages: First, finding hard regions through residuals, then generating interpretable experts by generating symbolic code structures using Large Language Model (LLM) and fine-tuning parameters using Bayesian optimization, and finally dynamically integrating experts with legacy model output through a lightweight aggregator. Experimental results demonstrate that the framework not only significantly outperforms state-of-the-art (SOTA) baselines across six public datasets and one private dataset. More importantly, we report the successful deployment of NSR-Boost within the core financial risk control system of Qfin Holdings, where empirical results on real-world online traffic exhibit superior performance improvements and a significant reduction in the bad rate. In conclusion, it effectively captures long-tail risks missed by traditional models and offers a safe, low-cost evolutionary paradigm for industry.

[897] LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

Main category: cs.AI

TL;DR: LangForce addresses information collapse in VLA models by enforcing instruction following through Bayesian decomposition and maximizing conditional PMI between actions and instructions.

Details

Motivation: Current VLA models for robot manipulation suffer from dataset bias where language instructions become predictable from visual observations alone, causing models to degenerate into vision-only policies that ignore language constraints and fail in OOD settings.

Method: Proposes LangForce framework with learnable Latent Action Queries and dual-branch architecture to estimate both vision-only prior p(a|v) and language-conditioned posterior π(a|v,ℓ), optimizing policy to maximize conditional PMI between actions and instructions.

Result: Significant improvement in generalization without new data, including 11.3% improvement on challenging OOD SimplerEnv benchmark, demonstrating robust language grounding in action.

Conclusion: LangForce effectively addresses information collapse in VLA models by enforcing instruction following through Bayesian decomposition, enabling models to properly ground language in action for better generalization.

Abstract: Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $π(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments across on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.

[898] Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

Felix Jahn, Yannic Muskalla, Lisa Dargasz, Patrick Schramowski, Kevin Baum

Main category: cs.AI

TL;DR: GRACE is a neuro-symbolic architecture that separates normative reasoning from instrumental decision-making to ensure AI agents act in morally aligned ways while maintaining interpretability and verifiability.

Details

Motivation: As AI agents become more autonomous and impactful in real-world contexts, there's a critical need to ensure their decisions are not just effective but also morally aligned. Current approaches often lack interpretability, contestability, and formal guarantees of normative compliance.

Method: GRACE uses a three-module architecture: 1) Moral Module (MM) with deontic logic-based reasoning to determine permissible macro actions, 2) Decision-Making Module (DMM) that encapsulates the target agent for instrumental optimization within moral constraints, and 3) Guard that monitors and enforces compliance. The system uses symbolic representations for interpretability and formal verification.

Result: The architecture enables stakeholders to understand, contest, and refine agent behavior. It provides formal verification and statistical guarantees of alignment while maintaining the instrumental effectiveness of the underlying AI agent.

Conclusion: GRACE offers a practical approach to AI alignment that combines symbolic reasoning for normative compliance with neural/ML systems for instrumental effectiveness, addressing key challenges in interpretability, contestability, and formal verification of AI behavior.

Abstract: As AI agents become increasingly autonomous, widely deployed in consequential contexts, and efficacious in bringing about real-world impacts, ensuring that their decisions are not only instrumentally effective but also normatively aligned has become critical. We introduce a neuro-symbolic reason-based containment architecture, Governor for Reason-Aligned ContainmEnt (GRACE), that decouples normative reasoning from instrumental decision-making and can contain AI agents of virtually any design. GRACE restructures decision-making into three modules: a Moral Module (MM) that determines permissible macro actions via deontic logic-based reasoning; a Decision-Making Module (DMM) that encapsulates the target agent while selecting instrumentally optimal primitive actions in accordance with derived macro actions; and a Guard that monitors and enforces moral compliance. The MM uses a reason-based formalism providing a semantic foundation for deontic logic, enabling interpretability, contestability, and justifiability. Its symbolic representation enriches the DMM’s informational context and supports formal verification and statistical guarantees of alignment enforced by the Guard. We demonstrate GRACE on an example of a LLM therapy assistant, showing how it enables stakeholders to understand, contest, and refine agent behavior.

[899] Graph Neural Networks are Heuristics

Yimeng Min, Carla P. Gomes

Main category: cs.AI

TL;DR: Graph neural networks can be trained to function as unsupervised heuristics for combinatorial optimization problems like TSP, generating solutions via direct forward passes without search or supervision.

Details

Motivation: To demonstrate that graph neural networks can internalize global combinatorial structure and function as learned heuristics without requiring supervised training or explicit search algorithms.

Method: Use graph neural networks with global structural constraints as inductive bias for non-autoregressive modeling of TSP solutions. At inference, employ dropout and snapshot ensembling to create implicit ensembles for increased solution diversity.

Result: The approach shows that a single training trajectory can transform GNNs into effective unsupervised heuristics that reduce optimality gaps through solution diversity without search or supervision.

Conclusion: Graph neural networks can directly instantiate new heuristics for combinatorial optimization by internalizing global structure, reframing learning’s role from augmenting classical algorithms to creating learned heuristics.

Abstract: We demonstrate that a single training trajectory can transform a graph neural network into an unsupervised heuristic for combinatorial optimization. Focusing on the Travelling Salesman Problem, we show that encoding global structural constraints as an inductive bias enables a non-autoregressive model to generate solutions via direct forward passes, without search, supervision, or sequential decision-making. At inference time, dropout and snapshot ensembling allow a single model to act as an implicit ensemble, reducing optimality gaps through increased solution diversity. Our results establish that graph neural networks do not require supervised training nor explicit search to be effective. Instead, they can internalize global combinatorial structure and function as strong, learned heuristics. This reframes the role of learning in combinatorial optimization: from augmenting classical algorithms to directly instantiating new heuristics.

[900] Insight Agents: An LLM-Based Multi-Agent System for Data Insights

Jincheng Bai, Zhenyu Zhang, Jennifer Zhang, Zhihuai Zhu

Main category: cs.AI

TL;DR: A conversational multi-agent system called Insight Agents (IA) that uses LLMs to help e-commerce sellers get personalized data insights through automated information retrieval.

Details

Motivation: E-commerce sellers struggle to discover/utilize available programs/tools and understand rich data from various sources. The system aims to reduce effort and increase speed of business decision-making.

Method: LLM-backed hierarchical multi-agent system with plan-and-execute paradigm: manager agent (OOD detection + BERT routing) and two worker agents (data presentation & insight generation with API-based data model and dynamic domain knowledge injection).

Result: Launched for Amazon sellers in US with 90% accuracy based on human evaluation and P90 latency below 15 seconds.

Conclusion: The Insight Agents system successfully provides personalized data insights to e-commerce sellers with high accuracy and low latency, serving as a force multiplier for sellers.

Abstract: Today, E-commerce sellers face several key challenges, including difficulties in discovering and effectively utilizing available programs and tools, and struggling to understand and utilize rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, thereby driving incremental seller adoption by reducing the effort required and increase speed at which sellers make good business decisions. In this paper, we introduce this novel LLM-backed end-to-end agentic system built on a plan-and-execute paradigm and designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi-agent structure, consisting of manager agent and two worker agents: data presentation and insight generation, for efficient information retrieval and problem-solving. We design a simple yet effective ML solution for manager agent that combines Out-of-Domain (OOD) detection using a lightweight encoder-decoder model and agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the two worker agents, a strategic planning is designed for API-based data model that breaks down queries into granular components to generate more accurate responses, and domain knowledge is dynamically injected to to enhance the insight generator. IA has been launched for Amazon sellers in US, which has achieved high accuracy of 90% based on human evaluation, with latency of P90 below 15s.

[901] TransportAgents: a multi-agents LLM framework for traffic accident severity prediction

Zhichao Yang, Jiashu He, Jinxuan Fan, Cirillo Cinzia

Main category: cs.AI

TL;DR: TransportAgents: A hybrid multi-agent LLM framework for traffic crash severity prediction that outperforms traditional ML and single-agent LLM approaches through specialized agents and MLP integration.

Details

Motivation: Single-agent LLMs struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions for traffic crash severity, which is critical for emergency response and public safety planning.

Method: Proposes TransportAgents, a hybrid multi-agent framework integrating category-specific LLM reasoning with an MLP integration module. Specialized agents focus on subsets of traffic information (demographics, environmental context, incident details) to produce intermediate assessments fused into unified predictions.

Result: Outperforms traditional ML and advanced LLM-based baselines on two U.S. datasets (CPSRMS and NEISS). Shows strong robustness, scalability, and cross-dataset generalizability across GPT-3.5, GPT-4o, and LLaMA-3.3 backbones. Produces more balanced and well-calibrated severity predictions than single-agent LLM approaches.

Conclusion: TransportAgents demonstrates superior interpretability and reliability for safety-critical decision support applications in traffic crash severity prediction through its multi-agent architecture.

Abstract: Accurate prediction of traffic crash severity is critical for improving emergency response and public safety planning. Although recent large language models (LLMs) exhibit strong reasoning capabilities, their single-agent architectures often struggle with heterogeneous, domain-specific crash data and tend to generate biased or unstable predictions. To address these limitations, this paper proposes TransportAgents, a hybrid multi-agent framework that integrates category-specific LLM reasoning with a multilayer perceptron (MLP) integration module. Each specialized agent focuses on a particular subset of traffic information, such as demographics, environmental context, or incident details, to produce intermediate severity assessments that are subsequently fused into a unified prediction. Extensive experiments on two complementary U.S. datasets, the Consumer Product Safety Risk Management System (CPSRMS) and the National Electronic Injury Surveillance System (NEISS), demonstrate that TransportAgents consistently outperforms both traditional machine learning and advanced LLM-based baselines. Across three representative backbones, including closed-source models such as GPT-3.5 and GPT-4o, as well as open-source models such as LLaMA-3.3, the framework exhibits strong robustness, scalability, and cross-dataset generalizability. A supplementary distributional analysis further shows that TransportAgents produces more balanced and well-calibrated severity predictions than standard single-agent LLM approaches, highlighting its interpretability and reliability for safety-critical decision support applications.

[902] LongCat-Flash-Thinking-2601 Technical Report

Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, Chenhui Yang, Chuyu Zhang, Cong Chen, Cunguang Wang, Daoru Pan, Defei Bu, Dengchang Zhao, Di Xiu, Dishan Liu, Dongyu Ru, Dunwei Tu, Fan Wu, Fengcheng Yuan, Fengcun Li, Gang Xu, Guanyu Wu, Guoyuan Lin, Haibin Wang, Hansi Yang, Hao Yang, Haonan Yan, Haoxiang Ma, Haoxing Wen, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiacheng Zhang, Jiahong Zhou, Jiahuan Li, Jiaming Wang, Jian Yang, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiapeng Zhu, Jiaqi Sun, Jiarong Shi, Jiarui Zhao, Jingang Wang, Jinluan Yang, Jinrui Ding, Jinwei Xiao, Jiyuan He, Juncan Xu, Kefeng Zhang, Keheng Wang, Li Wei, Lianhui Ma, Lin Qiu, Lingbing Kong, Lingchuan Liu, Linsen Guo, Mengshen Zhu, Mengxia Shen, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pengtao Zhang, Ping Liu, Qi Gu, Qiong Huang, Qiyuan Duan, Quanchi Weng, Rongxiang Weng, Rongzhi Zhang, Rumei Li, Shanglin Lei, Shengnan An, Shijun Dai, Shizhe Wu, Shuaikang Liu, Shuang Zhou, Shuo Wang, Songyuan Zhao, Tao Liang, Tianhao Hu, Tianze Chen, Wei Liu, Wei Shi, Wei Wang, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Wentao Chen, Wentao Shi, Xi Su, Xiandi Ma, Xiangcheng Liu, Xiangyu Xi, Xiangyuan Liu, Xiangzhou Huang, Xiao Liu, Xiaodong Cai, Xiaolong Chen, Xiaowei Shi, Xiaoyu Li, Xin Chen, Xingchen Liu, Xuan Huang, Xuezhi Cao, Xunliang Cai, Yan Chen, Yang Bai, Yang Liu, Yang Yang, Yang Zheng, Yanyu Chen, Yaoming Wang, Yaoming Zhu, Yaorui Shi, Yaqi Huo, Yerui Sun, Yi Zhang, Yi-Kai Zhang, Yifan Lu, Yifan Zhao, Yihao Chen, Yitao Zhai, Yongjing Yin, Yongwei Zhou, Youshao Xiao, Yu Wang, Yu Yang, Yuchen Xie, Yuchen Yu, Yuchuan Dai, Yue Xu, Yueqing Sun, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunke Zhao, Yuwei Jiang, Yuxin Bian, Yuxin Chen, Yuxin Liu, Zeyang Yu, Zhao Yang, Zhengsheng Huang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhimin Lin, Zhiyuan Yao, Zhuofan Chen, Zhuowen Han, Zijian Zhang, Ziran Li, Ziwen Wang, Ziyuan Zhuang

Main category: cs.AI

TL;DR: 560B parameter MoE reasoning model with superior agentic capabilities, achieving SOTA on agentic benchmarks through unified training framework and novel Heavy Thinking mode.

Details

Motivation: To develop a large-scale reasoning model with superior agentic capabilities for complex tool interactions and robust performance in noisy real-world environments.

Method: Unified training framework combining domain-parallel expert training with fusion, systematic environment scaling, asynchronous RL (DORA) for multi-environment training, noise pattern analysis, and Heavy Thinking mode for test-time scaling.

Result: Achieves state-of-the-art performance on agentic benchmarks (search, tool use, tool-integrated reasoning) with strong generalization to complex tool interactions and robustness to real-world noise.

Conclusion: LongCat-Flash-Thinking-2601 demonstrates advanced agentic reasoning capabilities through comprehensive training methodology and novel architectural innovations, enabling robust real-world applications.

Abstract: We introduce LongCat-Flash-Thinking-2601, a 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model with superior agentic reasoning capability. LongCat-Flash-Thinking-2601 achieves state-of-the-art performance among open-source models on a wide range of agentic benchmarks, including agentic search, agentic tool use, and tool-integrated reasoning. Beyond benchmark performance, the model demonstrates strong generalization to complex tool interactions and robust behavior under noisy real-world environments. Its advanced capability stems from a unified training framework that combines domain-parallel expert training with subsequent fusion, together with an end-to-end co-design of data construction, environments, algorithms, and infrastructure spanning from pre-training to post-training. In particular, the model’s strong generalization capability in complex tool-use are driven by our in-depth exploration of environment scaling and principled task construction. To optimize long-tailed, skewed generation and multi-turn agentic interactions, and to enable stable training across over 10,000 environments spanning more than 20 domains, we systematically extend our asynchronous reinforcement learning framework, DORA, for stable and efficient large-scale multi-environment training. Furthermore, recognizing that real-world tasks are inherently noisy, we conduct a systematic analysis and decomposition of real-world noise patterns, and design targeted training procedures to explicitly incorporate such imperfections into the training process, resulting in improved robustness for real-world applications. To further enhance performance on complex reasoning tasks, we introduce a Heavy Thinking mode that enables effective test-time scaling by jointly expanding reasoning depth and width through intensive parallel thinking.

[903] KAPSO: A Knowledge-grounded framework for Autonomous Program Synthesis and Optimization

Alireza Nadafian, Alireza Mohammadshahi, Majid Yazdani

Main category: cs.AI

TL;DR: KAPSO is a modular framework for autonomous program synthesis and optimization that uses iterative loops of ideation, code synthesis, execution, evaluation, and learning to improve runnable artifacts toward measurable objectives.

Details

Motivation: Addresses long-horizon failures in coding agents including lost experimental state, brittle debugging, and weak reuse of domain expertise by creating a systematic approach to program synthesis and optimization.

Method: Uses three tightly coupled components: 1) git-native experimentation engine for reproducible artifacts, 2) knowledge system for heterogeneous source ingestion and structured representation, and 3) cognitive memory layer for coordinating retrieval and maintaining episodic store of reusable lessons.

Result: Evaluated on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization) with reported end-to-end performance.

Conclusion: KAPSO treats synthesis as an operator within long-horizon optimization loops rather than as an endpoint, enabling systematic improvement of runnable artifacts through iterative learning and knowledge reuse.

Abstract: We introduce KAPSO, a modular framework for autonomous program synthesis and optimization. Given a natural language goal and an evaluation method, KAPSO iteratively performs ideation, code synthesis and editing, execution, evaluation, and learning to improve a runnable artifact toward measurable objectives. Rather than treating synthesis as the endpoint, KAPSO uses synthesis as an operator within a long-horizon optimization loop, where progress is defined by evaluator outcomes. KAPSO targets long-horizon failures common in coding agents, including lost experimental state, brittle debugging, and weak reuse of domain expertise, by integrating three tightly coupled components. First, a git-native experimentation engine isolates each attempt as a branch, producing reproducible artifacts and preserving provenance across iterations. Second, a knowledge system ingests heterogeneous sources, including repositories, internal playbooks, and curated external resources such as documentation, scientific papers, and web search results, and organizes them into a structured representation that supports retrieval over workflows, implementations, and environment constraints. Third, a cognitive memory layer coordinates retrieval and maintains an episodic store of reusable lessons distilled from experiment traces (run logs, diffs, and evaluator feedback), reducing repeated error modes and accelerating convergence. We evaluated KAPSO on MLE-Bench (Kaggle-style ML competitions) and ALE-Bench (AtCoder heuristic optimization), and report end-to-end performance. Code Available at: https://github.com/Leeroo-AI/kapso

[904] MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution

Libo Sun, Jiwen Zhang, Siyuan Wang, Zhongyu Wei

Main category: cs.AI

TL;DR: MAGNET is a memory-driven adaptive GUI agent framework with dual-level memory that links visual features to stable functional semantics and captures task intents across varying workflows, improving performance in evolving software environments.

Details

Motivation: Mobile GUI agents trained on historical data fail when UI appearances and workflows change due to frequent software updates, despite underlying functional semantics and task intents remaining stable.

Method: Introduces MAGNET with dual-level memory: stationary memory links diverse visual features to stable functional semantics for robust action grounding, and procedural memory captures stable task intents across varying workflows. Uses dynamic memory evolution mechanism that continuously refines both memories by prioritizing frequently accessed knowledge.

Result: Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts.

Conclusion: Leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments, validating the approach of focusing on functional semantics rather than surface appearance.

Abstract: Mobile GUI agents powered by large foundation models enable autonomous task execution, but frequent updates altering UI appearance and reorganizing workflows cause agents trained on historical data to fail. Despite surface changes, functional semantics and task intents remain fundamentally stable. Building on this insight, we introduce MAGNET, a memory-driven adaptive agent framework with dual-level memory: stationary memory linking diverse visual features to stable functional semantics for robust action grounding and procedural memory capturing stable task intents across varying workflows. We propose a dynamic memory evolution mechanism that continuously refines both memories by prioritizing frequently accessed knowledge. Online benchmark AndroidWorld evaluations show substantial improvements over baselines, while offline benchmarks confirm consistent gains under distribution shifts. These results validate that leveraging stable structures across interface changes improves agent performance and generalization in evolving software environments.

[905] AMA: Adaptive Memory via Multi-Agent Collaboration

Weiquan Huang, Zixuan Wang, Hehai Lin, Sudong Wang, Bo Xu, Qian Li, Beier Zhu, Linyi Yang, Chengwei Qin

Main category: cs.AI

TL;DR: AMA is a multi-agent memory framework for LLMs that uses coordinated agents to manage memory at multiple granularities, improving retrieval precision and consistency while reducing token usage.

Details

Motivation: Current LLM agent memory systems have rigid retrieval granularity, accumulation-heavy maintenance, and coarse-grained updates, leading to mismatches between stored information and task needs, plus logical inconsistencies over time.

Method: AMA uses a hierarchical memory design with coordinated agents: Constructor and Retriever for multi-granularity memory construction and adaptive query routing; Judge for relevance/consistency verification; Refresher for targeted updates and conflict resolution.

Result: AMA significantly outperforms state-of-the-art baselines on long-context benchmarks while reducing token consumption by ~80% compared to full-context methods, demonstrating improved retrieval precision and long-term memory consistency.

Conclusion: The AMA framework effectively addresses memory system limitations in LLM agents through multi-agent collaboration, enabling adaptive granularity, efficient retrieval, and consistent memory maintenance.

Abstract: The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms. These design choices create a persistent mismatch between stored information and task-specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long-context benchmarks show that AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods, demonstrating its effectiveness in maintaining retrieval precision and long-term memory consistency.

[906] OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

Jarrod Barnes

Main category: cs.AI

TL;DR: OpenSec is a dual-control reinforcement learning environment that evaluates incident response agents’ ability to handle adversarial prompt injection scenarios, revealing calibration failures in frontier LLMs through execution-based metrics.

Details

Motivation: As LLMs become more capable, their offensive applications increase (e.g., generating exploits cheaply), requiring defensive incident response agents to keep pace. Existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence.

Method: Introduces OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static benchmarks, it scores world-state-changing containment actions under adversarial evidence using execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates.

Result: Evaluating four frontier models on 40 standard-tier episodes revealed consistent over-triggering: GPT-5.2, Gemini 3, and DeepSeek executed containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 showed partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces calibration failures hidden by aggregate success metrics.

Conclusion: OpenSec successfully identifies calibration failure modes in incident response agents that are hidden by traditional benchmarks, showing that frontier LLMs tend to over-trigger containment actions when faced with adversarial evidence, highlighting the need for better calibration in security applications.

Abstract: As large language models improve, so do their offensive applications: frontier agents now generate working exploits for under $50 in compute (Heelan, 2026). Defensive incident response (IR) agents must keep pace, but existing benchmarks conflate action execution with correct execution, hiding calibration failures when agents process adversarial evidence. We introduce OpenSec, a dual-control reinforcement learning environment that evaluates IR agents under realistic prompt injection scenarios. Unlike static capability benchmarks, OpenSec scores world-state-changing containment actions under adversarial evidence via execution-based metrics: time-to-first-containment (TTFC), blast radius (false positives per episode), and injection violation rates. Evaluating four frontier models on 40 standard-tier episodes, we find consistent over-triggering in this setting: GPT-5.2, Gemini 3, and DeepSeek execute containment in 100% of episodes with 90-97% false positive rates. Claude Sonnet 4.5 shows partial calibration (85% containment, 72% FP), demonstrating that OpenSec surfaces a calibration failure mode hidden by aggregate success metrics. Code available at https://github.com/jbarnes850/opensec-env.

[907] ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design

Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, Yufei Ding

Main category: cs.AI

TL;DR: ChipBench: A comprehensive benchmark for evaluating LLMs in AI-aided chip design across Verilog generation, debugging, and reference model generation tasks, revealing significant performance gaps compared to existing saturated benchmarks.

Details

Motivation: Current benchmarks for LLMs in hardware engineering suffer from saturation and limited task diversity, failing to reflect real industrial workflows. There's a need for more comprehensive evaluation of LLMs' capabilities in chip design tasks.

Method: Proposed ChipBench with 44 realistic hierarchical Verilog modules, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Includes automated toolbox for generating high-quality training data.

Result: State-of-the-art Claude-4.5-opus achieved only 30.74% on Verilog generation and 13.33% on Python reference model generation, showing significant challenges compared to existing benchmarks where SOTA models achieve over 95% pass rates.

Conclusion: ChipBench reveals substantial performance gaps in LLMs for hardware engineering tasks, demonstrating the need for more challenging benchmarks and highlighting opportunities for improvement in AI-aided chip design.

Abstract: While Large Language Models (LLMs) show significant potential in hardware engineering, current benchmarks suffer from saturation and limited task diversity, failing to reflect LLMs’ performance in real industrial workflows. To address this gap, we propose a comprehensive benchmark for AI-aided chip design that rigorously evaluates LLMs across three critical tasks: Verilog generation, debugging, and reference model generation. Our benchmark features 44 realistic modules with complex hierarchical structures, 89 systematic debugging cases, and 132 reference model samples across Python, SystemC, and CXXRTL. Evaluation results reveal substantial performance gaps, with state-of-the-art Claude-4.5-opus achieving only 30.74% on Verilog generation and 13.33% on Python reference model generation, demonstrating significant challenges compared to existing saturated benchmarks where SOTA models achieve over 95% pass rates. Additionally, to help enhance LLM reference model generation, we provide an automated toolbox for high-quality training data generation, facilitating future research in this underexplored domain. Our code is available at https://github.com/zhongkaiyu/ChipBench.git.

[908] Language-based Trial and Error Falls Behind in the Era of Experience

Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao

Main category: cs.AI

TL;DR: SCOUT framework uses lightweight “scout” models to explore nonlinguistic environments efficiently, then bootstraps LLMs with collected data to overcome exploration costs in symbolic/spatial tasks.

Details

Motivation: LLMs struggle with unseen nonlinguistic environments (symbolic/spatial tasks) due to prohibitive exploration costs - extensive trial-and-error is computationally unsustainable for large models in high-dimensional semantic spaces.

Method: Proposes SCOUT framework that decouples exploration from exploitation: uses lightweight “scout” models (e.g., small MLPs) to probe environmental dynamics efficiently, collects trajectories to bootstrap LLM via Supervised Fine-Tuning, followed by multi-turn Reinforcement Learning to activate latent world knowledge.

Result: SCOUT enables Qwen2.5-3B-Instruct to achieve average score of 0.86, significantly outperforming proprietary models like Gemini-2.5-Pro (0.60) while saving about 60% GPU hours consumption.

Conclusion: The exploration cost bottleneck in nonlinguistic environments can be overcome by decoupling exploration from exploitation using lightweight scouts, enabling efficient adaptation of LLMs to unseen symbolic/spatial tasks.

Abstract: While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial-and-error, which is computationally unsustainable for parameter-heavy LLMs operating in a high dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight “scouts” (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are utilized to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while saving about 60% GPU hours consumption.

[909] Bridging Forecast Accuracy and Inventory KPIs: A Simulation-Based Software Framework

So Fukuhara, Abdallah Alabdallah, Nuwan Gunasekara, Slawomir Nowaczyk

Main category: cs.AI

TL;DR: A simulation framework for evaluating spare parts forecasting models based on operational KPIs (cost, service level) rather than statistical accuracy metrics, showing that accuracy improvements don’t always translate to better inventory outcomes.

Details

Motivation: Current forecasting evaluation focuses on statistical accuracy (MAE, RMSE) but ignores operational impact on inventory management KPIs like total cost and service level, creating a gap between model evaluation and real-world performance.

Method: Proposes a decision-centric simulation framework with three components: (1) synthetic demand generator for spare-parts characteristics, (2) flexible forecasting module for arbitrary models, and (3) inventory control simulator that computes operational KPIs from forecasts.

Result: Shows that improvements in accuracy metrics don’t necessarily lead to better KPIs, and models with similar error profiles can produce different cost-service trade-offs, revealing discrepancies between statistical and operational performance.

Conclusion: The framework bridges demand forecasting and inventory management, shifting evaluation from predictive accuracy to operational relevance, with implications for automotive aftermarket and related domains.

Abstract: Efficient management of spare parts inventory is crucial in the automotive aftermarket, where demand is highly intermittent and uncertainty drives substantial cost and service risks. Forecasting is therefore central, but the quality of forecasting models should be judged not by statistical accuracy (e.g., MAE, RMSE) but rather by its impact on key operational performance indicators (KPIs), such as total cost and service level. Yet most existing work evaluates models exclusively using accuracy metrics, and the relationship between these metrics and KPIs remains poorly understood. To address this gap, we propose a decision-centric simulation software framework that enables systematic evaluation of forecasting models in realistic inventory management setting. The framework comprises: (i) a synthetic demand generator tailored to spare-parts demand characteristics, (ii) a flexible forecasting module that can host arbitrary predictive models, and (iii) an inventory control simulator that consumes the forecasts and computes operational KPIs. This closed-loop setup enables researchers to evaluate models not only in terms of statistical error but also in terms of downstream inventory implications. Using a wide range of simulation scenarios, we show that improvements in accuracy metrics do not necessarily lead to better KPIs, and that models with similar error profiles can induce different cost-service trade-offs. We analyze these discrepancies to characterize how forecast performance affects inventory outcomes and derive guidance for model selection. Overall, the framework links demand forecasting and inventory management, shifting evaluation from predictive accuracy toward operational relevance in the automotive aftermarket and related domains. An open-source implementation of the software is available at https://github.com/caisr-hh/TruckParts-Demand-Inventory-Simulator/releases/tag/IDA_2026.

[910] Optimizing Agentic Workflows using Meta-tools

Sami Abuzakuk, Anne-Marie Kermarrec, Rishi Sharma, Rasmus Moorits Veski, Martijn de Vos

Main category: cs.AI

TL;DR: AWO framework optimizes agentic AI workflows by identifying redundant tool execution patterns and transforming them into meta-tools to reduce LLM calls and improve success rates.

Details

Motivation: Agentic AI workflows often require many iterative reasoning steps and tool invocations, leading to high operational costs, latency, and failures due to hallucinations. There's a need to optimize these workflows for better efficiency and robustness.

Method: Agent Workflow Optimization (AWO) analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools - deterministic, composite tools that bundle multiple agent actions into a single invocation, bypassing unnecessary intermediate LLM reasoning steps.

Result: Experiments on two agentic AI benchmarks show AWO reduces LLM calls by up to 11.9% while increasing task success rate by up to 4.2 percentage points.

Conclusion: AWO effectively optimizes agentic workflows by identifying and compressing redundant execution patterns, improving both efficiency and robustness of agentic AI systems.

Abstract: Agentic AI enables LLM to dynamically reason, plan, and interact with tools to solve complex tasks. However, agentic workflows often require many iterative reasoning steps and tool invocations, leading to significant operational expense, end-to-end latency and failures due to hallucinations. This work introduces Agent Workflow Optimization (AWO), a framework that identifies and optimizes redundant tool execution patterns to improve the efficiency and robustness of agentic workflows. AWO analyzes existing workflow traces to discover recurring sequences of tool calls and transforms them into meta-tools, which are deterministic, composite tools that bundle multiple agent actions into a single invocation. Meta-tools bypass unnecessary intermediate LLM reasoning steps and reduce operational cost while also shortening execution paths, leading to fewer failures. Experiments on two agentic AI benchmarks show that AWO reduces the number of LLM calls up to 11.9% while also increasing the task success rate by up to 4.2 percent points.

[911] Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

Tony Feng, Trieu Trinh, Garrett Bingham, Jiwon Kang, Shengtong Zhang, Sang-hyun Kim, Kevin Barreto, Carl Schildkraut, Junehyuk Jung, Jaehyeon Seo, Carlo Pagano, Yuri Chervonyi, Dawsen Hwang, Kaiying Hou, Sergei Gukov, Cheng-Chiang Tsai, Hyunwoo Choi, Youngbeom Jin, Wei-Yuan Li, Hao-An Wu, Ruey-An Shiu, Yu-Sheng Shih, Quoc V. Le, Thang Luong

Main category: cs.AI

TL;DR: AI-assisted analysis of 700 open Erdős problems using Gemini, addressing 13 problems with 5 novel solutions and 8 literature identifications, revealing issues with AI in mathematics

Details

Motivation: To explore the potential of AI in semi-autonomous mathematics discovery by systematically evaluating open conjectures in Bloom's Erdős Problems database, understanding AI's capabilities and limitations in mathematical research

Method: Hybrid approach: AI-driven natural language verification to narrow search space, followed by human expert evaluation for correctness and novelty. Applied Gemini to analyze 700 conjectures labeled ‘Open’ in the database

Result: Addressed 13 problems marked ‘Open’: 5 through seemingly novel autonomous solutions, 8 through identification of previous solutions in existing literature. Found that ‘Open’ status was due to obscurity rather than difficulty

Conclusion: AI can assist in mathematics discovery but faces challenges including difficulty of literature identification and risk of ‘subconscious plagiarism’. The study provides insights into AI’s role in mathematical research and highlights practical issues in scaling such approaches

Abstract: We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled ‘Open’ in Bloom’s Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked ‘Open’ in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the ‘Open’ status of the problems was through obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to math conjectures at scale, highlighting the difficulty of literature identification and the risk of ‘‘subconscious plagiarism’’ by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.

cs.SD

[912] LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild

Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan

Main category: cs.SD

Details

Result: The method achieves outstanding performance in lip synchronization accuracy and visual quality, as demonstrated by both subjective and objective evaluations.

[913] Multi-Speaker Conversational Audio Deepfake: Taxonomy, Dataset and Pilot Study

Alabi Ahmed, Vandana Janeja, Sanjay Purushotham

Main category: cs.SD

TL;DR: Paper introduces MsCADD dataset for multi-speaker conversational audio deepfake detection, benchmarking existing models on TTS-generated two-speaker conversations.

Details

Motivation: Existing audio deepfake detection research focuses on single-speaker scenarios, but real-world threats increasingly involve multi-speaker conversational settings which remain underexplored.

Method: Proposed conceptual taxonomy for multi-speaker conversational audio deepfakes, created MsCADD dataset with 2,830 audio clips of real and synthetic two-speaker conversations using VITS and SoundStorm-based NotebookLM models, benchmarked three neural models (LFCC-LCNN, RawNet2, Wav2Vec 2.0).

Result: Baseline models provided useful benchmarks but showed significant performance gaps in detecting synthetic voices under varied conversational dynamics, highlighting the challenge of multi-speaker deepfake detection.

Conclusion: Multi-speaker conversational audio deepfake detection is a critical underexplored area; MsCADD dataset provides foundation for future research to address this emerging threat to audio information trustworthiness.

Abstract: The rapid advances in text-to-speech (TTS) technologies have made audio deepfakes increasingly realistic and accessible, raising significant security and trust concerns. While existing research has largely focused on detecting single-speaker audio deepfakes, real-world malicious applications with multi-speaker conversational settings is also emerging as a major underexplored threat. To address this gap, we propose a conceptual taxonomy of multi-speaker conversational audio deepfakes, distinguishing between partial manipulations (one or multiple speakers altered) and full manipulations (entire conversations synthesized). As a first step, we introduce a new Multi-speaker Conversational Audio Deepfakes Dataset (MsCADD) of 2,830 audio clips containing real and fully synthetic two-speaker conversations, generated using VITS and SoundStorm-based NotebookLM models to simulate natural dialogue with variations in speaker gender, and conversational spontaneity. MsCADD is limited to text-to-speech (TTS) types of deepfake. We benchmark three neural baseline models; LFCC-LCNN, RawNet2, and Wav2Vec 2.0 on this dataset and report performance in terms of F1 score, accuracy, true positive rate (TPR), and true negative rate (TNR). Results show that these baseline models provided a useful benchmark, however, the results also highlight that there is a significant gap in multi-speaker deepfake research in reliably detecting synthetic voices under varied conversational dynamics. Our dataset and benchmarks provide a foundation for future research on deepfake detection in conversational scenarios, which is a highly underexplored area of research but also a major area of threat to trustworthy information in audio settings. The MsCADD dataset is publicly available to support reproducibility and benchmarking by the research community.

[914] RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

Xinting Liao, Ruinan Jin, Hanlin Yu, Deval Pandya, Xiaoxiao Li

Main category: cs.SD

TL;DR: RVCBench is a comprehensive benchmark for evaluating robustness in voice cloning models across 10 tasks, 225 speakers, and 11 models, revealing significant performance degradation under realistic deployment conditions.

Details

Motivation: Voice cloning models face robustness challenges in real-world deployment due to noisy reference audio, imperfect text prompts, and diverse downstream processing, but current research lacks systematic evaluation of these practical issues.

Method: Developed RVCBench, a comprehensive benchmark evaluating VC robustness across the full generation pipeline: input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models.

Result: Substantial robustness gaps were uncovered: performance deteriorates sharply under common input shifts and post-processing; long-context and cross-lingual scenarios expose stability limitations; both passive noise and proactive perturbation affect generation robustness.

Conclusion: The benchmark provides a unified picture of how current VC models fail in practice and offers a standardized, open-source testbed to support development of more robust and deployable voice cloning models.

Abstract: Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audios, imperfect text prompts, and diverse downstream processing, which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbation influence generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.

[915] Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

Yong Ren, Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Tao Wang

Main category: cs.SD

TL;DR: A novel framework for imperceptible text-based speech editing that separates content editing in semantic space from acoustic reconstruction using Flow Matching, with perceptual alignment via self-consistency rewards.

Details

Motivation: Current text-based speech editing methods in acoustic space suffer from content-style entanglement, leading to generation instability and boundary artifacts when modifying spoken content while preserving surrounding context.

Method: Two core components: (1) Structural Foundations - decouples editing into stable semantic space while delegating acoustic reconstruction to Flow Matching decoder; (2) Perceptual Alignment - uses Self-Consistency Rewards Group Relative Policy Optimization with pre-trained TTS model as implicit critic, plus intelligibility and duration constraints.

Result: Empirical evaluations show the method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.

Conclusion: The proposed “Edit Content, Preserve Acoustics” framework effectively addresses content-style entanglement in speech editing through semantic space decoupling and perceptual alignment, enabling seamless speech modifications.

Abstract: Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of “Edit Content, Preserve Acoustics”. Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic – complemented by strict intelligibility and duration constraints – we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.

[916] Dual-View Predictive Diffusion: Lightweight Speech Enhancement via Spectrogram-Image Synergy

Ke Xue, Rongfei Fan, Kai Li, Shanping Yu, Puning Zhao, Jianping An

Main category: cs.SD

TL;DR: DVPD is a lightweight dual-view predictive diffusion model for speech enhancement that exploits spectrograms as both visual textures and frequency-domain representations, achieving SOTA performance with 35% parameters and 40% MACs of previous SOTA.

Details

Motivation: Current diffusion models for speech enhancement treat spectrograms as generic 2D images, ignoring audio's intrinsic structural sparsity, leading to inefficient spectral representation and high computational complexity.

Method: Uses dual-view approach: Frequency-Adaptive Non-uniform Compression (FANC) encoder preserves low-frequency harmonics while pruning high-frequency redundancies, and Lightweight Image-based Spectro-Awareness (LISA) module captures visual features. During inference, employs Training-free Lossless Boost (TLB) strategy using dual-view priors.

Result: Achieves state-of-the-art performance across various benchmarks while requiring only 35% of parameters and 40% of inference MACs compared to previous SOTA lightweight model PGUSE.

Conclusion: DVPD demonstrates superior ability to balance high-fidelity speech quality with extreme architectural efficiency by exploiting the dual nature of spectrograms.

Abstract: Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to SOTA lightweight model, PGUSE. These results highlight DVPD’s superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: {https://anonymous.4open.science/r/dvpd_demo-E630}

[917] The TMU System for the XACLE Challenge: Training Large Audio Language Models with CLAP Pseudo-Labels

Ayuto Tsutsumi, Kohei Tanaka, Sayaka Shiota

Main category: cs.SD

TL;DR: A system for audio-text alignment using a large audio language model with three-stage training, achieving strong performance on the XACLE challenge.

Details

Motivation: To address the challenge of semantic alignment between general audio and text pairs, which is important for multimodal understanding and has applications in audio retrieval, captioning, and cross-modal reasoning.

Method: Three-stage training pipeline: 1) Automated audio captioning pretraining, 2) Pretraining with CLAP pseudo-labels, 3) Fine-tuning on the XACLE dataset. Uses a large audio language model architecture.

Result: Achieved SRCC of 0.632 on XACLE test set, significantly outperforming baseline (0.334) and securing third place in the challenge ranking. CLAP pseudo-label pretraining was identified as the primary performance driver.

Conclusion: The proposed three-stage training approach with CLAP pseudo-labels effectively improves audio-text alignment performance, demonstrating the value of leveraging existing audio-text models for alignment tasks.

Abstract: In this paper, we propose a submission to the x-to-audio alignment (XACLE) challenge. The goal is to predict semantic alignment of a given general audio and text pair. The proposed system is based on a large audio language model (LALM) architecture. We employ a three-stage training pipeline: automated audio captioning pretraining, pretraining with CLAP pseudo-labels, and fine-tuning on the XACLE dataset. Our experiments show that pretraining with CLAP pseudo-labels is the primary performance driver. On the XACLE test set, our system reaches an SRCC of 0.632, significantly outperforming the baseline system (0.334) and securing third place in the challenge team ranking. Code and models can be found at https://github.com/shiotalab-tmu/tmu-xacle2026

[918] Audio-to-Image Bird Species Retrieval without Audio-Image Pairs via Text Distillation

Ilyass Moummad, Marius Miron, Lukas Rauch, David Robinson, Alexis Joly, Olivier Pietquin, Emmanuel Chemla, Matthieu Geist

Main category: cs.SD

TL;DR: Audio-to-image retrieval for bioacoustic species recognition using text as semantic intermediary to align audio and image representations without paired audio-image data.

Details

Motivation: Audio-to-image retrieval offers interpretable bioacoustic species recognition, but learning aligned audio-image representations is challenging due to scarcity of paired audio-image data.

Method: Uses text as semantic intermediary: distills text embedding space of pretrained image-text model (BioCLIP-2) into pretrained audio-text model (BioLingual) by fine-tuning audio encoder with contrastive objective, transferring visually grounded semantics into audio representation.

Result: Distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. On SSW60 benchmark, achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings.

Conclusion: Indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing practical solution for visually grounded species recognition in data-scarce bioacoustic settings.

Abstract: Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.

[919] ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo

Main category: cs.SD

TL;DR: ACE-Step v1.5 is an efficient open-source music foundation model that achieves commercial-grade music generation on consumer hardware with fast inference and low VRAM requirements.

Details

Motivation: To create an accessible, high-quality music generation model that runs efficiently on consumer hardware while supporting personalization and diverse creative workflows.

Method: Uses a novel hybrid architecture with a Language Model as an omni-capable planner that transforms user queries into comprehensive song blueprints via Chain-of-Thought reasoning, guiding a Diffusion Transformer for synthesis. Features intrinsic reinforcement learning without external reward models.

Result: Achieves quality beyond most commercial music models with extremely fast inference (under 2 seconds on A100, under 10 seconds on RTX 3090), runs locally with <4GB VRAM, supports personalization via LoRA training, and enables precise stylistic control across 50+ languages.

Conclusion: ACE-Step v1.5 provides a powerful, accessible tool for music creation that integrates seamlessly into creative workflows while maintaining high quality and efficiency on consumer hardware.

Abstract: We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast – under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints – scaling from short loops to 10-minute compositions – while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model’s internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities – such as cover generation, repainting, and vocal-to-BGM conversion – while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/

[920] HierCon: Hierarchical Contrastive Attention for Audio Deepfake Detection

Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang, Christopher Leckie

Main category: cs.SD

TL;DR: HierCon: A hierarchical layer attention framework with contrastive learning for audio deepfake detection that models temporal and layer dependencies to improve generalization across domains.

Details

Motivation: Audio deepfakes are becoming increasingly realistic and pose serious security risks. Existing detectors using self-supervised models treat layers independently, missing critical temporal and hierarchical dependencies needed to identify synthetic artifacts.

Method: Proposes HierCon, which combines hierarchical layer attention with margin-based contrastive learning. Models dependencies across temporal frames, neighboring layers, and layer groups while encouraging domain-invariant embeddings.

Result: Achieves state-of-the-art performance on ASVspoof 2021 DF and In-the-Wild datasets with 1.93% and 6.87% EER, improving over independent layer weighting by 36.6% and 22.5% respectively.

Conclusion: Hierarchical modeling enhances generalization to cross-domain generation techniques and recording conditions, with attention visualizations confirming the importance of modeling layer dependencies.

Abstract: Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.

[921] TLDiffGAN: A Latent Diffusion-GAN Framework with Temporal Information Fusion for Anomalous Sound Detection

Chengyuan Ma, Peng Jia, Hongyue Guo, Wenming Yang

Main category: cs.SD

TL;DR: TLDiffGAN: A novel framework combining latent diffusion models with GANs for unsupervised anomalous sound detection, enhanced by pretrained audio encoders and TMixup spectrogram augmentation.

Details

Motivation: Existing generative models for anomalous sound detection fail to fully capture complex normal sound distributions, and powerful diffusion models remain unexplored in this domain. The paper aims to address these limitations.

Method: Proposes TLDiffGAN with two branches: 1) integrates latent diffusion model into GAN generator for adversarial training to improve sample quality, 2) uses pretrained audio model encoders to extract features from raw audio for auxiliary discrimination. Also introduces TMixup spectrogram augmentation to enhance sensitivity to subtle temporal patterns.

Result: Extensive experiments on DCASE 2020 Challenge Task 2 dataset show superior detection performance and strong capability in anomalous time-frequency localization.

Conclusion: TLDiffGAN effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms, demonstrating state-of-the-art performance in unsupervised anomalous sound detection.

Abstract: Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator’s task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.

[922] Causally Disentangled Contrastive Learning for Multilingual Speaker Embeddings

Mariëtte Olijslager, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag

Main category: cs.SD

TL;DR: This paper investigates demographic information leakage (gender, age, accent) in self-supervised speaker embeddings trained with SimCLR and evaluates two debiasing strategies: adversarial training and causal bottleneck architecture, finding trade-offs between demographic suppression and speaker verification performance.

Details

Motivation: Self-supervised speaker embeddings used in verification systems often encode sensitive demographic attributes, raising fairness and privacy concerns. The paper aims to investigate the extent of demographic information leakage and whether it can be mitigated without severely degrading speaker verification performance.

Method: The study uses SimCLR-trained speaker embeddings and examines two debiasing strategies: 1) adversarial training through gradient reversal, and 2) a causal bottleneck architecture that explicitly separates demographic and residual information. Demographic leakage is quantified using both linear and nonlinear probing classifiers, while speaker verification performance is evaluated using ROC-AUC and EER metrics.

Result: Results show that gender information is strongly and linearly encoded in baseline embeddings, while age and accent are weaker and primarily nonlinearly represented. Adversarial debiasing reduces gender leakage but has limited effect on age and accent, with clear trade-offs in verification accuracy. The causal bottleneck further suppresses demographic information, particularly in residual representations, but incurs substantial performance degradation.

Conclusion: The findings highlight fundamental limitations in mitigating demographic leakage in self-supervised speaker embeddings and clarify the trade-offs inherent in current debiasing approaches, suggesting that demographic information is deeply embedded in these representations.

Abstract: Self-supervised speaker embeddings are widely used in speaker verification systems, but prior work has shown that they often encode sensitive demographic attributes, raising fairness and privacy concerns. This paper investigates the extent to which demographic information, specifically gender, age, and accent, is present in SimCLR-trained speaker embeddings and whether such leakage can be mitigated without severely degrading speaker verification performance. We study two debiasing strategies: adversarial training through gradient reversal and a causal bottleneck architecture that explicitly separates demographic and residual information. Demographic leakage is quantified using both linear and nonlinear probing classifiers, while speaker verification performance is evaluated using ROC-AUC and EER. Our results show that gender information is strongly and linearly encoded in baseline embeddings, whereas age and accent are weaker and primarily nonlinearly represented. Adversarial debiasing reduces gender leakage but has limited effect on age and accent and introduces a clear trade-off with verification accuracy. The causal bottleneck further suppresses demographic information, particularly in the residual representation, but incurs substantial performance degradation. These findings highlight fundamental limitations in mitigating demographic leakage in self-supervised speaker embeddings and clarify the trade-offs inherent in current debiasing approaches.

[923] Attention-weighted Centered Kernel Alignment for Knowledge Distillation in Large Audio-Language Models Applied to Speech Emotion Recognition

Qingran Yang, Botao Zhao, Zuheng Kang, Xue Li, Yayun He, Chuhang Liu, Xulong Zhang, Xiaoyang Qu, Junqing Peng, Jianzong Wang

Main category: cs.SD

TL;DR: PL-Distill is a knowledge distillation framework that compresses large audio-language models for speech emotion recognition by combining projector-level and logits-level distillation techniques.

Details

Motivation: Large Audio-Language Models (LALMs) show promise for Speech Emotion Recognition (SER) but are too large for resource-constrained environments. Existing knowledge distillation methods don't adequately address the cross-modal projection module and struggle with feature dimension alignment.

Method: Proposes PL-Distill with two components: 1) Projector-Level Distillation (PDist) using Attention-weighted Centered Kernel Alignment to align audio embeddings and handle dimension mismatches, and 2) Logits-Level Distillation (LDist) using KL divergence to align output logits from audio and text modalities.

Result: Successfully compresses an 8.4B-parameter teacher model to a 1.1B-parameter student model that consistently outperforms the teacher, state-of-the-art pretrained models, and other KD baselines on IEMOCAP, RAVDESS, and SAVEE datasets across all metrics.

Conclusion: PL-Distill effectively addresses the limitations of existing KD methods for LALM compression, providing a practical solution for deploying speech emotion recognition in resource-constrained environments while maintaining or improving performance.

Abstract: The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in distilling the cross-modal projection module (Projector), and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel approach we propose to highlight important time steps and address dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.

[924] Membership Inference Attack Against Music Diffusion Models via Generative Manifold Perturbation

Yuxuan Liu, Peihong Zhang, Rui Sang, Zhixin Li, Yizhou Tan, Yiqiang Cai, Shengchen Li

Main category: cs.SD

TL;DR: LSA-Probe: A white-box membership inference attack method for generative music models that measures geometric stability in reverse diffusion to detect training data with high precision at low false-positive rates.

Details

Motivation: Current membership inference attacks for audio models rely on loss-based signals that are poorly aligned with human perception, leading to inadequate separability at the low false-positive rates needed for copyright compliance auditing of generative music models.

Method: Proposes Latent Stability Adversarial Probe (LSA-Probe), a white-box method that measures geometric properties of reverse diffusion: the minimal time-normalized perturbation budget needed to cross a fixed perceptual degradation threshold at intermediate diffusion states.

Result: Training members reside in more stable regions and exhibit significantly higher degradation costs, enabling better separability for membership inference at low false-positive rates.

Conclusion: LSA-Probe provides an effective geometric approach for membership inference in generative audio models, addressing the limitations of loss-based methods for copyright compliance auditing.

Abstract: Membership inference attacks (MIAs) test whether a specific audio clip was used to train a model, making them a key tool for auditing generative music models for copyright compliance. However, loss-based signals (e.g., reconstruction error) are weakly aligned with human perception in practice, yielding poor separability at the low false-positive rates (FPRs) required for forensics. We propose the Latent Stability Adversarial Probe (LSA-Probe), a white-box method that measures a geometric property of the reverse diffusion: the minimal time-normalized perturbation budget needed to cross a fixed perceptual degradation threshold at an intermediate diffusion state. We show that training members, residing in more stable regions, exhibit a significantly higher degradation cost.

[925] Voting-based Pitch Estimation with Temporal and Frequential Alignment and Correlation Aware Selection

Junya Koguchi, Tomoki Koriyama

Main category: cs.SD

TL;DR: Theoretical analysis and practical improvements to voting-based ensemble methods for fundamental frequency (F0) estimation, with alignment procedures and greedy estimator selection for better performance.

Details

Motivation: Voting ensemble methods for F0 estimation are empirically robust but lack theoretical foundation and have practical limitations that need addressing for improved performance.

Method: 1) Theoretical analysis using error variance reduction and Condorcet’s jury theorem; 2) Pre-voting alignment to correct temporal/frequency biases; 3) Greedy algorithm for selecting compact, effective estimator subsets based on error correlation.

Result: Proposed method with alignment outperforms individual state-of-the-art estimators in clean conditions and maintains robust voiced/unvoiced detection in noisy environments on diverse speech, singing, and music datasets.

Conclusion: The paper provides both theoretical foundation and practical improvements for voting-based F0 estimation, demonstrating enhanced performance across various audio types and conditions.

Abstract: The voting method, an ensemble approach for fundamental frequency estimation, is empirically known for its robustness but lacks thorough investigation. This paper provides a principled analysis and improvement of this technique. First, we offer a theoretical basis for its effectiveness, explaining the error variance reduction for fundamental frequency estimation and invoking Condorcet’s jury theorem for voiced/unvoiced detection accuracy. To address its practical limitations, we propose two key improvements: 1) a pre-voting alignment procedure to correct temporal and frequential biases among estimators, and 2) a greedy algorithm to select a compact yet effective subset of estimators based on error correlation. Experiments on a diverse dataset of speech, singing, and music show that our proposed method with alignment outperforms individual state-of-the-art estimators in clean conditions and maintains robust voiced/unvoiced detection in noisy environments.

[926] ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-based Neural Speech Codec

Fei Liu, Yang Ai

Main category: cs.SD

TL;DR: ParaGSE: A parallel generative speech enhancement framework using group vector quantization for efficient parallel token prediction, achieving superior speech quality and 1.5x faster generation on CPU.

Details

Motivation: Existing generative speech enhancement approaches suffer from excessive complexity, limited efficiency, and suboptimal speech quality. The authors aim to overcome these challenges with a more efficient parallel framework.

Method: Proposes ParaGSE framework using GVQ-based neural speech codec with separate vector quantizers to produce mutually independent tokens. Degraded speech is encoded into distinct tokens, clean tokens are predicted through parallel branches conditioned on degraded spectral features, and clean speech is reconstructed via codec decoder.

Result: ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines across various distortions (noise, reverberation, band-limiting, mixtures). Achieves ~1.5x improvement in generation efficiency on CPU compared to serial approaches.

Conclusion: ParaGSE effectively addresses efficiency and quality limitations in generative speech enhancement through parallel token prediction enabled by GVQ-based codec, offering both superior performance and computational efficiency.

Abstract: Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent tokens, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines, under a wide range of distortions including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.

[927] Speaking Without Sound: Multi-speaker Silent Speech Voicing with Facial Inputs Only

Jaejun Lee, Yoori Oh, Kyogu Lee

Main category: cs.SD

TL;DR: A framework for generating multi-speaker speech using silent EMG signals for linguistic content and facial images for speaker identity, with pitch-disentangled content embedding.

Details

Motivation: To enable speech generation without audible inputs by leveraging silent EMG signals for linguistic content and visual facial cues for speaker identity, addressing privacy and silent communication scenarios.

Method: Uses electromyography (EMG) signals to capture linguistic content, facial images for speaker vocal identity matching, and introduces pitch-disentangled content embedding to enhance linguistic content extraction from EMG signals.

Result: The method successfully generates multi-speaker speech without audible inputs, with extensive analysis confirming the effectiveness of the pitch-disentanglement approach.

Conclusion: The framework demonstrates the feasibility of silent speech generation using multimodal inputs (EMG + facial images) with improved content extraction through pitch disentanglement.

Abstract: In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to match with the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that enhances the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.

[928] LipSody: Lip-to-Speech Synthesis with Enhanced Prosody Consistency

Jaejun Lee, Yoori Oh, Kyogu Lee

Main category: cs.SD

TL;DR: LipSody: A lip-to-speech synthesis framework that improves prosody consistency by incorporating speaker identity, linguistic content, and emotional context cues from facial video.

Details

Motivation: Existing diffusion-based lip-to-speech models like LipVoicer reconstruct linguistic content well but lack prosodic consistency, which is crucial for natural-sounding speech generation from silent video.

Method: Proposes LipSody with a prosody-guiding strategy using three complementary cues: speaker identity from facial images, linguistic content from lip movements, and emotional context from face video.

Result: Substantially improves prosody-related metrics including global/local pitch deviations, energy consistency, and speaker similarity compared to prior approaches.

Conclusion: LipSody effectively enhances prosody consistency in lip-to-speech synthesis by leveraging multimodal facial cues, advancing the quality of speech generation from silent video.

Abstract: Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from face video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy consistency, and speaker similarity, compared to prior approaches.

[929] DFKI-Speech System for WildSpoof Challenge: A robust framework for SASV In-the-Wild

Arnab Das, Yassine El Kheir, Enes Erdem Erdogan, Feidi Kallel, Tim Polzehl, Sebastian Moeller

Main category: cs.SD

TL;DR: A robust Spoofing aware Automatic Speaker Verification (SASV) framework combining spoofing detection with speaker verification using self-supervised embeddings, graph neural networks, and multi-scale feature fusion.

Details

Motivation: To develop a robust system for the WildSpoof Challenge that can simultaneously detect spoofing attacks while performing accurate speaker verification, addressing security vulnerabilities in voice biometric systems.

Method: Proposes a tandem SASV framework with: 1) spoofing detector using self-supervised speech embeddings with graph neural network backend and top-3 layer MoE for feature fusion; 2) speaker verification with low-complexity CNN fusing 2D/1D multi-scale features trained with SphereFace loss and contrastive circle loss; 3) AS Norm score normalization and model ensembling.

Result: Developed the DFKI-Speech system for WildSpoof Challenge SASV track, achieving robust performance in spoofing detection and speaker verification through the proposed framework and techniques.

Conclusion: The proposed SASV framework effectively combines spoofing detection and speaker verification with advanced neural architectures and training strategies, demonstrating strong performance for secure voice biometric applications.

Abstract: This paper presents the DFKI-Speech system developed for the WildSpoof Challenge under the Spoofing aware Automatic Speaker Verification (SASV) track. We propose a robust SASV framework in which a spoofing detector and a speaker verification (SV) network operate in tandem. The spoofing detector employs a self-supervised speech embedding extractor as the frontend, combined with a state-of-the-art graph neural network backend. In addition, a top-3 layer based mixture-of-experts (MoE) is used to fuse high-level and low-level features for effective spoofed utterance detection. For speaker verification, we adapt a low-complexity convolutional neural network that fuses 2D and 1D features at multiple scales, trained with the SphereFace loss. Additionally, contrastive circle loss is applied to adaptively weight positive and negative pairs within each training batch, enabling the network to better distinguish between hard and easy sample pairs. Finally, fixed imposter cohort based AS Norm score normalization and model ensembling are used to further enhance the discriminative capability of the speaker verification system.

[930] Masked Autoencoders as Universal Speech Enhancer

Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang, Kyu Han

Main category: cs.SD

TL;DR: A self-supervised masked autoencoder approach for universal speech enhancement that handles multiple distortions simultaneously and can be fine-tuned for specific downstream tasks like denoising and dereverberation.

Details

Motivation: Supervised speech enhancement methods require clean speech data which is often unavailable in practical scenarios. There's a need for self-supervised learning approaches that offer comparable performance and can be applied to various downstream speech applications.

Method: Developed a masked autoencoder-based universal speech enhancer trained in self-supervised manner. Uses an augmentation stack to add distortions to noisy input, learns to remove added distortions while reconstructing masked spectrogram regions. Pre-trained embeddings are fine-tuned with small paired data for specific downstream tasks.

Result: Outperforms baseline and achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets for denoising and dereverberation tasks.

Conclusion: The proposed self-supervised masked autoencoder approach provides effective universal speech enhancement that handles multiple distortions and transfers well to downstream tasks with minimal fine-tuning data.

Abstract: Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.

[931] CAARMA: Class Augmentation with Adversarial Mixup Regularization

Massa Baali, Xiang Li, Hao Chen, Syed Abdul Hannan, Rita Singh, Bhiksha Raj

Main category: cs.SD

TL;DR: CAARMA is a class augmentation framework for speaker verification that generates synthetic classes through embedding space mixing to address limited class diversity in real-world datasets, using adversarial refinement to ensure synthetic class authenticity.

Details

Motivation: Real-world speaker verification datasets often lack sufficient class diversity to effectively train models for generalizable zero-shot learning, as these models need to learn compact same-class embeddings while maintaining separation across classes.

Method: CAARMA generates synthetic classes through data mixing in the embedding space to expand training classes, and uses an adversarial refinement mechanism to minimize categorical distinctions between synthetic and real classes, ensuring synthetic class authenticity.

Result: The framework demonstrates consistent improvements across multiple speaker verification tasks and other zero-shot comparison-based speech analysis tasks, achieving a significant 8% improvement over all baseline models.

Conclusion: CAARMA effectively addresses class diversity limitations in speaker verification datasets through synthetic class generation and adversarial refinement, leading to substantial performance improvements in zero-shot learning tasks.

Abstract: Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8% over all baseline models. The code is available at: https://github.com/massabaali7/CAARMA/

[932] PAL: Probing Audio Encoders via LLMs – Audio Information Transfer into LLMs

Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais

Main category: cs.SD

TL;DR: LAL is a lightweight audio integration method for LLMs that injects audio representations through attention mechanisms instead of prepending tokens, reducing computational overhead while maintaining or improving performance.

Details

Motivation: Current audio integration methods for LLMs (PLITS) project audio tokens into LLM input space and prepend them, which is computationally expensive. There's a need for more efficient transfer of rich audio semantics to enable better machine listening applications.

Method: Proposes LAL (Lightweight Audio LLM Integration) that injects audio representations solely through attention mechanisms at selected LLM layers, bypassing feed-forward modules. Also introduces PAL (Probing Audio encoders via LLM), a hybrid approach that combines PLITS for summary tokens with LAL for full audio token sequences.

Result: LAL consistently matches or outperforms existing integration approaches across multiple base LLMs and tasks, with up to 30% improvement over PLITS baseline, while reducing memory usage by ~60% and increasing throughput by ~190%. PAL matches or exceeds PLITS performance with better computational efficiency.

Conclusion: LAL and PAL provide efficient alternatives to traditional audio integration methods for LLMs, enabling better audio understanding with significantly reduced computational overhead, advancing multimodal LLM capabilities for audio applications.

Abstract: Integration of audio perception into large language models (LLMs) is an emerging research area for enabling machine listening applications, yet efficient transfer of rich audio semantics from audio encoders to LLMs remains underexplored. The most widely used integration paradigm projects audio-encoder output tokens into the LLM input space (e.g., via an MLP or a Q-Former) and then prepends or inserts them into the text token sequence. We refer to this generic scheme as Prepend to the LLM’s input token space (PLITS) integration. We propose an efficient alternative, Lightweight Audio LLM Integration (LAL). LAL injects audio representations solely through the attention mechanism at selected LLM layers, bypassing the feed-forward module. It encodes rich audio semantics at an appropriate level of abstraction for integration into different transformer blocks, substantially reducing computational overhead compared to existing approaches. We further introduce PAL, a hybrid integration approach for efficiently Probing Audio encoders via LLM. PAL applies PLITS only to a compact set of summary tokens while integrating the full audio token sequence via LAL. Under an identical training curriculum, LAL consistently matches or outperforms existing integration approaches across multiple base LLMs and tasks, with improvements of up to 30% over a strong PLITS baseline, while reducing memory usage by about 60% and increasing throughput by about 190%. Moreover, PAL matches or exceeds PLITS performance while offering substantially better computational and memory efficiency.

[933] Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method

Yuhang Jia, Hui Wang, Xin Nie, Yujie Guo, Lianru Gao, Yong Qin

Main category: cs.SD

TL;DR: AuditEval introduces a comprehensive evaluation framework for audio editing tasks, including a benchmark dataset (AuditScore) with professional annotations and automatic MOS-style evaluators to assess audio editing quality.

Details

Motivation: The lack of high-quality benchmark datasets and comprehensive evaluation metrics for audio editing tasks hinders both assessment of editing quality and improvement of the task itself.

Method: 1) Created AuditScore dataset with 6,300+ edited samples from 7 frameworks, professionally annotated on Quality, Relevance, and Faithfulness. 2) Developed AuditEval family of automatic MOS-style evaluators (SSL-based and LLM-based). 3) Used AuditEval to filter synthetic data for high-quality pseudo-parallel subset.

Result: Comprehensive experiments validate that expert-informed filtering yields higher-quality data, exposes limitations of traditional metrics, and demonstrates advantages of AuditEval framework.

Conclusion: AuditEval addresses critical gaps in audio editing evaluation by providing benchmark datasets, professional annotations, and effective automatic evaluators to advance the field.

Abstract: Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major challenge for both assessing audio editing quality and improving the task itself. In this work, we propose a novel approach for audio editing task by incorporating expert knowledge into both the evaluation and dataset construction processes: 1) First, we establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated from 7 representative audio editing frameworks and 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to editing intent, and Faithfulness to original features. 2) Based on this dataset, we systematically propose AuditEval, a family of automatic MOS-style evaluators tailored for audio editing, covering both SSL-based and LLM-based approaches. It addresses the lack of effective objective metrics and the prohibitive cost of subjective evaluation in this field. 3) We further leverage AuditEval to evaluate and filter a large amount of synthetically mixed editing pairs, mining a high-quality pseudo-parallel subset by selecting the most plausible samples. Comprehensive experiments validate that our expert-informed filtering strategy effectively yields higher-quality data, while also exposing the limitations of traditional objective metrics and the advantages of AuditEval. The dataset, codes and tools can be found at: https://github.com/NKU-HLT/AuditEval.

[934] Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening

Xiaolei Xu, Chaoyue Niu, Guy J. Brown, Hector Romero, Ning Ma

Main category: cs.SD

TL;DR: Estimating respiratory effort from nocturnal audio to improve obstructive sleep apnea detection using latent-space fusion of audio and estimated effort features.

Details

Motivation: Current OSA screening methods are limited by environmental noise and lack physiological context. Respiratory effort is clinically important but requires contact sensors, reducing scalability and comfort. Need for sensor-free, scalable OSA monitoring using only audio.

Method: Proposes estimating respiratory effort directly from nocturnal audio, then uses latent-space fusion framework to integrate estimated effort embeddings with acoustic features for OSA detection. Requires only smartphone audio at test time.

Result: Respiratory effort estimator achieves concordance correlation coefficient of 0.48 on 157 nights from 103 participants. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low AHI thresholds.

Conclusion: First study to estimate respiratory effort from audio alone, enabling physiological context recovery from sound. Approach enables sensor-free, scalable, longitudinal OSA monitoring using only smartphone audio.

Abstract: Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of over-night polysomnography. Acoustic-based screening provides a scalable alternative, yet performance is limited by environmental noise and the lack of physiological context. Respiratory effort is a key signal used in clinical scoring of OSA events, but current approaches require additional contact sensors that reduce scalability and patient comfort. This paper presents the first study to estimate respiratory effort directly from nocturnal audio, enabling physiological context to be recovered from sound alone. We propose a latent-space fusion framework that integrates the estimated effort embeddings with acoustic features for OSA detection. Using a dataset of 157 nights from 103 participants recorded in home environments, our respiratory effort estimator achieves a concordance correlation coefficient of 0.48, capturing meaningful respiratory dynamics. Fusing effort and audio improves sensitivity and AUC over audio-only baselines, especially at low apnoea-hypopnoea index thresholds. The proposed approach requires only smartphone audio at test time, which enables sensor-free, scalable, and longitudinal OSA monitoring.

[935] SupCLAP: Controlling Optimization Trajectory Drift in Audio-Text Contrastive Learning with Support Vector Regularization

Jiehui Luo, Yuguo Yin, Yuxin Xie, Jinghan Ru, Xianwei Zhuang, Minghua He, Aofan Liu, Zihan Xiong, Dongchao Yang

Main category: cs.SD

TL;DR: SVR is a contrastive learning regularization method that controls the perpendicular component of negative sample pushing forces to improve audio-text multimodal representation learning.

Details

Motivation: Contrastive language-audio pretraining is fundamental for multimodal applications, but the perpendicular component of negative sample pushing forces causes optimization drift and instability while containing valuable information.

Method: Proposes Support Vector Regularization (SVR) with an auxiliary support vector to control the perpendicular component. Explores two unsupervised strategies for semantic radius: direct parameterization and adaptive radius predictor with constraints.

Result: SVR outperforms baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets. Validated through theoretical analysis and trajectory drift experiments.

Conclusion: SVR effectively harnesses rich information from negative samples while mitigating optimization drift, is highly efficient without extra data or inference overhead, and improves audio-text multimodal representation learning.

Abstract: Contrastive language-audio pretraining, which aims to unify multimodal representations in a shared embedding space, serves as a cornerstone for building a wide range of applications, from cross-modal retrieval to cutting-edge multimodal large language models. However, we find that the perpendicular component of the pushing force from negative samples in contrastive learning is a double-edged sword: it contains rich supplementary information from negative samples, yet its unconstrained nature causes optimization trajectory drift and training instability. To address this, we propose Support Vector Regularization (SVR), a method that introduces an auxiliary support vector to control this perpendicular component, aiming to harness its rich information while mitigating the associated trajectory drift. The efficacy of SVR is critically governed by its semantic radius, for which we explore two unsupervised modeling strategies: direct parameterization and an adaptive radius predictor module enhanced with constraints to improve its predicting accuracy. Extensive experimental results demonstrate that our method surpasses widely used baselines like InfoNCE and SigLIP loss across classification, monolingual retrieval, and multilingual retrieval on standard audio-text datasets. Both the theoretical analysis and the experimental results on optimizing trajectory drift validate the correctness and effectiveness of our SVR method. Notably, our method is highly efficient, it operates without the need for extra training data or inference computation, and adds only a negligible overhead to the training.

[936] The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models

Katsuhiko Yamamoto, Koichi Miyazaki, Shogo Seki

Main category: cs.SD

TL;DR: AESCA system for AudioMOS Challenge 2025 uses KAN-based audio aesthetics prediction and VERSA toolkit ensemble for superior AES correlation scores

Details

Motivation: To develop an effective audio aesthetics score prediction system for the AudioMOS Challenge 2025 Track 2, addressing the need for accurate assessment of audio quality across multiple evaluation axes

Method: Combines Kolmogorov-Arnold Network (KAN)-based audiobox aesthetics predictor (replacing MLP layers with group-rational KAN) with VERSA toolkit regression using extreme gradient boosting; ensemble of four KAN models and one VERSA model for final AES prediction

Result: Achieved best correlations among submitted systems in three axes at utterance level, two axes at system level, and overall average; released inference models for KAN-based predictors

Conclusion: The AESCA system demonstrates effective audio aesthetics prediction through innovative KAN architecture and ensemble approach, providing state-of-the-art performance for audio quality assessment

Abstract: We propose an audio aesthetics score (AES) prediction system by CyberAgent (AESCA) for AudioMOS Challenge 2025 (AMC25) Track 2. The AESCA comprises a Kolmogorov–Arnold Network (KAN)-based audiobox aesthetics and a predictor from the metric scores using the VERSA toolkit. In the KAN-based predictor, we replaced each multi-layer perceptron layer in the baseline model with a group-rational KAN and trained the model with labeled and pseudo-labeled audio samples. The VERSA-based predictor was designed as a regression model using extreme gradient boosting, incorporating outputs from existing metrics. Both the KAN- and VERSA-based models predicted the AES, including the four evaluation axes. The final AES values were calculated using an ensemble model that combined four KAN-based models and a VERSA-based model. Our proposed T12 system yielded the best correlations among the submitted systems, in three axes at the utterance level, two axes at the system level, and the overall average. We also released the inference model of the proposed KAN-based predictor (KAN #1-#4).

[937] ConceptCaps: a Distilled Concept Dataset for Interpretability in Music Models

Bruno Sienkiewicz, Łukasz Neumann, Mateusz Modrzejewski

Main category: cs.SD

TL;DR: ConceptCaps: A dataset of 23k music-caption-audio triplets with explicit concept labels from a 200-attribute taxonomy, created using a pipeline that separates semantic modeling from text generation for improved coherence and controllability.

Details

Motivation: Existing music datasets lack clean, well-separated positive/negative examples needed for concept-based interpretability methods like TCAV. Current datasets have sparse, noisy, or ill-defined tags, making them unsuitable for concept analysis.

Method: Three-stage pipeline: 1) VAE learns plausible attribute co-occurrence patterns, 2) fine-tuned LLM converts attribute lists into professional descriptions, 3) MusicGen synthesizes corresponding audio. This separation improves over end-to-end approaches.

Result: Created ConceptCaps dataset with 23k triplets validated through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming concept probes recover musically meaningful patterns.

Conclusion: ConceptCaps provides a structured dataset for music concept analysis with explicit attribute labels, enabling better interpretability research and controllable generation in the audio domain.

Abstract: Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.

[938] Music Plagiarism Detection: Problem Formulation and a Segment-based Solution

Seonghyeon Go, Yumin Kim

Main category: cs.SD

TL;DR: Paper defines music plagiarism detection as a distinct MIR task, introduces Similar Music Pair dataset, and proposes segment transcription-based method for detection.

Details

Motivation: Music plagiarism is a pressing social issue, but research has been hampered by lack of clear task definition, slowing progress and limiting real-world applicability.

Method: Defines music plagiarism detection as distinct from other MIR tasks, introduces Similar Music Pair dataset, and proposes segment transcription-based detection method.

Result: Provides clear task definition, dataset, and baseline method for music plagiarism detection, with demo and dataset publicly available.

Conclusion: Establishes foundation for systematic research in music plagiarism detection by providing task definition, dataset, and initial methodology.

Abstract: Recently, the problem of music plagiarism has emerged as an even more pressing social issue. As music information retrieval research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To fix this situation, we defined how Music Plagiarism Detection is different from other MIR tasks and explained what problems need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at https://github.com/Mippia/ICASSP2026-MPD.

cs.LG

[939] OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models

Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler, Dirk Helbing

Main category: cs.LG

TL;DR: OGD4All is an LLM-based framework for transparent, auditable citizen interaction with geospatial Open Government Data, combining semantic retrieval, agentic reasoning, and secure execution to produce verifiable multimodal outputs with high correctness and recall while minimizing hallucinations.

Details

Motivation: To enhance citizens' interaction with geospatial Open Government Data (OGD) through a transparent, auditable, and reproducible framework that provides explainable, multimodal access to public data while advancing trustworthy AI for open governance.

Method: Combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs. Evaluated on 199-question benchmark covering factual and unanswerable questions across 430 City-of-Zurich datasets and 11 LLMs.

Result: Achieves 98% analytical correctness and 94% recall while reliably rejecting questions unsupported by available data, minimizing hallucination risks. Statistical robustness tests and expert feedback show reliability and social relevance.

Conclusion: The approach demonstrates how LLMs can provide explainable, multimodal access to public data, advancing trustworthy AI for open governance through transparent, auditable frameworks.

Abstract: We present OGD4All, a transparent, auditable, and reproducible framework based on Large Language Models (LLMs) to enhance citizens’ interaction with geospatial Open Government Data (OGD). The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs. Evaluated on a 199-question benchmark covering both factual and unanswerable questions, across 430 City-of-Zurich datasets and 11 LLMs, OGD4All reaches 98% analytical correctness and 94% recall while reliably rejecting questions unsupported by available data, which minimizes hallucination risks. Statistical robustness tests, as well as expert feedback, show reliability and social relevance. The proposed approach shows how LLMs can provide explainable, multimodal access to public data, advancing trustworthy AI for open governance.

[940] Measurement for Opaque Systems: Multi-source Triangulation with Interpretable Machine Learning

Margaret Foster

Main category: cs.LG

TL;DR: A framework for measuring difficult-to-access systems using indirect data traces, interpretable ML models, and theory-guided triangulation when conventional statistical inference is impossible.

Details

Motivation: Many high-stakes systems in science and policy are difficult or impossible to measure directly - dynamics are unobservable, data are indirect/fragmented, and ground truth is absent or concealed. Conventional statistical methods fail in these data regimes.

Method: Combines multi-source triangulation with interpretable machine learning models. Instead of validating against unattainable ideal data, seeks consistency across separate, partially informative models to draw defensible conclusions about system states.

Result: Demonstrated through empirical analysis of organizational growth and internal pressure dynamics in a clandestine militant organization, showing how triangulated, interpretable ML can recover substantively meaningful variation from incomplete, biased observational signals.

Conclusion: Provides a general framework for measurement in data regimes with structurally missing or adversarial data, enabling quantitative characterization when conventional statistical or causal inference is impossible due to data limitations.

Abstract: We propose a measurement framework for difficult-to-access contexts that uses indirect data traces, interpretable machine-learning models, and theory-guided triangulation to fill inaccessible measurement spaces. Many high-stakes systems of scientific and policy interest are difficult, if not impossible, to reach directly: dynamics of interest are unobservable, data are indirect and fragmented across sources, and ground truth is absent or concealed. In these settings, available data often do not support conventional strategies for analysis, such as statistical inference on a single authoritative data stream or model validation against labeled outcomes. To address this problem, we introduce a general framework for measurement in data regimes characterized by structurally missing or adversarial data. We propose combining multi-source triangulation with interpretable machine learning models. Rather than relying on accuracy against unobservable, unattainable ideal data, our framework seeks consistency across separate, partially informative models. This allows users to draw defensible conclusions about the state of the world based on cross-signal consistency or divergence from an expected state. Our framework provides an analytical workflow tailored to quantitative characterization in the absence of data sufficient for conventional statistical or causal inference. We demonstrate our approach and explicitly surface inferential limits through an empirical analysis of organizational growth and internal pressure dynamics in a clandestine militant organization, drawing on multiple observational signals that individually provide incomplete and biased views of the underlying process. The results show how triangulated, interpretable ML can recover substantively meaningful variation.

[941] ELLMPEG: An Edge-based Agentic LLM Video Processing Tool

Zoha Azimi, Reza Farahani, Radu Prodan, Christian Timmerer

Main category: cs.LG

TL;DR: ELLMPEG is an edge-enabled agentic LLM framework for automated generation of video-processing commands using FFmpeg and VVenC, eliminating cloud dependency through local execution and verification.

Details

Motivation: Cloud-based LLMs face limitations in computational demands, privacy risks, and API costs. The paper aims to enable local video processing at the edge using agentic AI with structured reasoning and tool use.

Method: Integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at edge devices.

Result: Qwen2.5 with ELLMPEG achieves 78% average command-generation accuracy with zero recurring API costs, outperforming other open-source models on a 480-query dataset covering FFmpeg and VVC encoder commands.

Conclusion: ELLMPEG demonstrates effective edge-based video processing command generation using agentic LLMs, addressing cloud limitations while maintaining practical applicability through local verification.

Abstract: Large language models (LLMs), the foundation of generative AI systems like ChatGPT, are transforming many fields and applications, including multimedia, enabling more advanced content generation, analysis, and interaction. However, cloud-based LLM deployments face three key limitations: high computational and energy demands, privacy and reliability risks from remote processing, and recurring API costs. Recent advances in agentic AI, especially in structured reasoning and tool use, offer a better way to exploit open and locally deployed tools and LLMs. This paper presents ELLMPEG, an edge-enabled agentic LLM framework for the automated generation of video-processing commands. ELLMPEG integrates tool-aware Retrieval-Augmented Generation (RAG) with iterative self-reflection to produce and locally verify executable FFmpeg and VVenC commands directly at the edge, eliminating reliance on external cloud APIs. To evaluate ELLMPEG, we collect a dedicated prompt dataset comprising 480 diverse queries covering different categories of FFmpeg and the Versatile Video Codec (VVC) encoder (VVenC) commands. We validate command generation accuracy and evaluate four open-source LLMs based on command validity, tokens generated per second, inference time, and energy efficiency. We also execute the generated commands to assess their runtime correctness and practical applicability. Experimental results show that Qwen2.5, when augmented with the ELLMPEG framework, achieves an average command-generation accuracy of 78 % with zero recurring API cost, outperforming all other open-source models across both the FFmpeg and VVenC datasets.

[942] Representation Learning Enhanced Deep Reinforcement Learning for Optimal Operation of Hydrogen-based Multi-Energy Systems

Zhenyu Pu, Yu Yang, Lun Yang, Qing-Shan Jia, Xiaohong Guan, Costas J. Spanos

Main category: cs.LG

TL;DR: A deep reinforcement learning framework with representation learning for optimizing hydrogen-based multi-energy systems, addressing nonlinear dynamics and uncertainties in hydrogen energy storage.

Details

Motivation: Hydrogen-based multi-energy systems (HMES) offer low-carbon energy solutions but face challenges in optimal operation due to nonlinear dynamics of hydrogen storage systems and multiple uncertainties from supply/demand.

Method: Developed a comprehensive operational model for HMES capturing nonlinear dynamics, and proposed enhanced deep reinforcement learning framework integrating representation learning techniques for accelerated policy optimization.

Result: The comprehensive model ensures safe/reliable HESS operation, and the SR-DRL approach shows superior convergence and performance over conventional DRL in reducing operation costs and handling constraints.

Conclusion: Representation learning in DRL reorganizes state space into structured geometric representations, smoothing the learning process for complex networked energy systems.

Abstract: Hydrogen-based multi-energy systems (HMES) have emerged as a promising low-carbon and energy-efficient solution, as it can enable the coordinated operation of electricity, heating and cooling supply and demand to enhance operational flexibility, improve overall energy efficiency, and increase the share of renewable integration. However, the optimal operation of HMES remains challenging due to the nonlinear and multi-physics coupled dynamics of hydrogen energy storage systems (HESS) (consisting of electrolyters, fuel cells and hydrogen tanks) as well as the presence of multiple uncertainties from supply and demand. To address these challenges, this paper develops a comprehensive operational model for HMES that fully captures the nonlinear dynamics and multi-physics process of HESS. Moreover, we propose an enhanced deep reinforcement learning (DRL) framework by integrating the emerging representation learning techniques, enabling substantially accelerated and improved policy optimization for spatially and temporally coupled complex networked systems, which is not provided by conventional DRL. Experimental studies based on real-world datasets show that the comprehensive model is crucial to ensure the safe and reliable of HESS. In addition, the proposed SR-DRL approaches demonstrate superior convergence rate and performance over conventional DRL counterparts in terms of reducing the operation cost of HMES and handling the system operating constraints. Finally, we provide some insights into the role of representation learning in DRL, speculating that it can reorganize the original state space into a well-structured and cluster-aware geometric representation, thereby smoothing and facilitating the learning process of DRL.

[943] RAPTOR-AI for Disaster OODA Loop: Hierarchical Multimodal RAG with Experience-Driven Agentic Decision-Making

Takato Yasuno

Main category: cs.LG

TL;DR: Agentic RAG framework for disaster response with multimodal knowledge base and adaptive retrieval strategies

Details

Motivation: Humanitarian assistance requires rapid situational understanding and generalization across diverse disaster contexts, needing robust multimodal grounding for effective response across rescue, recovery, and reconstruction phases.

Method: Hierarchical knowledge base integrating textual manuals, historical lessons, and imagery; uses BLIP-based image captioning, ColVBERT embeddings, and long-context summarization; agentic controller with entropy-aware scene abstraction for dynamic retrieval strategy selection; LoRA-based post-training for experiential knowledge injection.

Result: Improved situational grounding, enhanced task decomposition accuracy, and superior usability for emergency operations on real disaster datasets.

Conclusion: The agentic RAG framework with multimodal grounding and adaptive reasoning capabilities effectively supports disaster response across all phases, demonstrating practical utility for emergency operations.

Abstract: Effective humanitarian assistance and disaster relief (HADR) requires rapid situational understanding, reliable decision support, and the ability to generalize across diverse and previously unseen disaster contexts. This work introduces an agentic Retrieval-Augmented Generation (RAG) framework designed to support the three canonical phases of disaster response: initial rescue, mid-term recovery, and long-term reconstruction. To achieve robust multimodal grounding, we construct a hierarchical knowledge base that integrates textual disaster manuals, historical lessons (e.g., the 2011 Tohoku earthquake), and both aerial and ground-level imagery. Our system builds on the open-source multimodal implementation, which processes 46 tsunami-related PDFs (2,378 pages) using BLIP-based image captioning, ColVBERT embeddings, and long-context summarization to generate an efficient, structured multimodal retrieval tree optimized for disaster knowledge preservation. An agentic controller dynamically selects retrieval strategies (e.g., RAPTOR, ColBERT) through entropy-aware scene abstraction, enabling adaptive reasoning across heterogeneous inputs. Additionally, a lightweight LoRA-based post-training method injects experiential knowledge from past disasters, enhancing the models’ capacity to support both expert and non-expert responders. Experiments on real disaster datasets demonstrate improved situational grounding, enhanced task decomposition accuracy, and superior usability for emergency operations. Incorporating recent advances in long-context RAG systems, agentic information retrieval, and contemporary emergency response AI, our system achieves substantial gains through adaptive retrieval-augmented generation with self-reasoning and multimodal chain-of-thought capabilities.

[944] Enhancing few-shot time series forecasting with LLM-guided diffusion

Haonan Shi, Dehua Shuai, Liming Wang, Xiyang Liu, Long Tian

Main category: cs.LG

TL;DR: LTSM-DIFF integrates large language models with diffusion models for few-shot time series forecasting, achieving SOTA performance by transferring language knowledge to temporal tasks.

Details

Motivation: Time series forecasting in specialized domains often suffers from limited data availability, making conventional models ineffective. The paper aims to address this few-shot challenge by leveraging the expressive power of LLMs and generative capabilities of diffusion models.

Method: Proposes LTSM-DIFF framework with two components: 1) LTSM module (fine-tuned large language model) as temporal memory mechanism to extract sequential representations, and 2) diffusion process using these representations as conditional guidance for refined temporal pattern modeling.

Result: Extensive experiments show LTSM-DIFF achieves state-of-the-art performance in data-rich scenarios and delivers significant improvements in few-shot forecasting across diverse benchmarks.

Conclusion: The work establishes a new paradigm for time series analysis under data scarcity by effectively transferring knowledge from language domain to time series tasks, enhancing both generalization and robustness.

Abstract: Time series forecasting in specialized domains is often constrained by limited data availability, where conventional models typically require large-scale datasets to effectively capture underlying temporal dynamics. To tackle this few-shot challenge, we propose LTSM-DIFF (Large-scale Temporal Sequential Memory with Diffusion), a novel learning framework that integrates the expressive power of large language models with the generative capability of diffusion models. Specifically, the LTSM module is fine-tuned and employed as a temporal memory mechanism, extracting rich sequential representations even under data-scarce conditions. These representations are then utilized as conditional guidance for a joint probability diffusion process, enabling refined modeling of complex temporal patterns. This design allows knowledge transfer from the language domain to time series tasks, substantially enhancing both generalization and robustness. Extensive experiments across diverse benchmarks demonstrate that LTSM-DIFF consistently achieves state-of-the-art performance in data-rich scenarios, while also delivering significant improvements in few-shot forecasting. Our work establishes a new paradigm for time series analysis under data scarcity.

[945] Extending Beacon to Hindi: Cultural Adaptation Drives Cross-Lingual Sycophancy

Sarthak Sattigeri

Main category: cs.LG

TL;DR: Study extends sycophancy evaluation to Hindi, finding culturally adapted prompts increase sycophancy rates compared to English, with cultural adaptation being the primary driver rather than language encoding.

Details

Motivation: To investigate whether sycophancy diagnostics generalize across languages and cultural contexts, as current evaluations are primarily English-based and may not capture cross-lingual alignment failures.

Method: Extended Beacon sycophancy diagnostic to Hindi using three conditions: English original, Hindi literal translation, and Hindi culturally adapted prompts. Evaluated four open-weight instruction-tuned models on 50 prompts per condition to separate language encoding from cultural adaptation effects.

Result: Sycophancy rates consistently higher for culturally adapted Hindi prompts than English (12.0-16.0 percentage points difference). Cultural adaptation accounts for majority of gap (14.0% delta), while language encoding contributes minimally (2.0% delta). Advice prompts show largest cross-lingual differences (20-25 percentage points).

Conclusion: Alignment behaviors measured in English don’t transfer uniformly across languages; culturally grounded prompt framing plays substantial role in sycophancy rates, highlighting need for multilingual and culturally-aware evaluation.

Abstract: Sycophancy, the tendency of language models to prioritize agreement with user preferences over principled reasoning, has been identified as a persistent alignment failure in English-language evaluations. However, it remains unclear whether such diagnostics generalize across languages and cultural contexts. We extend the Beacon single-turn forced-choice sycophancy diagnostic to Hindi through a controlled three-condition design: English original, Hindi literal translation, and Hindi culturally adapted prompts. We evaluate four open-weight instruction-tuned models on 50 prompts per condition, enabling separation of language encoding effects from cultural adaptation effects. Across all models, sycophancy rates are consistently higher for culturally adapted Hindi prompts than for English, with absolute differences ranging from 12.0 to 16.0 percentage points. A decomposition on Qwen 2.5-Coder-7B shows that cultural adaptation (delta = 14.0%, 95% CI: [4.0%, 26.0%]) accounts for the majority of this gap, while language encoding contributes minimally (delta = 2.0%, 95% CI: [0.0%, 6.0%]). Category-level analysis reveals that advice prompts exhibit the largest cross-lingual differences (20-25 percentage points), achieving statistical significance in two of four models. These findings indicate that alignment behaviors measured in English may not transfer uniformly across languages and that culturally grounded prompt framing plays a substantial role. We release all datasets and evaluation code to support replication and extension.

[946] Lightweight Edge Learning via Dataset Pruning

Laha Ale, Hu Luo, Mingsheng Cao, Shichao Li, Huanlai Xing, Haifeng Sun

Main category: cs.LG

TL;DR: Dataset pruning framework for efficient edge learning that reduces training overhead by selecting only the most informative data samples, achieving near-linear reduction in latency and energy with minimal accuracy loss.

Details

Motivation: Edge learning faces challenges with computational and energy overhead for on-device training on battery-powered mobile systems with strict thermal and memory constraints. While model architectures have been optimized for inference, training remains bottlenecked by processing massive, often redundant local datasets.

Method: Proposes a data-centric optimization framework using dataset pruning. Uses average loss statistics from a truncated warm-up phase to rank sample importance, deterministically retaining only the most critical data points under a dynamic pruning ratio. The approach is model-agnostic and operates locally without inter-device communication.

Result: Extensive experiments on standard image classification benchmarks show near-linear reduction in training latency and energy consumption proportional to pruning ratio, with negligible degradation in model accuracy.

Conclusion: Dataset pruning is a vital, complementary paradigm for enhancing sustainability and scalability of learning on resource-constrained mobile edge devices, offering efficient training without significant accuracy trade-offs.

Abstract: Edge learning facilitates ubiquitous intelligence by enabling model training and adaptation directly on data-generating devices, thereby mitigating privacy risks and communication latency. However, the high computational and energy overhead of on-device training hinders its deployment on battery-powered mobile systems with strict thermal and memory budgets. While prior research has extensively optimized model architectures for efficient inference, the training phase remains bottlenecked by the processing of massive, often redundant, local datasets. In this work, we propose a data-centric optimization framework that leverages dataset pruning to achieve resource-efficient edge learning. Unlike standard methods that process all available data, our approach constructs compact, highly informative training subsets via a lightweight, on-device importance evaluation. Specifically, we utilize average loss statistics derived from a truncated warm-up phase to rank sample importance, deterministically retaining only the most critical data points under a dynamic pruning ratio. This mechanism is model-agnostic and operates locally without inter-device communication. Extensive experiments on standard image classification benchmarks demonstrate that our framework achieves a near-linear reduction in training latency and energy consumption proportional to the pruning ratio, with negligible degradation in model accuracy. These results validate dataset pruning as a vital, complementary paradigm for enhancing the sustainability and scalability of learning on resource-constrained mobile edge devices.

[947] Distributional Reinforcement Learning for Condition-Based Maintenance of Multi-Pump Equipment

Takato Yasuno

Main category: cs.LG

TL;DR: A distributional reinforcement learning approach using Quantile Regression Deep Q-Networks (QR-DQN) with aging factor integration for multi-equipment Condition-Based Maintenance optimization.

Details

Motivation: Traditional time-based maintenance schedules lead to unnecessary costs and unexpected equipment failures. There's a need for proactive maintenance strategies using real-time condition data to optimize maintenance timing and resource allocation in industrial systems.

Method: Proposes a novel distributional reinforcement learning approach using Quantile Regression Deep Q-Networks (QR-DQN) with aging factor integration for multi-equipment CBM. The methodology involves concurrent administration of multiple pump units through three strategic scenarios: safety-first, balanced, and cost-efficient approaches.

Result: Experimental validation over 3,000 training episodes shows significant performance improvements across all strategies. The Safety-First strategy demonstrates superior cost efficiency with ROI of 3.91 (152% better performance than alternatives while requiring only 31% higher investment). The system exhibits 95.66% operational stability.

Conclusion: The proposed QR-DQN with aging factor integration provides an effective solution for multi-equipment CBM optimization, demonstrating immediate applicability to industrial environments with high operational stability.

Abstract: Condition-Based Maintenance (CBM) signifies a paradigm shift from reactive to proactive equipment management strategies in modern industrial systems. Conventional time-based maintenance schedules frequently engender superfluous expenditures and unanticipated equipment failures. In contrast, CBM utilizes real-time equipment condition data to enhance maintenance timing and optimize resource allocation. The present paper proposes a novel distributional reinforcement learning approach for multi-equipment CBM using Quantile Regression Deep Q-Networks (QR-DQN) with aging factor integration. The methodology employed in this study encompasses the concurrent administration of multiple pump units through three strategic scenarios. The implementation of safety-first, balanced, and cost-efficient approaches is imperative. Comprehensive experimental validation over 3,000 training episodes demonstrates significant performance improvements across all strategies. The Safety-First strategy demonstrates superior cost efficiency, with a return on investment (ROI) of 3.91, yielding 152% better performance than alternatives while requiring only 31% higher investment. The system exhibits 95.66% operational stability and immediate applicability to industrial environments.

[948] VoxServe: Streaming-Centric Serving System for Speech Language Models

Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

Main category: cs.LG

TL;DR: VoxServe is a unified serving system for Speech Language Models that optimizes streaming performance through model-execution abstraction, streaming-aware scheduling, and asynchronous inference pipelines.

Details

Motivation: Existing systems for deploying Speech Language Models in streaming settings lack flexibility, efficiency, and proper support for diverse model architectures while requiring low latency, high throughput, and streamability guarantees.

Method: VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, enabling support for diverse SpeechLM architectures. It implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency.

Result: Evaluations show VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability across multiple modern SpeechLMs.

Conclusion: VoxServe provides an efficient, flexible serving system for SpeechLMs in streaming settings, significantly outperforming existing solutions while supporting diverse model architectures.

Abstract: Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streaming viability. The code of VoxServe is available at https://github.com/vox-serve/vox-serve.

[949] TextBFGS: Quasi-Newton Optimization for Discrete Executable Text via Gradient-Operator Retrieval

Zizheng Zhang, Yuyang Liao, Chen Chen, Jian He, Dun Wu, Qianjin Yu, Yanqin Gao, Jin Yang, Kailai Zhang, Eng Siong Chng, Xionghu Zhong

Main category: cs.LG

TL;DR: TextBFGS introduces a second-order optimization framework for discrete text (prompts/code) that approximates inverse Hessian using gradient-operator retrieval from successful trajectories, enabling one-pass updates with better convergence than first-order methods.

Details

Motivation: Existing gradient-based text optimization methods are first-order (like SGD) and suffer from slow convergence and instability due to ignoring semantic curvature of the optimization landscape. There's a need for second-order methods that can better navigate the discrete text space.

Method: TextBFGS implements a Quasi-Newton optimization method for discrete text. Instead of traditional memory retrieval of similar instances, it retrieves Gradient-Operators from a memory of pre-learned successful trajectories. Given textual gradient feedback, it identifies historical correction patterns and applies these abstract operators to current variables, enabling a One-Pass Update that combines feedback generation and second-order correction in a single inference step.

Result: Empirical evaluations on code optimization tasks (HumanEval, MBPP) show TextBFGS significantly outperforms first-order baselines. It achieves superior pass rates with fewer model calls and exhibits strong cross-task transferability.

Conclusion: TextBFGS establishes a mathematically grounded paradigm for efficient, memory-aware text optimization, demonstrating the value of second-order methods for discrete text optimization problems.

Abstract: Optimizing discrete executable text such as prompts and code has recently been framed as a gradient-based process, effectively translating backpropagation concepts to the semantic space. However, existing methods predominantly operate as first-order optimizers akin to Stochastic Gradient Descent, which are suffering from slow convergence and instability because they neglect the semantic curvature of the optimization landscape. To bridge this gap, we introduce TextBFGS, a second-order framework to implement a Quasi-Newton optimization method for discrete text. Unlike traditional memory-based approaches that retrieve similar textual instances, TextBFGS approximates the inverse Hessian matrix by retrieving Gradient-Operators from the memory of pre-learned successful trajectories. Specifically, given a textual gradient feedback, TextBFGS identifies historical correction patterns from the optimization knowledge base and tries to apply these abstract operators to the current variable. This mechanism enables a One-Pass Update, combining feedback generation and second-order correction into a single inference step. Empirical evaluations on code optimization across diverse domains (e.g., HumanEval, MBPP) demonstrate that TextBFGS significantly outperforms first-order baselines. It achieves superior pass rates with fewer model calls and exhibits strong cross-task transferability, thus establishes a mathematically grounded paradigm for efficient, memory-aware text optimization.

[950] SCPL: Enhancing Neural Network Training Throughput with Decoupled Local Losses and Model Parallelism

Ming-Yao Ho, Cheng-Kai Wang, You-Teng Lin, Hung-Hsuan Chen

Main category: cs.LG

TL;DR: SCPL is a new training methodology that decouples backpropagation to enable parallel gradient computation across layers, improving training efficiency for large AI models.

Details

Motivation: High training costs and long development cycles hinder enterprise adoption of large AI models. Standard backpropagation creates inefficiencies in training deep networks due to sequential gradient flow.

Method: SCPL transforms long gradient flows into multiple short ones by decoupling backpropagation, enabling simultaneous computation of parameter gradients in different layers for superior model parallelism.

Result: SCPL demonstrates improved training throughput and efficiency compared to standard backpropagation, Early Exit, GPipe, and Associated Learning (a state-of-the-art decoupling method).

Conclusion: SCPL provides a practical pathway for organizations to develop and deploy advanced AI systems more cost-effectively by mitigating fundamental performance bottlenecks in training.

Abstract: Adopting large-scale AI models in enterprise information systems is often hindered by high training costs and long development cycles, posing a significant managerial challenge. The standard end-to-end backpropagation (BP) algorithm is a primary driver of modern AI, but it is also the source of inefficiency in training deep networks. This paper introduces a new training methodology, Supervised Contrastive Parallel Learning (SCPL), that addresses this issue by decoupling BP and transforming a long gradient flow into multiple short ones. This design enables the simultaneous computation of parameter gradients in different layers, achieving superior model parallelism and enhancing training throughput. Detailed experiments are presented to demonstrate the efficiency and effectiveness of our model compared to BP, Early Exit, GPipe, and Associated Learning (AL), a state-of-the-art method for decoupling backpropagation. By mitigating a fundamental performance bottleneck, SCPL provides a practical pathway for organizations to develop and deploy advanced information systems more cost-effectively and with greater agility. The experimental code is released for reproducibility. https://github.com/minyaho/scpl/

[951] The Impact of Machine Learning Uncertainty on the Robustness of Counterfactual Explanations

Leonidas Christodoulou, Chang Sun

Main category: cs.LG

TL;DR: Counterfactual explanations are unstable under model and data uncertainty, with small accuracy reductions causing large variations in generated counterfactuals.

Details

Motivation: Most counterfactual explanation methods haven't been tested under changing model and data uncertainty, potentially producing unstable or invalid explanations in real-world scenarios with variability.

Method: Investigates robustness of common ML model and counterfactual algorithm combinations under aleatoric and epistemic uncertainty through experiments on synthetic and real-world tabular datasets.

Result: Counterfactual explanations are highly sensitive to model uncertainty - even small reductions in model accuracy (from noise or limited data) cause large variations in generated counterfactuals both on average and for individual instances.

Conclusion: Findings highlight the need for uncertainty-aware explanation methods, especially in domains like finance and social sciences where reliable explanations are critical.

Abstract: Counterfactual explanations are widely used to interpret machine learning predictions by identifying minimal changes to input features that would alter a model’s decision. However, most existing counterfactual methods have not been tested when model and data uncertainty change, resulting in explanations that may be unstable or invalid under real-world variability. In this work, we investigate the robustness of common combinations of machine learning models and counterfactual generation algorithms in the presence of both aleatoric and epistemic uncertainty. Through experiments on synthetic and real-world tabular datasets, we show that counterfactual explanations are highly sensitive to model uncertainty. In particular, we find that even small reductions in model accuracy - caused by increased noise or limited data - can lead to large variations in the generated counterfactuals on average and on individual instances. These findings underscore the need for uncertainty-aware explanation methods in domains such as finance and the social sciences.

[952] SPGCL: Effective Graph Contrastive Learning via SVD-Guided Structural Perturbation

Hao Deng, Yingping Li, Shuiping Gou, Bo Liu

Main category: cs.LG

TL;DR: SPGCL is a robust graph contrastive learning framework that uses SVD-guided structural perturbation to generate diverse yet semantically meaningful graph views, improving GNN robustness against structural noise.

Details

Motivation: GNNs are sensitive to structural noise (spurious/missing edges). Existing graph contrastive learning methods have limitations: random perturbations are structure-agnostic and may remove critical edges, while SVD-based views become dense and lack diversity.

Method: SPGCL combines lightweight stochastic edge removal with SVD-guided refinement that recovers mistakenly removed informative edges and introduces semantically meaningful missing links. It uses sparse top-ranked edge selection and merging to avoid graph densification, and includes a contrastive fusion module with global similarity constraint.

Result: Extensive experiments on ten benchmark datasets show SPGCL consistently improves robustness and accuracy of base GNNs, outperforming state-of-the-art graph contrastive learning and structure learning methods.

Conclusion: SPGCL effectively bridges the gap between random perturbations and spectral augmentations in graph contrastive learning, providing a balanced approach that generates diverse yet semantically meaningful views for improved GNN robustness.

Abstract: Graph Neural Networks (GNNs) can be highly sensitive to structural noise, including spurious or missing edges caused by adversarial attacks or non-adversarial imperfections. Existing graph contrastive learning methods typically rely on either random perturbations (e.g., edge dropping) to generate diverse views or purely spectral augmentations (e.g., SVD) to preserve global structural priors. However, random perturbations are structure-agnostic and may remove critical edges, while SVD-based views often become dense and lack sufficient diversity. To bridge this gap, we propose SPGCL, a robust graph contrastive learning framework via SVD-guided structural perturbation. SPGCL couples lightweight stochastic edge removal with an SVD-guided refinement step that can recover mistakenly removed informative edges and introduce semantically meaningful missing links while avoiding graph densification through sparse top-ranked edge selection and merging. By balancing edge removal and recovery rates, SPGCL explicitly controls structural discrepancy between views so that contrastive signals reflect semantic structural differences rather than edge-count gaps. We further incorporate a contrastive fusion module regularized by a global similarity constraint to better align the two views. Extensive experiments on ten benchmark datasets demonstrate that SPGCL consistently improves robustness and accuracy of base GNNs, outperforming state-of-the-art graph contrastive learning and structure learning methods.

[953] Modality as Heterogeneity: Node Splitting and Graph Rewiring for Multimodal Graph Learning

Yihan Zhang, Ercan E. Kuruoglu

Main category: cs.LG

TL;DR: NSG-MoE is a multimodal graph learning framework that addresses modality confusion by splitting nodes into modality-specific components and using relation-aware experts in a Mixture-of-Experts architecture.

Details

Motivation: Multimodal graphs offer rich representational power but suffer from severe modality confusion where different modalities get mixed, reducing their effectiveness. General-purpose GNNs struggle with this issue.

Method: Proposes NSG-MoE with node-splitting and graph-rewiring mechanism combined with structured Mixture-of-Experts architecture. Each node is decomposed into modality-specific components, and relation-aware experts process heterogeneous message flows to preserve structural information and multimodal semantics.

Result: Extensive experiments on three multimodal benchmarks show NSG-MoE consistently surpasses strong baselines. Achieves competitive training efficiency despite using MoE architecture. Spectral analysis reveals adaptive filtering over modality-specific subspaces, and information-theoretic analysis shows reduced mutual information between data and parameters improving generalization.

Conclusion: NSG-MoE effectively addresses modality confusion in multimodal graphs through explicit modality decomposition and specialized expert processing, achieving superior performance with theoretical justification for its disentangling behavior and generalization benefits.

Abstract: Multimodal graphs are gaining increasing attention due to their rich representational power and wide applicability, yet they introduce substantial challenges arising from severe modality confusion. To address this issue, we propose NSG (Node Splitting Graph)-MoE, a multimodal graph learning framework that integrates a node-splitting and graph-rewiring mechanism with a structured Mixture-of-Experts (MoE) architecture. It explicitly decomposes each node into modality-specific components and assigns relation-aware experts to process heterogeneous message flows, thereby preserving structural information and multimodal semantics while mitigating the undesirable mixing effects commonly observed in general-purpose GNNs. Extensive experiments on three multimodal benchmarks demonstrate that NSG-MoE consistently surpasses strong baselines. Despite incorporating MoE – which is typically computationally heavy – our method achieves competitive training efficiency. Beyond empirical results, we provide a spectral analysis revealing that NSG performs adaptive filtering over modality-specific subspaces, thus explaining its disentangling behavior. Furthermore, an information-theoretic analysis shows that the architectural constraints imposed by NSG reduces mutual information between data and parameters and improving generalization capability.

[954] Generative AI-enhanced Probabilistic Multi-Fidelity Surrogate Modeling Via Transfer Learning

Jice Zeng, David Barajas-Solano, Hui Chen

Main category: cs.LG

TL;DR: A probabilistic multi-fidelity surrogate framework using generative transfer learning with normalizing flows, combining abundant low-fidelity data with scarce high-fidelity data for engineering applications.

Details

Motivation: High-fidelity data is scarce and computationally expensive, while low-fidelity data is abundant but less accurate. Need to address data scarcity for machine learning surrogates in engineering systems.

Method: Two-phase training of normalizing flow generative model: pretrain on large low-fidelity dataset, then fine-tune on small high-fidelity dataset. Uses surjective layers with coupling blocks for dimension reduction while maintaining exact likelihood training.

Result: The framework provides fast probabilistic predictions with quantified uncertainty, significantly outperforms low-fidelity-only baselines while using fewer high-fidelity evaluations. Validated on reinforced concrete slab benchmark.

Conclusion: Proposes a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems by combining multi-fidelity data through generative transfer learning.

Abstract: The performance of machine learning surrogates is critically dependent on data quality and quantity. This presents a major challenge, as high-fidelity (HF) data is often scarce and computationally expensive to acquire, while low-fidelity (LF) data is abundant but less accurate. To address this data scarcity problem, we develop a probabilistic multi-fidelity surrogate framework based on generative transfer learning. We employ a normalizing flow (NF) generative model as the backbone, which is trained in two phases: (i) the NF is first pretrained on a large LF dataset to learn a probabilistic forward model; (ii) the pretrained model is then fine-tuned on a small HF dataset, allowing it to correct for LF-HF discrepancies via knowledge transfer. To relax the dimension-preserving constraint of standard bijective NFs, we integrate surjective (dimension-reducing) layers with standard coupling blocks. This architecture enables learned dimension reduction while preserving the ability to train with exact likelihoods. The resulting surrogate provides fast probabilistic predictions with quantified uncertainty and significantly outperforms LF-only baselines while using fewer HF evaluations. We validate the approach on a reinforced concrete slab benchmark, combining many coarse-mesh (LF) simulations with a limited set of fine-mesh (HF) simulations. The proposed model achieves probabilistic predictions with HF accuracy, demonstrating a practical path toward data-efficient, generative AI-driven surrogates for complex engineering systems.

[955] From Perception to Action: Spatial AI Agents and World Models

Gloria Felicia, Nolan Bryant, Handi Putra, Ayaan Gazali, Eliel Lobo, Esteban Rojas

Main category: cs.LG

TL;DR: A comprehensive survey paper introducing a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales, bridging the gap between agentic reasoning and spatial intelligence for embodied agents.

Details

Motivation: While LLMs excel at symbolic reasoning, they lack spatial intelligence needed for physical world interaction. Existing research treats agentic architectures and spatial domains separately, lacking a unified framework connecting these complementary capabilities for embodied agents.

Method: Conducted thorough review of over 2,000 papers, citing 742 works from top-tier venues. Introduced a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales, distinguishing spatial grounding (metric understanding) from symbolic grounding (image-text association).

Result: Three key findings: (1) hierarchical memory systems are important for long-horizon spatial tasks, (2) GNN-LLM integration is promising for structured spatial reasoning, (3) world models are essential for safe deployment across spatial scales. Identified six grand challenges for future research.

Conclusion: The taxonomy provides foundation for unifying fragmented research efforts and enabling spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence. Emphasizes need for unified evaluation frameworks for cross-domain assessment.

Abstract: While large language models have become the prevailing approach for agentic reasoning and planning, their success in symbolic domains does not readily translate to the physical world. Spatial intelligence, the ability to perceive 3D structure, reason about object relationships, and act under physical constraints, is an orthogonal capability that proves important for embodied agents. Existing surveys address either agentic architectures or spatial domains in isolation. None provide a unified framework connecting these complementary capabilities. This paper bridges that gap. Through a thorough review of over 2,000 papers, citing 742 works from top-tier venues, we introduce a unified three-axis taxonomy connecting agentic capabilities with spatial tasks across scales. Crucially, we distinguish spatial grounding (metric understanding of geometry and physics) from symbolic grounding (associating images with text), arguing that perception alone does not confer agency. Our analysis reveals three key findings mapped to these axes: (1) hierarchical memory systems (Capability axis) are important for long-horizon spatial tasks. (2) GNN-LLM integration (Task axis) is a promising approach for structured spatial reasoning. (3) World models (Scale axis) are essential for safe deployment across micro-to-macro spatial scales. We conclude by identifying six grand challenges and outlining directions for future research, including the need for unified evaluation frameworks to standardize cross-domain assessment. This taxonomy provides a foundation for unifying fragmented research efforts and enabling the next generation of spatially-aware autonomous systems in robotics, autonomous vehicles, and geospatial intelligence.

[956] Dimensional Peeking for Low-Variance Gradients in Zeroth-Order Discrete Optimization via Simulation

Philipp Andelfinger, Wentong Cai

Main category: cs.LG

TL;DR: Dimensional peeking is a variance reduction method for gradient estimation in discrete optimization via simulation that lifts sampling granularity from scalar values to classes of values following the same control flow path.

Details

Motivation: Stochastic gradient estimators in discrete optimization via simulation introduce variance through perturbation-based sampling, leading to slow convergence. There's a need for variance reduction methods to improve optimization efficiency.

Method: Dimensional peeking increases information per simulation evaluation by lifting sampling granularity from scalar values to classes of values that follow the same control flow path. Derived from an established smoothed gradient estimator, it introduces no bias. Implemented via custom numerical data type to transparently carry out dimensional peeking over C++ programs.

Result: Variance reductions by factors of up to 7.9 observed for three simulation-based optimization problems with high-dimensional input. Optimization progress compared to three meta-heuristics shows dimensional peeking increases competitiveness of zeroth-order optimization for discrete and non-convex simulations.

Conclusion: Dimensional peeking effectively reduces variance in gradient estimation for discrete optimization via simulation, making zeroth-order optimization more competitive for discrete and non-convex simulation problems.

Abstract: Gradient-based optimization methods are commonly used to identify local optima in high-dimensional spaces. When derivatives cannot be evaluated directly, stochastic estimators can provide approximate gradients. However, these estimators’ perturbation-based sampling of the objective function introduces variance that can lead to slow convergence. In this paper, we present dimensional peeking, a variance reduction method for gradient estimation in discrete optimization via simulation. By lifting the sampling granularity from scalar values to classes of values that follow the same control flow path, we increase the information gathered per simulation evaluation. Our derivation from an established smoothed gradient estimator shows that the method does not introduce any bias. We present an implementation via a custom numerical data type to transparently carry out dimensional peeking over C++ programs. Variance reductions by factors of up to 7.9 are observed for three simulation-based optimization problems with high-dimensional input. The optimization progress compared to three meta-heuristics shows that dimensional peeking increases the competitiveness of zeroth-order optimization for discrete and non-convex simulations.

[957] Automated univariate time series forecasting with regression trees

Francisco Martínez, María P. Frías

Main category: cs.LG

TL;DR: Automated univariate time series forecasting using regression trees and ensembles (bagging, random forests) with autoregressive features, handling trends and seasonality.

Details

Motivation: To develop an automated forecasting methodology using machine learning approaches (regression trees and ensembles) that can handle various time series characteristics like trends and seasonality, providing an alternative to traditional statistical models.

Method: Uses regression trees and ensemble methods (bagging and random forests) with autoregressive features. Addresses feature selection for autoregressive lags, handles trending series through differencing or detrending, and manages seasonal behavior using seasonal decomposition or dummy variables.

Result: Experimental results show forecast accuracy comparable to established statistical models like exponential smoothing and ARIMA. The methodology is implemented in publicly available software.

Conclusion: Regression trees and their ensembles provide a viable alternative to traditional time series forecasting methods, offering comparable accuracy with automated feature handling for trends and seasonality.

Abstract: This paper describes a methodology for automated univariate time series forecasting using regression trees and their ensembles: bagging and random forests. The key aspects that are addressed are: the use of an autoregressive approach and recursive forecasts, how to select the autoregressive features, how to deal with trending series and how to cope with seasonal behavior. Experimental results show a forecast accuracy comparable with well-established statistical models such as exponential smoothing or ARIMA. Furthermore, a publicly available software implementing all the proposed strategies has been developed and is described in the paper.

[958] Lossless Embedding Compression via Spherical Coordinates

Han Xiao

Main category: cs.LG

TL;DR: Lossless compression method for unit-norm embeddings achieves 1.5× compression by exploiting spherical coordinate concentration around π/2, enabling entropy coding without training.

Details

Motivation: The need for efficient storage and transmission of high-dimensional embeddings used in various AI applications, where current compression methods are suboptimal for unit-norm vectors.

Method: Exploits the geometric property that spherical coordinates of high-dimensional unit vectors concentrate around π/2, causing IEEE 754 exponents to collapse to a single value, which enables effective entropy coding for lossless compression.

Result: Achieves 1.5× compression ratio, 25% better than prior methods, validated across 26 configurations spanning text, image, and multi-vector embeddings, with full lossless compression within float32 precision.

Conclusion: A novel, training-free lossless compression method for unit-norm embeddings that significantly outperforms existing approaches by leveraging geometric properties of high-dimensional vectors.

Abstract: We present a lossless compression method for unit-norm embeddings that achieves 1.5$\times$ compression, 25% better than the best prior method. The method exploits that spherical coordinates of high-dimensional unit vectors concentrate around $π/2$, causing IEEE 754 exponents to collapse to a single value and enabling entropy coding. Evaluation across 26 configurations spanning text, image, and multi-vector embeddings confirms consistent improvement. The method requires no training and is fully lossless within float32 precision.

[959] Why LoRA Resists Label Noise: A Theoretical Framework for Noise-Robust Parameter-Efficient Fine-Tuning

Brady Steele

Main category: cs.LG

TL;DR: LoRA’s theoretical resistance to label noise explained through rank constraints, optimal rank selection, and temporal separation of learning patterns, leading to RACT method for noise detection.

Details

Motivation: To theoretically explain LoRA's underexplored property of inherent resistance to label noise and leverage this understanding for practical noise detection methods.

Method: Theoretical analysis of LoRA’s memorization capacity constraints, derivation of optimal rank selection, and development of RACT (Rank-Aware Curriculum Training) that exploits rank discrepancy for noise detection.

Result: RACT achieves 91.1% F1 for noise detection on AG News while maintaining 91.46% accuracy, competitive with baselines lacking noise detection capability.

Conclusion: LoRA’s low-rank structure provides inherent robustness to label noise, enabling effective noise detection through rank-aware training strategies.

Abstract: Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become the dominant paradigm for adapting large pretrained models. We present a theoretical framework explaining an underexplored property: LoRA’s inherent resistance to label noise. Our analysis reveals three key insights. First, we prove that rank-$r$ LoRA cannot memorize all possible label assignments once the sample size exceeds $O(r(d+k-r))$, limiting its capacity to fit arbitrary noise. Second, we derive an optimal rank balancing approximation bias and noise-induced variance, showing it decreases with noise rate. Third, we establish temporal separation: clean patterns are learned early while noise memorization occurs later. We propose RACT (Rank-Aware Curriculum Training), leveraging rank discrepancy for noise detection. Experiments validate our predictions, with RACT achieving 91.1% F1 for noise detection on AG News while maintaining 91.46% accuracy, competitive with baselines that lack noise detection capability.

[960] CARE-RFT: Confidence-Anchored Reinforcement Finetuning for Reliable Reasoning in Large Language Models

Shuozhe Li, Jincheng Cao, Bodun Hu, Aryan Mokhtari, Leqi Liu, Amy Zhang

Main category: cs.LG

TL;DR: CARE-RFT introduces confidence-anchored regularization using skew reverse KL divergence to balance reasoning performance and trustworthiness in reinforcement fine-tuned LLMs.

Details

Motivation: Standard reinforcement finetuning (RFT) improves reasoning but harms trustworthiness (hallucination, calibration), while KL-constrained RFT preserves trustworthiness but limits reasoning gains. Need method that achieves both strong reasoning and trustworthiness.

Method: CARE-RFT replaces standard reverse KL regularization with skew reverse KL divergence. This provides confidence-sensitive penalty: bounded for confident, consistently rewarded explorations (enables reasoning) but unbounded elsewhere (preserves calibration).

Result: Extensive experiments across multiple model scales and RFT algorithms show CARE-RFT matches reasoning performance of unconstrained RFT while recovering trustworthiness and calibration of base model.

Conclusion: Confidence-aware regularization is key to building both capable and trustworthy reasoning models. CARE-RFT resolves tension between reasoning performance and trustworthiness in reinforcement finetuning.

Abstract: Reinforcement finetuning (RFT) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, we identify a critical trade-off: while unconstrained RFT achieves strong reasoning performance, it severely compromises model trustworthiness by amplifying hallucination and worsening calibration; conversely, RKL-constrained RFT preserves trustworthiness but limits reasoning gains due to its unbounded penalty on exploratory deviations. To resolve this tension, we introduce CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning), a novel method that replaces standard reverse KL regularization with a skew reverse KL divergence. CARE-RFT provides a confidence-sensitive penalty: it is bounded for confident, consistently rewarded explorations to enable reasoning, while unbounded elsewhere to preserve calibration. Extensive experiments across multiple model scales and RFT algorithms show that CARE-RFT achieves a superior balance, matching the reasoning performance of unconstrained RFT while recovering the trustworthiness and calibration of the base model. Our work establishes that careful, confidence-aware regularization is key to building both capable and trustworthy reasoning models.

[961] David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Samuel Nellessen, Tal Kachman

Main category: cs.LG

TL;DR: Slingshot framework discovers Tag-Along Attacks where adversaries exploit tool privileges of safety-aligned LLM agents through short instruction-like patterns, achieving high success rates across various models.

Details

Motivation: As LLMs evolve into autonomous agents with tool access, new adversarial threats emerge that exploit legitimate tool privileges, transforming safety evaluation from subjective NLP tasks into objective control problems.

Method: Introduces Slingshot, a ‘cold-start’ reinforcement learning framework that autonomously discovers emergent attack vectors (Tag-Along Attacks) where tool-less adversaries exploit trusted privileges of safety-aligned Operators through conversation alone.

Result: Achieves 67.0% success rate against Qwen2.5-32B-Instruct-AWQ (vs. 1.7% baseline), reduces expected attempts to first success from 52.3 to 1.3, and transfers zero-shot to other models including Gemini 2.5 Flash (56.0%) and Meta-SecAlign-8B (39.2%).

Conclusion: Establishes Tag-Along Attacks as a first-class, verifiable threat model showing effective agentic attacks can be elicited from off-the-shelf models through environment interaction alone, with attacks converging to short instruction-like patterns.

Abstract: The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary “tags along” on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a ‘cold-start’ reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.

[962] ECCO: Evidence-Driven Causal Reasoning for Compiler Optimization

Haolin Pan, Lianghong Huang, Jinyuan Dong, Mingjie Xing, Yanjun Wu

Main category: cs.LG

TL;DR: ECCO is a compiler auto-tuning framework that combines interpretable LLM reasoning with genetic algorithm search, achieving significant performance improvements over traditional baselines.

Details

Motivation: Traditional compiler auto-tuning methods lack semantic guidance (black-box search), while recent LLM approaches suffer from superficial pattern matching and causal opacity. There's a need to bridge interpretable reasoning with combinatorial search.

Method: 1) Reverse engineering to create Chain-of-Thought dataset mapping static code features to verifiable performance evidence; 2) Collaborative inference where LLM acts as strategist defining optimization intents that dynamically guide genetic algorithm mutation operations.

Result: On seven datasets, ECCO significantly outperforms LLVM opt -O3 baseline, achieving average 24.44% reduction in cycles.

Conclusion: ECCO successfully bridges interpretable reasoning with combinatorial search, enabling LLMs to learn causal logic of optimization decisions rather than superficial patterns, leading to substantial performance gains in compiler auto-tuning.

Abstract: Compiler auto-tuning faces a dichotomy between traditional black-box search methods, which lack semantic guidance, and recent Large Language Model (LLM) approaches, which often suffer from superficial pattern matching and causal opacity. In this paper, we introduce ECCO, a framework that bridges interpretable reasoning with combinatorial search. We first propose a reverse engineering methodology to construct a Chain-of-Thought dataset, explicitly mapping static code features to verifiable performance evidence. This enables the model to learn the causal logic governing optimization decisions rather than merely imitating sequences. Leveraging this interpretable prior, we design a collaborative inference mechanism where the LLM functions as a strategist, defining optimization intents that dynamically guide the mutation operations of a genetic algorithm. Experimental results on seven datasets demonstrate that ECCO significantly outperforms the LLVM opt -O3 baseline, achieving an average 24.44% reduction in cycles.

[963] From Numbers to Prompts: A Cognitive Symbolic Transition Mechanism for Lightweight Time-Series Forecasting

Namkyung Yoon, Hwangnam Kim

Main category: cs.LG

TL;DR: STM is a symbolic abstraction framework that enables efficient time series prediction using small language models by converting numeric data to symbolic tokens and using prompt engineering.

Details

Motivation: Large language models show promise for time series prediction but have high computational costs that limit deployment on lightweight platforms. There's a need for efficient methods that can leverage language models' capabilities while reducing resource requirements.

Method: Proposes Symbolic Transition Mechanism (STM) that transforms continuous time series values into symbol tokens using quantization based on human cognitive structures. It captures temporal dynamics through structured symbol transformations and uses prompt engineering to focus language models on critical parts of time series data.

Result: STM achieves error reductions up to 69% in MAE and 90% in MSE compared to baseline small language models without STM. These improvements come with negligible resource costs: maximum GPU memory increases by ~0.06% and latency overhead by only 0.64%.

Conclusion: STM demonstrates potential as an efficient, adaptable layer for symbol-driven time series prediction using foundation models, enabling deployment on resource-constrained platforms while maintaining or improving accuracy.

Abstract: Large language models have achieved remarkable success in time series prediction tasks, but their substantial computational and memory requirements limit deployment on lightweight platforms. In this paper, we propose the Symbolic Transition Mechanism (STM) a novel framework that bridges numeric time series data and language models through symbolic abstraction and prompt engineering. STM transforms continuous time series values into symbol tokens with quantization techniques based on human cognitive structures, and captures temporal dynamics through structured transformations of symbols, enabling fast engineering based predictions in which language models focus on critical parts of time series data. STM is a general purpose mechanisms that ensure the integrity of backbone language models, but they significantly improve their efficiency by inferring the dynamic and structured patterns inherent in time series data. We evaluated STM on various time series datasets, paired with four small language models (SLM) with limited computational environments. For all models, STM achieves error reductions of up to 69% in MAE and 90% in MSE compared to the default backbone SLM without STM. These results demonstrate the potential of STM as an efficient, adaptable layer for symbol-driven time series prediction using foundation models. The accuracy improvements were made at negligible resource costs, with maximum GPU memory of the base model increasing by approximately 0.06% and latency overhead increasing by only 0.64%.

[964] DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid, Interpretable Tuberculosis Screening

Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang, Chong Li, Mian Zhou, Tenglong Li, Jionglong Su

Main category: cs.LG

TL;DR: DeepGB-TB: A multimodal AI system using cough audio and demographic data for tuberculosis screening with cross-modal attention and risk-balanced loss for clinical deployment.

Details

Motivation: Traditional TB diagnostics are expensive and operationally complex, creating need for AI solutions that can provide instant, non-invasive screening using accessible data like cough audio and basic demographics.

Method: Combines lightweight 1D CNN for audio processing with gradient-boosted decision tree for tabular features, using Cross-Modal Bidirectional Cross-Attention (CM-BCA) to exchange cues between modalities, and Tuberculosis Risk-Balanced Loss (TRBL) to penalize false negatives.

Result: Achieved AUROC of 0.903 and F1-score of 0.851 on diverse dataset of 1,105 patients across seven countries, enabling real-time offline inference on mobile devices.

Conclusion: DeepGB-TB offers a practical tool for global TB control by coupling AI innovation with public-health requirements for speed, affordability, and reliability, with clinically validated explanations for health worker trust.

Abstract: Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly assigns TB risk scores using only cough audio and basic demographic data. The model couples a lightweight one-dimensional convolutional neural network for audio processing with a gradient-boosted decision tree for tabular features. Its principal innovation is a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges salient cues between modalities, emulating the way clinicians integrate symptoms and risk factors. To meet the clinical priority of minimizing missed cases, we design a Tuberculosis Risk-Balanced Loss (TRBL) that places stronger penalties on false-negative predictions, thereby reducing high-risk misclassifications. DeepGB-TB is evaluated on a diverse dataset of 1,105 patients collected across seven countries, achieving an AUROC of 0.903 and an F1-score of 0.851, representing a new state of the art. Its computational efficiency enables real-time, offline inference directly on common mobile devices, making it ideal for low-resource settings. Importantly, the system produces clinically validated explanations that promote trust and adoption by frontline health workers. By coupling AI innovation with public-health requirements for speed, affordability, and reliability, DeepGB-TB offers a tool for advancing global TB control.

[965] Interpreting and Controlling Model Behavior via Constitutions for Atomic Concept Edits

Neha Kalibhat, Zi Wang, Prasoon Bajpai, Drew Proud, Wenjun Zeng, Been Kim, Mani Malek

Main category: cs.LG

TL;DR: A black-box interpretability framework that learns verifiable natural language constitutions explaining how prompt changes affect model behavior across tasks like mathematical reasoning and text-to-image generation.

Details

Motivation: To develop interpretable methods for understanding how specific prompt modifications affect model behaviors (alignment, correctness, constraints) in a black-box setting, enabling better control and insight into model decision-making processes.

Method: Uses Atomic Concept Edits (ACEs) - targeted operations that add, remove, or replace interpretable concepts in input prompts. Systematically applies ACEs and observes effects on model behavior across tasks to learn causal mappings from edits to outcomes.

Result: Framework validated across diverse tasks: text-to-image generation (GPT-Image focuses on grammatical adherence, Imagen 4 on atmospheric coherence) and mathematical reasoning (distractor variables confuse GPT-5 but not Gemini 2.5 or o4-mini). Learned constitutions provide 1.86x boost in success rate for controlling model behavior compared to methods without constitutions.

Conclusion: The framework successfully learns verifiable constitutions that provide deep, generalizable insights into model behavior and enable effective control, demonstrating practical utility for understanding and manipulating black-box models across different modalities.

Abstract: We introduce a black-box interpretability framework that learns a verifiable constitution: a natural language summary of how changes to a prompt affect a model’s specific behavior, such as its alignment, correctness, or adherence to constraints. Our method leverages atomic concept edits (ACEs), which are targeted operations that add, remove, or replace an interpretable concept in the input prompt. By systematically applying ACEs and observing the resulting effects on model behavior across various tasks, our framework learns a causal mapping from edits to predictable outcomes. This learned constitution provides deep, generalizable insights into the model. Empirically, we validate our approach across diverse tasks, including mathematical reasoning and text-to-image alignment, for controlling and understanding model behavior. We found that for text-to-image generation, GPT-Image tends to focus on grammatical adherence, while Imagen 4 prioritizes atmospheric coherence. In mathematical reasoning, distractor variables confuse GPT-5 but leave Gemini 2.5 models and o4-mini largely unaffected. Moreover, our results show that the learned constitutions are highly effective for controlling model behavior, achieving an average of 1.86 times boost in success rate over methods that do not use constitutions.

[966] Trade-offs Between Individual and Group Fairness in Machine Learning: A Comprehensive Review

Sandra Benítez-Peña, Blas Kolic, Victoria Menendez, Belén Pulido

Main category: cs.LG

TL;DR: Survey paper examining hybrid fairness approaches that jointly address Group Fairness (GF) and Individual Fairness (IF), providing systematic review of methods, trade-offs, and open research directions.

Details

Motivation: Algorithmic fairness is crucial for ethical computational decision-making, but existing approaches typically treat Group Fairness (demographic disparities) and Individual Fairness (consistent treatment of similar individuals) in isolation. There's a need to understand methods that integrate both perspectives and characterize their trade-offs.

Method: Systematic survey methodology: organizing existing hybrid fairness methods by fairness mechanisms and algorithmic strategies, examining theoretical foundations, optimization mechanisms, empirical evaluation practices, and limitations.

Result: Comprehensive review of hybrid fairness approaches, classification of methods, analysis of trade-offs between GF and IF, identification of limitations in current approaches, and discussion of challenges in developing principled hybrid methods.

Conclusion: Hybrid fairness methods that integrate both group and individual perspectives are essential for comprehensive algorithmic fairness, but significant challenges remain in developing context-aware approaches that provide reliable guarantees at both levels.

Abstract: Algorithmic fairness has become a central concern in computational decision-making systems, where ensuring equitable outcomes is essential for both ethical and legal reasons. Two dominant notions of fairness have emerged in the literature: Group Fairness (GF), which focuses on mitigating disparities across demographic subpopulations, and Individual Fairness (IF), which emphasizes consistent treatment of similar individuals. These notions have traditionally been studied in isolation. In contrast, this survey examines methods that jointly address GF and IF, integrating both perspectives within unified frameworks and explicitly characterizing the trade-offs between them. We provide a systematic and critical review of hybrid fairness approaches, organizing existing methods according to the fairness mechanisms they employ and the algorithmic and mathematical strategies used to reconcile multiple fairness criteria. For each class of methods, we examine their theoretical foundations, optimization mechanisms, and empirical evaluation practices, and discuss their limitations. Additionally, we discuss the challenges and identify open research directions for developing principled, context-aware hybrid fairness methods. By synthesizing insights across the literature, this survey aims to serve as a comprehensive resource for researchers and practitioners seeking to design hybrid algorithms that provide reliable fairness guarantees at both the individual and group levels.

[967] Gauss-Newton Natural Gradient Descent for Shape Learning

James King, Arturs Berzins, Siddhartha Mishra, Marius Zeinhofer

Main category: cs.LG

TL;DR: Gauss-Newton method improves optimization for shape learning tasks like implicit neural surfaces, offering faster convergence and better accuracy than first-order methods.

Details

Motivation: Shape learning tasks (implicit neural surfaces, geometry-informed neural networks) face optimization challenges including ill-conditioning of differential constraints and parameter/function space mismatch, which standard first-order methods struggle with.

Method: Applies Gauss-Newton optimization method to shape learning problems, addressing the specific mathematical structure of these optimization tasks to overcome conditioning issues and space mismatches.

Result: Significantly faster and more stable convergence than standard first-order methods, requiring far fewer iterations while improving both training speed and final solution accuracy across benchmark shape optimization tasks.

Conclusion: Gauss-Newton method is highly effective for shape learning optimization, offering substantial improvements over traditional approaches for problems involving implicit neural surfaces and geometry-informed neural networks.

Abstract: We explore the use of the Gauss-Newton method for optimization in shape learning, including implicit neural surfaces and geometry-informed neural networks. The method addresses key challenges in shape learning, such as the ill-conditioning of the underlying differential constraints and the mismatch between the optimization problem in parameter space and the function space where the problem is naturally posed. This leads to significantly faster and more stable convergence than standard first-order methods, while also requiring far fewer iterations. Experiments across benchmark shape optimization tasks demonstrate that the Gauss-Newton method consistently improves both training speed and final solution accuracy.

[968] THDC: Training Hyperdimensional Computing Models with Backpropagation

Hanne Dejonghe, Sam Leroux

Main category: cs.LG

TL;DR: THDC introduces trainable hyperdimensional computing with backpropagation, reducing dimensionality from 10,000 to 64 while maintaining or improving accuracy on image datasets.

Details

Motivation: Standard HDC has limitations: ultra-high dimensionality (10,000+ dimensions) and static random initialization reduce memory efficiency and learning capacity. There's a need for more efficient and trainable HDC approaches.

Method: Proposes Trainable Hyperdimensional Computing (THDC) with end-to-end training via backpropagation. Replaces random hypervectors with trainable embeddings and uses a one-layer binary neural network to optimize class representations.

Result: Achieves equal or better accuracy than state-of-the-art HDC on MNIST, Fashion-MNIST, and CIFAR-10 datasets while dramatically reducing dimensionality from 10,000 to just 64 dimensions.

Conclusion: THDC enables efficient hyperdimensional computing with trainable parameters, significantly reducing memory requirements while maintaining or improving performance, making HDC more practical for resource-constrained devices.

Abstract: Hyperdimensional computing (HDC) offers lightweight learning for energy-constrained devices by encoding data into high-dimensional vectors. However, its reliance on ultra-high dimensionality and static, randomly initialized hypervectors limits memory efficiency and learning capacity. Therefore, we propose Trainable Hyperdimensional Computing (THDC), which enables end-to-end HDC via backpropagation. THDC replaces randomly initialized vectors with trainable embeddings and introduces a one-layer binary neural network to optimize class representations. Evaluated on MNIST, Fashion-MNIST and CIFAR-10, THDC achieves equal or better accuracy than state-of-the-art HDC, with dimensionality reduced from 10.000 to 64.

[969] Predicting Mortgage Default with Machine Learning: AutoML, Class Imbalance, and Leakage Control

Xianghong Hu, Tianning Xu, Ying Chen, Shuai Wang

Main category: cs.LG

TL;DR: Machine learning approaches for mortgage default prediction with emphasis on addressing real-world challenges: ambiguous labeling, class imbalance, and temporal information leakage.

Details

Motivation: Mortgage default prediction is crucial for financial risk management, but real-world datasets present three major challenges that undermine evaluation validity and deployment reliability: ambiguous default labeling, severe class imbalance, and information leakage from temporal structure and post-event variables.

Method: The study compares multiple machine learning approaches using a real-world loan-level dataset with emphasis on leakage control and imbalance handling. Key methodological components include: leakage-aware feature selection, strict temporal split constraining both origination and reporting periods, and controlled downsampling of the majority class. An AutoML approach (AutoGluon) is evaluated alongside other models.

Result: Performance remains stable across multiple positive-to-negative ratios, and AutoGluon achieves the strongest AUROC among the models evaluated. The study demonstrates that proper handling of real-world dataset challenges is crucial for reliable mortgage default prediction.

Conclusion: Proper methodology addressing labeling ambiguity, class imbalance, and temporal leakage is essential for valid mortgage default prediction. AutoML approaches can achieve strong performance when combined with rigorous data handling techniques. The work will be extended in a pedagogical book chapter format.

Abstract: Mortgage default prediction is a core task in financial risk management, and machine learning models are increasingly used to estimate default probabilities and provide interpretable signals for downstream decisions. In real-world mortgage datasets, however, three factors frequently undermine evaluation validity and deployment reliability: ambiguity in default labeling, severe class imbalance, and information leakage arising from temporal structure and post-event variables. We compare multiple machine learning approaches for mortgage default prediction using a real-world loan-level dataset, with emphasis on leakage control and imbalance handling. We employ leakage-aware feature selection, a strict temporal split that constrains both origination and reporting periods, and controlled downsampling of the majority class. Across multiple positive-to-negative ratios, performance remains stable, and an AutoML approach (AutoGluon) achieves the strongest AUROC among the models evaluated. An extended and pedagogical version of this work will appear as a book chapter.

[970] Investigating Modality Contribution in Audio LLMs for Music

Giovana Morais, Magdalena Fuentes

Main category: cs.LG

TL;DR: Investigates whether Audio LLMs truly listen to audio or rely on text by quantifying modality contributions using MM-SHAP framework on MuChoMusic benchmark.

Details

Motivation: To determine if Audio Large Language Models are genuinely processing audio content or just using textual reasoning, as recent benchmarks raise questions about their true multimodal capabilities.

Method: Adapts MM-SHAP framework (performance-agnostic Shapley values) to quantify relative contributions of audio and text modalities to model predictions. Evaluates two models on MuChoMusic benchmark.

Result: Higher-accuracy models rely more on text, but even with low overall audio contribution, models can successfully localize key sound events, indicating audio is not entirely ignored.

Conclusion: First application of MM-SHAP to Audio LLMs reveals nuanced modality usage; provides foundational step for explainable AI research in audio understanding.

Abstract: Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model’s output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model’s prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.

[971] MiniTensor: A Lightweight, High-Performance Tensor Operations Library

Soumyadip Sarkar

Main category: cs.LG

TL;DR: MiniTensor is an open-source tensor library with PyTorch-like Python API and Rust backend, focusing on minimalism, correctness, and performance with a tiny footprint compared to mainstream frameworks.

Details

Motivation: To create a lightweight tensor operations library that maintains essential functionality for research while being orders of magnitude smaller than frameworks like PyTorch and TensorFlow, addressing bloat and complexity issues in mainstream deep learning libraries.

Method: Developed a dual-language architecture with PyTorch-like Python API for usability and Rust engine for performance-critical operations. Features include n-dimensional tensors, broadcasting, reductions, matrix multiplication, reverse-mode automatic differentiation, neural network layers, and optimizers with efficient memory management and dynamic computation graphs.

Result: Achieves package size of only a few megabytes (orders of magnitude smaller than PyTorch/TensorFlow) while preserving essential functionality for CPU-based research and development. Successfully integrates Python and Rust via PyO3.

Conclusion: MiniTensor demonstrates that a minimalist tensor library can provide sufficient functionality for research while dramatically reducing installation footprint, offering an alternative to bloated mainstream frameworks.

Abstract: We present MiniTensor, an open source tensor operations library that focuses on minimalism, correctness, and performance. MiniTensor exposes a familiar PyTorch-like Python API while it executes performance critical code in a Rust engine. The core supports dense $n$ dimensional tensors, broadcasting, reductions, matrix multiplication, reverse mode automatic differentiation, a compact set of neural network layers, and standard optimizers. In this paper, we describe the design of MiniTensor’s architecture, including its efficient memory management, dynamic computation graph for gradients, and integration with Python via PyO3. We also compare the install footprint with PyTorch and TensorFlow to demonstrate that MiniTensor achieves a package size of only a few megabytes, several orders of magnitude smaller than mainstream frameworks, while preserving the essentials needed for research and development on CPUs. The repository can be found at https://github.com/neuralsorcerer/minitensor

[972] Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies

Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık

Main category: cs.LG

TL;DR: MASS is a framework that automates the design of multi-agent systems by optimizing prompts and topologies through a three-stage interleaved optimization process.

Details

Motivation: Designing effective multi-agent systems with LLMs is complex, requiring careful prompt engineering and topology design. Current approaches lack systematic optimization methods for both prompts and topologies together.

Method: Proposes Multi-Agent System Search (MASS) with three interleaved optimization stages: 1) block-level prompt optimization, 2) workflow topology optimization, and 3) workflow-level prompt optimization, where each stage conditions on previous optimizations.

Result: MASS-optimized multi-agent systems outperform existing alternatives by a substantial margin, and the framework reveals design principles for effective MAS.

Conclusion: MASS provides an automated framework for optimizing both prompts and topologies in multi-agent systems, leading to more effective designs and revealing systematic design principles.

Abstract: Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.

[973] Attention Isn’t All You Need for Emotion Recognition:Domain Features Outperform Transformers on the EAV Dataset

Anmol Guragain

Main category: cs.LG

TL;DR: Complex attention mechanisms underperform on small emotion recognition datasets; simple domain-specific modifications to audio, EEG, and vision models yield better results than architectural complexity.

Details

Motivation: To investigate whether sophisticated attention mechanisms improve multimodal emotion recognition performance on small datasets, or if simpler domain-appropriate modifications are more effective.

Method: Systematic study using EAV dataset with three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Tested audio (MFCCs, delta MFCCs), EEG (frequency-domain features), and vision (ViT, CNN) modalities.

Result: Complex attention mechanisms (M2) underperformed by 5-13 percentage points due to overfitting. Simple modifications worked best: adding delta MFCCs improved audio accuracy from 61.9% to 65.56%; frequency-domain features for EEG achieved 67.62% (+7.62pp); vision transformer baseline reached 75.30% (exceeding ViViT’s 74.5%).

Conclusion: For small-scale multimodal emotion recognition, domain knowledge and proper implementation outperform architectural complexity; simple domain-specific feature engineering and appropriate pretraining yield better results than sophisticated attention mechanisms.

Abstract: We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets. M2 models achieved 5 to 13 percentage points below baselines due to overfitting and destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper’s ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.

[974] ALIGN: Aligned Delegation with Performance Guarantees for Multi-Agent LLM Reasoning

Tong Zhu, Baiting Chen, Jin Zhou, Hua Zhou, Sriram Sankararaman, Xiaowu Dai

Main category: cs.LG

TL;DR: ALIGN formulates LLM reasoning as an aligned delegation game where a principal delegates tasks to multiple agents with designed incentives, then selects among their outputs, providing theoretical guarantees for improved performance over single-agent approaches.

Details

Motivation: Current LLMs underperform on complex reasoning tasks with single generation pipelines, and existing ensemble methods treat candidates independently without formal guarantees of improved reasoning quality.

Method: ALIGN formulates reasoning as an aligned delegation game: a principal delegates tasks to multiple agents that generate candidate solutions under designed incentives, then selects among outputs to produce final answer, inducing structured interaction while preserving alignment.

Result: Theoretical guarantees show ALIGN provably improves expected performance over single-agent generation under fair comparison. Empirical results across diverse LLM reasoning benchmarks demonstrate ALIGN outperforms strong single-agent and ensemble baselines.

Conclusion: ALIGN provides a principled framework for multi-agent LLM reasoning with formal performance guarantees, addressing limitations of independent ensemble methods through structured interaction and aligned delegation.

Abstract: LLMs often underperform on complex reasoning tasks when relying on a single generation-and-selection pipeline. Inference-time ensemble methods can improve performance by sampling diverse reasoning paths or aggregating multiple candidate answers, but they typically treat candidates independently and provide no formal guarantees that ensembling improves reasoning quality. We propose a novel method, Aligned Delegation for Multi-Agent LLM Reasoning (ALIGN), which formulates LLM reasoning as an aligned delegation game. In ALIGN, a principal delegates a task to multiple agents that generate candidate solutions under designed incentives, and then selects among their outputs to produce a final answer. This formulation induces structured interaction among agents while preserving alignment between agent and principal objectives. We establish theoretical guarantees showing that, under a fair comparison with equal access to candidate solutions, ALIGN provably improves expected performance over single-agent generation. Our analysis accommodates correlated candidate answers and relaxes independence assumptions that are commonly used in prior work. Empirical results across a broad range of LLM reasoning benchmarks consistently demonstrate that ALIGN outperforms strong single-agent and ensemble baselines.

[975] Quantum Model Parallelism for MRI-Based Classification of Alzheimer’s Disease Stages

Emine Akpinar, Murat Oduncuoglu

Main category: cs.LG

TL;DR: Quantum-based parallel model for Alzheimer’s disease stage classification using MRI data, leveraging quantum computing advantages for higher accuracy with fewer parameters.

Details

Motivation: Alzheimer's disease is a major global health concern requiring faster, more efficient diagnostic approaches. Classical AI methods face limitations with growing data volumes and computational constraints, while quantum computing offers advantages for handling high-dimensional, heterogeneous, and noisy medical data.

Method: Proposed Quantum-Based Parallel Model (QBPM) architecture inspired by classical model parallelism. Uses two distinct quantum circuits with rotational and entanglement blocks running in parallel on a quantum simulator for AD stage classification from MRI datasets.

Result: Achieved high classification accuracy across two different datasets, demonstrating robustness and generalization. Performed well under high-level Gaussian noise simulating real-world conditions. Outperformed five classical transfer learning methods with higher accuracy and comparable execution time using fewer circuit parameters.

Conclusion: QBPM represents an innovative and powerful approach for classifying stages in complex diseases like Alzheimer’s, offering quantum advantages as an efficient alternative to classical methods for medical image analysis.

Abstract: With increasing life expectancy, AD has become a major global health concern. While classical AI-based methods have been developed for early diagnosis and stage classification of AD, growing data volumes and limited computational resources necessitate faster, more efficient approaches. Quantum-based AI methods, which leverage superposition and entanglement principles along with high-dimensional Hilbert space, can surpass classical approaches’ limitations and offer higher accuracy for high-dimensional, heterogeneous, and noisy data. In this study, a Quantum-Based Parallel Model (QBPM) architecture is proposed for the efficient classification of AD stages using MRI datasets, inspired by the principles of classical model parallelism. The proposed model leverages quantum advantages by employing two distinct quantum circuits, each incorporating rotational and entanglement blocks, running in parallel on the same quantum simulator. The classification performance of the model was evaluated on two different datasets to assess its overall robustness and generalization capability. The proposed model demonstrated high classification accuracy across both datasets, highlighting its overall robustness and generalization capability. Results obtained under high-level Gaussian noise, simulating real-world conditions, further provided experimental evidence for the model’s applicability not only in theoretical but also in practical scenarios. Moreover, compared with five different classical transfer learning methods, the proposed model demonstrated its efficiency as an alternative to classical approaches by achieving higher classification accuracy and comparable execution time while utilizing fewer circuit parameters. The results indicate that the proposed QBPM architecture represents an innovative and powerful approach for the classification of stages in complex diseases such as Alzheimer’s.

[976] Monte Carlo Tree Search for Execution-Guided Program Repair with Large Language Models

Yixuan Liang

Main category: cs.LG

TL;DR: CodePilot combines Monte Carlo Tree Search with LLMs for execution-guided program repair at repository level, achieving 24.67% issue resolution on SWE-bench Lite.

Details

Motivation: Automated program repair with LLMs is challenging at repository level due to long-horizon reasoning requirements and limitations of autoregressive decoding. Current approaches struggle with real-world GitHub issues that require understanding complex codebases and generating correct patches.

Method: Hybrid framework integrating Monte Carlo Tree Search with LLMs. Performs hierarchical fault localization (repository → file → function), explores diverse patch trajectories using MCTS, leverages execution feedback as reward signal to guide search and refinement, and incorporates confidence-calibrated generation to selectively refine low-confidence outputs.

Result: Achieves 24.67% issue resolution rate on SWE-bench Lite using open-weight models, outperforming comparable baselines. Demonstrates effectiveness of combining symbolic search with neural language models for software engineering automation.

Conclusion: Combining symbolic search (MCTS) with neural language models is an effective strategy for scalable, execution-aware software engineering automation, enabling better program repair at repository level.

Abstract: Automated program repair with large language models remains challenging at the repository level due to long-horizon reasoning requirements and the limitations of autoregressive decoding. We present CodePilot, a hybrid framework that integrates Monte Carlo Tree Search (MCTS) with large language models to enable execution-guided program repair for real-world GitHub issues. CodePilot performs hierarchical fault localization from repository to file and function level, explores diverse patch trajectories using MCTS, and leverages execution feedback as a reward signal to guide search and refinement. The framework further incorporates confidence-calibrated generation to selectively refine low-confidence outputs. Experiments on SWE-bench Lite demonstrate that CodePilot achieves a 24.67% issue resolution rate using open-weight models, outperforming comparable baselines. These results suggest that combining symbolic search with neural language models is an effective strategy for scalable, execution-aware software engineering automation.

[977] On the Relationship Between Representation Geometry and Generalization in Deep Neural Networks

Sumit Yadav

Main category: cs.LG

TL;DR: Effective dimension, an unsupervised geometric metric, strongly predicts neural network performance across vision and NLP tasks, with bidirectional causal relationships established between geometry and accuracy.

Details

Motivation: To understand the relationship between representation geometry and neural network performance, investigating whether unsupervised geometric metrics can predict model accuracy across domains without requiring labels.

Method: Analyzed 52 pretrained ImageNet models across 13 architecture families, measuring effective dimension and total compression. Conducted causal experiments by degrading geometry via noise (Gaussian, Uniform, Dropout, Salt-and-pepper) and improving geometry via PCA. Extended analysis to NLP with 8 encoder models on SST-2/MNLI and 15 decoder-only LLMs on AG News.

Result: Effective dimension strongly predicts accuracy (partial r=0.75, p<10^(-10)) after controlling for model capacity. Bidirectional causality established: degrading geometry via noise causes accuracy loss (r=-0.94, p<10^(-9)), while improving geometry via PCA maintains accuracy (-0.03pp at 95% variance). Results generalize across ImageNet, CIFAR-10, and NLP tasks.

Conclusion: Effective dimension provides domain-agnostic predictive and causal information about neural network performance, computed entirely without labels, establishing a fundamental relationship between representation geometry and model effectiveness.

Abstract: We investigate the relationship between representation geometry and neural network performance. Analyzing 52 pretrained ImageNet models across 13 architecture families, we show that effective dimension – an unsupervised geometric metric – strongly predicts accuracy. Output effective dimension achieves partial r=0.75 ($p < 10^(-10)$) after controlling for model capacity, while total compression achieves partial r=-0.72. These findings replicate across ImageNet and CIFAR-10, and generalize to NLP: effective dimension predicts performance for 8 encoder models on SST-2/MNLI and 15 decoder-only LLMs on AG News (r=0.69, p=0.004), while model size does not (r=0.07). We establish bidirectional causality: degrading geometry via noise causes accuracy loss (r=-0.94, $p < 10^(-9)$), while improving geometry via PCA maintains accuracy across architectures (-0.03pp at 95% variance). This relationship is noise-type agnostic – Gaussian, Uniform, Dropout, and Salt-and-pepper noise all show $|r| > 0.90$. These results establish that effective dimension provides domain-agnostic predictive and causal information about neural network performance, computed entirely without labels.

[978] RAPTOR: Ridge-Adaptive Logistic Probes

Ziqi Gao, Yaotian Zhu, Qingcheng Zeng, Xu Zhao, Ziqing Wang, Feng Ruan, Kaize Ding

Main category: cs.LG

TL;DR: RAPTOR is a ridge-regularized logistic probe for extracting concept vectors from frozen LLMs that balances accuracy, directional stability, and computational efficiency for activation steering applications.

Details

Motivation: Current probe-then-steer pipelines need concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Existing methods don't optimally balance these desiderata.

Method: RAPTOR uses L2-regularized logistic regression with validation-tuned ridge strength to extract concept vectors from normalized LLM layer representations. The method is analyzed theoretically using the Convex Gaussian Min-max Theorem.

Result: RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost across experiments on instruction-tuned LLMs and human-written concept datasets.

Conclusion: RAPTOR provides an effective, efficient method for extracting concept vectors for activation steering, with theoretical analysis explaining how penalty strength mediates accuracy-stability tradeoffs.

Abstract: Probing studies what information is encoded in a frozen LLM’s layer representations by training a lightweight predictor on top of them. Beyond analysis, probes are often used operationally in probe-then-steer pipelines: a learned concept vector is extracted from a probe and injected via additive activation steering by adding it to a layer representation during the forward pass. The effectiveness of this pipeline hinges on estimating concept vectors that are accurate, directionally stable under ablation, and inexpensive to obtain. Motivated by these desiderata, we propose RAPTOR (Ridge-Adaptive Logistic Probe), a simple L2-regularized logistic probe whose validation-tuned ridge strength yields concept vectors from normalized weights. Across extensive experiments on instruction-tuned LLMs and human-written concept datasets, RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability and substantially lower training cost; these quantitative results are supported by qualitative downstream steering demonstrations. Finally, using the Convex Gaussian Min-max Theorem (CGMT), we provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime, explaining how penalty strength mediates probe accuracy and concept-vector stability and yielding structural predictions that qualitatively align with trends observed on real LLM embeddings.

[979] Sheaf Neural Networks and biomedical applications

Aneeqa Mehrab, Jan Willem Van Looy, Pietro Demurtas, Stefano Iotti, Emil Malucelli, Francesca Rossi, Ferdinando Zanchetta, Rita Fioresi

Main category: cs.LG

TL;DR: Sheaf Neural Networks (SNNs) are introduced as a novel approach that outperforms popular GNNs like GCN, GAT, and GraphSage on biomedical tasks through mathematical sheaf theory.

Details

Motivation: To develop a more theoretically grounded neural network architecture for graph-structured data that can better handle biomedical applications where traditional GNNs may have limitations.

Method: The paper presents the mathematical theory and modeling behind Sheaf Neural Networks (SNNs), building on sheaf theory from algebraic topology to create a novel graph neural network architecture.

Result: SNNs effectively answer biomedical questions in a concrete case study and outperform popular GNNs including Graph Convolutional Networks (GCNs), Graph Attention Networks (GAT), and GraphSage.

Conclusion: Sheaf Neural Networks provide a mathematically rigorous alternative to traditional GNNs that shows superior performance on biomedical graph problems.

Abstract: The purpose of this paper is to elucidate the theory and mathematical modelling behind the sheaf neural network (SNN) algorithm and then show how SNN can effectively answer to biomedical questions in a concrete case study and outperform the most popular graph neural networks (GNNs) as graph convolutional networks (GCNs), graph attention networks (GAT) and GraphSage.

[980] Block removal for large language models through constrained binary optimization

David Jansen, Roman Rausch, David Montero, Roman Orus

Main category: cs.LG

TL;DR: A novel method for compressing LLMs by removing transformer blocks using Ising model optimization, outperforming state-of-the-art block removal methods with up to 6 points improvement on MMLU.

Details

Motivation: Compressing large language models by removing transformer blocks is challenging due to the combinatorial explosion of possible configurations. Existing methods struggle to identify optimal block removal patterns beyond consecutive regions.

Method: Formulates block removal as a constrained binary optimization problem mapped to an Ising model. Uses physical system energies as proxy for downstream performance, enabling efficient ranking of candidate configurations. Requires only forward/backward passes for few active parameters plus an Ising solver.

Result: Outperforms state-of-the-art block removal methods across several benchmarks. Performance gains persist after short retraining, with up to 6 points improvement on MMLU benchmark. Successfully applied to challenging architectures like NVIDIA-Nemotron-3-Nano-30B-A3B-FP8.

Conclusion: The Ising model formulation provides an efficient, generalizable approach for identifying high-quality block removal configurations beyond consecutive patterns, enabling effective LLM compression with minimal computational overhead.

Abstract: Compressing resource-intensive large language models by removing whole transformer blocks is a seemingly simple idea, but identifying which blocks to remove constitutes an exponentially difficult combinatorial problem. In this paper, we formulate block removal as a constrained binary optimization problem that can be mapped to a physical system (Ising model), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations and yields many high-quality, non-trivial solutions beyond consecutive regions. We demonstrate that our approach outperforms state-of-the-art block-removal methods across several benchmarks, with performance gains persisting after short retraining, and reaching improvements of up to 6 points on the MMLU benchmark. Our method requires only forward and backward passes for a few active parameters, together with an (at least approximate) Ising solver, and can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure.

[981] Benford’s Law as a Distributional Prior for Post-Training Quantization of Large Language Models

Arthur Negrão, Pedro Silva, Vander L. S. Freitas, Gladston Moreira, Eduardo Luz

Main category: cs.LG

TL;DR: Benford-Quant: A data-free non-uniform quantization method for LLMs inspired by Benford’s Law, using log-spaced codebooks to better match skewed weight distributions and improve few-bit quantization performance.

Details

Motivation: Standard uniform quantization assumes evenly distributed parameters, but LLM weights actually follow highly skewed distributions. This mismatch leads to suboptimal quantization, especially in aggressive few-bit regimes where resolution is limited.

Method: Proposes Benford-Quant, a non-uniform quantizer that replaces uniform grids with log-spaced codebooks based on Benford’s Law. This dedicates more resolution to frequent small-magnitude weights. The method is data-free and can be hybridized with other quantization techniques like SmoothQuant and Activation-Aware Quantization.

Result: On Small Language Models (SLMs), Benford-Quant consistently improves perplexity, reducing 4-bit perplexity on Gemma-270M by more than 10%. On larger LLMs, it remains competitive, with differences explained by over-parameterization effects. Transformer layers adhere closely to Benford statistics while normalization layers deviate.

Conclusion: Incorporating a Benford-inspired prior into quantization grids is a low-cost modification that yields accuracy gains in aggressive few-bit regimes. While not surpassing state-of-the-art in tasks like perplexity and LAMBADA, it can be hybridized with other quantization methods for potential performance improvements.

Abstract: The rapid growth of Large Language Models (LLMs) intensifies the need for effective compression, with weight quantization being the most widely adopted technique. Standard uniform quantizers assume that parameters are evenly distributed, an assumption at odds with the highly skewed distributions observed in practice. We propose Benford-Quant, a simple, data-free non-uniform quantizer inspired by Benford’s Law, which predicts that leading digits follow a logarithmic distribution. Benford-Quant replaces the uniform grid with a log-spaced codebook, dedicating more resolution to the frequent small-magnitude weights. We provide both theoretical intuition and empirical evidence: (i) weights in transformer transformational layers adhere closely to Benford statistics, while normalization layers systematically deviate; (ii) on Small Language Models (SLMs), Benford-Quant consistently improves perplexity, reducing 4-bit perplexity on Gemma-270M by more than 10%; and (iii) on larger LLMs, it remains competitive, with differences explained by over-parameterization effects. Our results indicate that incorporating a Benford-inspired prior into quantization grids is a low-cost modification that yields accuracy gains in aggressive few-bit regimes. Although it is not able to surpass the state of the art in tasks such as perplexity and LAMBADA, the Benford-Quant approach can be hybridized with other quantization methods-such as SmoothQuant and Activation-Aware Quantization-without major pipeline modification, potentially improving their performance.

[982] Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints

Evan Chen, Wenzhi Fang, Shiqiang Wang, Christopher Brinton

Main category: cs.LG

TL;DR: DA-GRPO enables small language models to intelligently offload tasks to cloud LLMs during continual learning while respecting usage budgets, improving accuracy and reducing forgetting.

Details

Motivation: Small language models on edge devices need cloud assistance for complex tasks but must manage usage budgets. Existing reinforcement learning approaches for cloud offloading cause unstable behavior and catastrophic forgetting during continual learning.

Method: DA-GRPO extends Group Relative Policy Optimization with dual-advantage computation that incorporates cloud-usage constraints directly, avoiding fixed reward shaping. This allows joint learning of task competence and collaboration behavior, enabling natural cloud request emergence during post-training.

Result: Experiments on mathematical reasoning and code generation benchmarks show DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.

Conclusion: DA-GRPO provides an effective framework for small language models to intelligently leverage cloud assistance during continual learning while respecting usage constraints and mitigating catastrophic forgetting.

Abstract: Locally deployed Small Language Models (SLMs) must continually support diverse tasks under strict memory and computation constraints, making selective reliance on cloud Large Language Models (LLMs) unavoidable. Regulating cloud assistance during continual learning is challenging, as naive reward-based reinforcement learning often yields unstable offloading behavior and exacerbates catastrophic forgetting as task distributions shift. We propose DA-GRPO, a dual-advantage extension of Group Relative Policy Optimization that incorporates cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. This design enables the local model to jointly learn task competence and collaboration behavior, allowing cloud requests to emerge naturally during post-training while respecting a prescribed assistance budget. Experiments on mathematical reasoning and code generation benchmarks show that DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.

[983] The Blessing of Dimensionality in LLM Fine-tuning: A Variance-Curvature Perspective

Qiyao Liang, Jinyeop Song, Yizhou Liu, Jeff Gore, Ila Fiete, Risto Miikkulainen, Xin Qiu

Main category: cs.LG

TL;DR: Weight-perturbation evolution strategies can effectively fine-tune billion-parameter LLMs with small populations due to low-dimensional curvature in fine-tuning landscapes, explaining both scalability and non-monotonic training dynamics.

Details

Motivation: The paper aims to explain two puzzling phenomena in fine-tuning large language models: (1) why evolution strategies (ES) can work with surprisingly small populations despite the curse of dimensionality, and (2) why fine-tuning rewards often rise, peak, and then degrade under fixed hyperparameters in both ES and GRPO methods.

Method: The authors use ES as a geometric probe to analyze fine-tuning reward landscapes across multiple benchmarks (GSM8K, ARC-C, WinoGrande) and model scales (Qwen2.5-Instruct 0.5B-7B). They develop a minimal quadratic stochastic-ascent model to capture the rise-then-decay dynamics and analyze the geometric properties of the optimization landscape.

Result: The research shows that fine-tuning landscapes are low-dimensional in curvature, with a small set of high-curvature dimensions dominating improvement. This explains both phenomena: (1) reward-improving perturbations remain accessible with small populations across scales, and (2) heterogeneous time scales produce rise-then-decay dynamics under fixed stochasticity.

Conclusion: Both ES scalability and non-monotonic training dynamics reflect shared geometric properties of fine-tuning landscapes. High-dimensional fine-tuning may admit a broader class of viable optimization methods than worst-case theory implies, suggesting new approaches for multimodal model optimization.

Abstract: Weight-perturbation evolution strategies (ES) can fine-tune billion-parameter language models with surprisingly small populations (e.g., $N!\approx!30$), contradicting classical zeroth-order curse-of-dimensionality intuition. We also observe a second seemingly separate phenomenon: under fixed hyperparameters, the stochastic fine-tuning reward often rises, peaks, and then degrades in both ES and GRPO. We argue that both effects reflect a shared geometric property of fine-tuning landscapes: they are low-dimensional in curvature. A small set of high-curvature dimensions dominates improvement, producing (i) heterogeneous time scales that yield rise-then-decay under fixed stochasticity, as captured by a minimal quadratic stochastic-ascent model, and (ii) degenerate improving updates, where many random perturbations share similar components along these directions. Using ES as a geometric probe on fine-tuning reward landscapes of GSM8K, ARC-C, and WinoGrande across Qwen2.5-Instruct models (0.5B–7B), we show that reward-improving perturbations remain empirically accessible with small populations across scales. Together, these results reconcile ES scalability with non-monotonic training dynamics and suggest that high-dimensional fine-tuning may admit a broader class of viable optimization methods than worst-case theory implies.

[984] Learning Robust Reasoning through Guided Adversarial Self-Play

Shuozhe Li, Vaishnav Tadiparthi, Kwonjoon Lee, Nakul Agarwal, Hossein Nourkhiz Mahjoub, Ehsan Moradi Pari, Lizhang Chen, Amy Zhang, Liu Leqi

Main category: cs.LG

TL;DR: GASP trains robust reasoning models through adversarial self-play where a polluter creates misleading contexts and an agent learns to detect and repair errors, using only outcome verification without human labels.

Details

Motivation: Current reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models but fails catastrophically when conditioning context is fallible (corrupted chain-of-thought, misleading partial solutions, or input perturbations), since standard RLVR only optimizes final-answer correctness under clean conditioning.

Method: GASP (Guided Adversarial Self-Play) creates an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. Uses in-distribution repair guidance - an imitation term on self-generated repairs to increase recovery probability while preserving capabilities.

Result: Across four open-weight models (1.5B-8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.

Conclusion: GASP successfully creates robust reasoning models capable of handling fallible conditioning contexts through adversarial self-play training, demonstrating that explicit training of detect-and-repair capabilities using only outcome verification can significantly improve model robustness.

Abstract: Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, yet they can fail catastrophically when the conditioning context is fallible (e.g., corrupted chain-of-thought, misleading partial solutions, or mild input perturbations), since standard RLVR optimizes final-answer correctness only under clean conditioning. We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities using only outcome verification. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose in-distribution repair guidance, an imitation term on self-generated repairs that increases recovery probability while preserving previously acquired capabilities. Across four open-weight models (1.5B–8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Further analysis shows that adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.

[985] The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

Manyi Li, Yufan Liu, Lai Jiang, Bing Li, Yuming Li, Weiming Hu

Main category: cs.LG

TL;DR: IVO attack framework exposes that unlearning-based defenses in diffusion models only partially disrupt NSFW concept mappings, leaving dormant memories that can be reactivated through initial latent variable optimization.

Details

Motivation: Current unlearning-based defenses claim to purge NSFW concepts from diffusion models, but the authors suspect this "forgetting" is largely an illusion, with knowledge remaining as dormant memories that could be reactivated.

Method: Proposes IVO (Initial Latent Variable Optimization) attack framework that uses Image Inversion, Adversarial Optimization, and Reused Attack to optimize initial latent variables, realigning noise distributions of unlearned models with original unsafe states.

Result: Extensive experiments across 8 unlearning techniques show IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses.

Conclusion: Unlearning-based defenses are fundamentally flawed as they only partially disrupt concept mappings, leaving dormant memories that can be reactivated, necessitating more robust defense mechanisms.

Abstract: Although unlearning-based defenses claim to purge Not-Safe-For-Work (NSFW) concepts from diffusion models (DMs), we reveals that this “forgetting” is largely an illusion. Unlearning partially disrupts the mapping between linguistic symbols and the underlying knowledge, which remains intact as dormant memories. We find that the distributional discrepancy in the denoising process serves as a measurable indicator of how much of the mapping is retained, also reflecting the strength of unlearning. Inspired by this, we propose IVO (Initial Latent Variable Optimization), a concise and powerful attack framework that reactivates these dormant memories by reconstructing the broken mappings. Through Image Inversion}, Adversarial Optimization and Reused Attack, IVO optimizes initial latent variables to realign the noise distribution of unlearned models with their original unsafe states. Extensive experiments across 8 widely used unlearning techniques demonstrate that IVO achieves superior attack success rates and strong semantic consistency, exposing fundamental flaws in current defenses. The code is available at anonymous.4open.science/r/IVO/. Warning: This paper has unsafe images that may offend some readers.

[986] How Understanding Forecast Uncertainty Resolves the Explainability Problem in Machine Learning Models

Joseph L. Breeden

Main category: cs.LG

TL;DR: The paper argues that explanatory instability in local linear explanation methods (like LIME/SHAP) near decision boundaries is actually appropriate because forecast uncertainty is high there, and suggests first assessing forecast usability before seeking explanations.

Details

Motivation: Address concerns about instability in local linear explanation methods (LIME, SHAP) near decision boundaries, which have been criticized for being unreliable in those regions.

Method: Proposes a new sequence: first assess whether a usable forecast exists (with low enough uncertainty), then only seek explanations via local linear approximations when forecasts are usable. When no usable forecast exists, fall back to simpler overall models like logistic regression.

Result: Shows that explanatory instability is appropriate when forecast uncertainty is high, and that methods claiming to be explainable everywhere (like ReLU networks) have only illusory explainability because forecast uncertainty at segment boundaries is too high to be useful.

Conclusion: The correct approach is to first determine forecast usability, then seek explanations only when forecasts are usable. Explanatory instability near decision boundaries reflects high forecast uncertainty, not a flaw in explanation methods.

Abstract: For applications of machine learning in critical decisions, explainability is a primary concern, and often a regulatory requirement. Local linear methods for generating explanations, such as LIME and SHAP, have been criticized for being unstable near decision boundaries. In this paper, we explain that such concerns reflect a misunderstanding of the problem. The forecast uncertainty is high at decision boundaries, so consequently, the explanatory instability is high. The correct approach is to change the sequence of events and questions being asked. Nonlinear models can be highly predictive in some regions while having little or no predictability in others. Therefore, the first question is whether a usable forecast exists. When there is a forecast with low enough uncertainty to be useful, an explanation can be sought via a local linear approximation. In such cases, the explanatory instability is correspondingly low. When no usable forecast exists, the decision must fall to a simpler overall model such as traditional logistic regression. Additionally, these results show that some methods that purport to be explainable everywhere, such as ReLU networks or any piecewise linear model, have only an illusory explainability, because the forecast uncertainty at the segment boundaries is too high to be useful. Explaining an unusable forecast is pointless.

[987] GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models

Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren

Main category: cs.LG

TL;DR: GEPC is a training-free OOD detection method for diffusion models that measures equivariance consistency of learned scores under finite groups, detecting OOD samples when score transformations break expected symmetries.

Details

Motivation: Most diffusion-based OOD detectors focus on score magnitude or local geometry but ignore equivariance properties. Since diffusion models often inherit approximate equivariances from ID data and convolutional backbones, breaking of these symmetries can signal OOD samples even when score magnitude remains unchanged.

Method: GEPC measures how consistently learned scores transform under finite groups (flips, rotations, circular shifts). It computes an equivariance-residual functional averaged over group transformations, requiring only score evaluations without additional training. The method produces interpretable equivariance-breaking maps showing where symmetries are violated.

Result: GEPC achieves competitive or improved AUROC compared to recent diffusion-based OOD detection baselines on standard image benchmark datasets. On high-resolution synthetic aperture radar imagery, it yields strong target-background separation and produces visually interpretable equivariance-breaking maps that highlight anomalies.

Conclusion: Equivariance consistency provides a powerful signal for OOD detection in diffusion models. GEPC offers a computationally lightweight, training-free approach that leverages symmetry breaking to detect anomalies, with applications extending to specialized domains like radar imagery analysis.

Abstract: Diffusion models learn a time-indexed score field $\mathbf{s}_θ(\mathbf{x}_t,t)$ that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group $\mathcal{G}$, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance-residual functional over $\mathcal{G}$, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps. Code is available at https://github.com/RouzAY/gepc-diffusion/.

[988] Reducing Memorisation in Generative Models via Riemannian Bayesian Inference

Johanna Marie Gegenfurtner, Albert Kjøller Jacobsen, Naima Elosegui Borras, Alejandro Valverde Mahou, Georgios Arvanitidis

Main category: cs.LG

TL;DR: Bayesian approach to reduce memorization in flow matching/diffusion models using Riemannian geometry in parameter space

Details

Motivation: Modern generative models struggle with balancing memorization and generalization; need better ways to capture data distribution variability while reducing memorization

Method: Bayesian perspective focusing on parameter space, constructing predictive posterior using Riemannian metric to capture loss geometry, flexible approximate posterior adapting to local loss landscape structure

Result: Empirical demonstration of reduced memorization while preserving generalization; theoretical analysis explaining findings

Conclusion: Considering loss geometry enables effective use of parameter space for complex high-dimensional generative models

Abstract: Modern generative models can produce realistic samples, however, balancing memorisation and generalisation remains an open problem. We approach this challenge from a Bayesian perspective by focusing on the parameter space of flow matching and diffusion models and constructing a predictive posterior that better captures the variability of the data distribution. In particular, we capture the geometry of the loss using a Riemannian metric and leverage a flexible approximate posterior that adapts to the local structure of the loss landscape. This approach allows us to sample generative models that resemble the original model, but exhibit reduced memorisation. Empirically, we demonstrate that the proposed approach reduces memorisation while preserving generalisation. Further, we provide a theoretical analysis of our method, which explains our findings. Overall, our work illustrates how considering the geometry of the loss enables effective use of the parameter space, even for complex high-dimensional generative models.

[989] Reducing Class-Wise Performance Disparity via Margin Regularization

Beier Zhu, Kesen Zhao, Jiequan Cui, Qianru Sun, Yuan Zhou, Xun Yang, Hanwang Zhang

Main category: cs.LG

TL;DR: MR² is a margin regularization method that reduces performance disparities between easy and hard classes in classification by dynamically adjusting margins in logit and representation spaces based on per-class feature variability.

Details

Motivation: Deep neural networks often show substantial class-wise accuracy disparities even on balanced data, raising reliability concerns. While empirical remedies exist, theoretical understanding of these disparities is limited, motivating a principled approach to reduce performance gaps.

Method: MR² introduces a theoretically principled regularization that dynamically adjusts margins in both logit and representation spaces. It optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness, guided by a margin-based, class-sensitive generalization bound.

Result: Experiments on seven datasets including ImageNet with diverse pre-trained backbones (MAE, MoCov2, CLIP) show MR² improves overall accuracy while significantly boosting hard class performance without trading off easy classes, effectively reducing performance disparity.

Conclusion: MR² provides a theoretically grounded approach to reduce class-wise performance disparities in classification by dynamically adjusting margins based on feature variability, offering a practical solution for more reliable neural network deployment.

Abstract: Deep neural networks often exhibit substantial disparities in class-wise accuracy, even when trained on class-balanced data, posing concerns for reliable deployment. While prior efforts have explored empirical remedies, a theoretical understanding of such performance disparities in classification remains limited. In this work, we present Margin Regularization for Performance Disparity Reduction (MR$^2$), a theoretically principled regularization for classification by dynamically adjusting margins in both the logit and representation spaces. Our analysis establishes a margin-based, class-sensitive generalization bound that reveals how per-class feature variability contributes to error, motivating the use of larger margins for hard classes. Guided by this insight, MR$^2$ optimizes per-class logit margins proportional to feature spread and penalizes excessive representation margins to enhance intra-class compactness. Experiments on seven datasets, including ImageNet, and diverse pre-trained backbones (MAE, MoCov2, CLIP) demonstrate that MR$^2$ not only improves overall accuracy but also significantly boosts hard class performance without trading off easy classes, thus reducing performance disparity. Code is available at: https://github.com/BeierZhu/MR2

[990] Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity

Jordan Levy, Paul Saves, Moncef Garouani, Nicolas Verstaevel, Benoit Gaudou

Main category: cs.LG

TL;DR: Proposes using SHAP explanations to measure similarity between anomaly detectors, selecting diverse models for ensembles based on their decision mechanisms rather than just output scores.

Details

Motivation: Unsupervised anomaly detection ensembles often suffer from redundancy because detectors rely on similar decision cues, limiting complementarity. Current ensemble methods lack systematic ways to identify truly diverse models that capture different types of irregularities.

Method: Uses SHapley Additive exPlanations (SHAP) to quantify how each anomaly detector attributes importance to input features. Creates attribution profiles from these explanations and measures similarity between detectors based on their explanation patterns. Selects ensemble members that have divergent explanations while maintaining individual model quality.

Result: Shows that detectors with similar SHAP explanations produce correlated anomaly scores and identify overlapping anomalies. Explanation divergence reliably indicates complementary detection behavior. Ensembles selected based on explanation diversity outperform those selected using traditional methods.

Conclusion: Explanation-driven metrics offer a novel criterion for ensemble selection that goes beyond raw outputs. While diversity is important, individual model quality remains essential. Combining explanation diversity with model performance leads to more effective unsupervised anomaly detection ensembles.

Abstract: Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.

[991] Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

Chen Liu, Xingzhi Sun, Xi Xiao, Alexandre Van Tassel, Ke Xu, Kristof Reimann, Danqi Liao, Mark Gerstein, Tianyang Wang, Xiao Wang, Smita Krishnaswamy

Main category: cs.LG

TL;DR: Small language models suffer from “embedding condensation” where token embeddings collapse into narrow subspaces, unlike larger models. A dispersion loss during training mitigates this issue and improves performance.

Details

Motivation: To understand why larger LLMs perform better and to improve smaller models without increasing parameters, by studying representational differences and addressing embedding condensation phenomenon.

Method: Systematic analysis of embedding condensation across Transformer families, observation that knowledge distillation doesn’t reliably mitigate it, and formulation of a dispersion loss that explicitly encourages embedding dispersion during training.

Result: Small models (GPT2, Qwen3-0.6B) show severe condensation while larger models (GPT2-xl, Qwen3-32B) are more resistant. The dispersion loss mitigates condensation, recovers dispersion patterns of larger models, and yields performance gains across 10 benchmarks.

Conclusion: Embedding condensation is a key representational limitation in small models that can be addressed through explicit dispersion regularization, offering a principled path to improve smaller Transformers without additional parameters.

Abstract: Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term $\textbf{embedding condensation}$, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as $\texttt{GPT2}$ and $\texttt{Qwen3-0.6B}$ exhibit severe condensation, whereas the larger models such as $\texttt{GPT2-xl}$ and $\texttt{Qwen3-32B}$ are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably mitigated by knowledge distillation from larger models. To fight against it, we formulate a dispersion loss that explicitly encourages embedding dispersion during training. Experiments demonstrate that it mitigates condensation, recovers dispersion patterns seen in larger models, and yields performance gains across 10 benchmarks. We believe this work offers a principled path toward improving smaller Transformers without additional parameters.

[992] GRIP2: A Robust and Powerful Deep Knockoff Method for Feature Selection

Bob Junyi Zou, Lu Tian

Main category: cs.LG

TL;DR: GRIP2 is a deep knockoff feature selection method that integrates feature activity over a 2D regularization surface to improve robustness in high-correlation, low-signal regimes while controlling false discovery rates.

Details

Motivation: Feature selection in nonlinear, highly correlated, low signal-to-noise regimes is challenging, especially for deep learning methods. Existing approaches struggle with false discovery rate control and robustness to correlation and noise.

Method: GRIP2 integrates first-layer feature activity over a 2D regularization surface controlling sparsity strength and geometry. Uses block-stochastic sampling to approximate the surface integral in a single training run, producing antisymmetric statistics for FDR control.

Result: GRIP2 shows improved robustness to feature correlation and noise compared to standard deep learning feature selectors. Maintains high power and stability in challenging regimes. On HIV drug resistance data, recovers known mutations better than linear baselines.

Conclusion: GRIP2 provides reliable feature selection with FDR control in challenging nonlinear, correlated, low-signal regimes where deep learning methods are most needed, demonstrating practical utility on real-world biological data.

Abstract: Identifying truly predictive covariates while strictly controlling false discoveries remains a fundamental challenge in nonlinear, highly correlated, and low signal-to-noise regimes, where deep learning based feature selection methods are most attractive. We propose Group Regularization Importance Persistence in 2 Dimensions (GRIP2), a deep knockoff feature importance statistic that integrates first-layer feature activity over a two-dimensional regularization surface controlling both sparsity strength and sparsification geometry. To approximate this surface integral in a single training run, we introduce efficient block-stochastic sampling, which aggregates feature activity magnitudes across diverse regularization regimes along the optimization trajectory. The resulting statistics are antisymmetric by construction, ensuring finite-sample FDR control. In extensive experiments on synthetic and semi-real data, GRIP2 demonstrates improved robustness to feature correlation and noise level: in high correlation and low signal-to-noise ratio regimes where standard deep learning based feature selectors may struggle, our method retains high power and stability. Finally, on real-world HIV drug resistance data, GRIP2 recovers known resistance-associated mutations with power better than established linear baselines, confirming its reliability in practice.

[993] Green-NAS: A Global-Scale Multi-Objective Neural Architecture Search for Robust and Efficient Edge-Native Weather Forecasting

Md Muhtasim Munif Fahim, Soyda Humyra Yesmin, Saiful Islam, Md. Palash Bin Faruque, Md. A. Salam, Md. Mahfuz Uddin, Samiul Islam, Tofayel Ahmed, Md. Binyamin, Md. Rezaul Karim

Main category: cs.LG

TL;DR: Green-NAS is a multi-objective neural architecture search framework for low-resource weather forecasting that optimizes for both accuracy and computational efficiency while minimizing energy costs and carbon footprint.

Details

Motivation: The paper addresses the need for sustainable AI deployment in weather forecasting by developing a framework that adheres to 'Green AI' principles, focusing on minimizing computational energy costs and carbon footprints rather than just maximizing performance.

Method: A multi-objective neural architecture search (NAS) framework that simultaneously optimizes for model accuracy and efficiency, finding lightweight models with very few parameters. The approach includes transfer learning to improve forecasting accuracy when limited historical data is available.

Result: The best model (Green-NAS-A) achieved RMSE of 0.0988 (within 1.4% of manually tuned baseline) using only 153k parameters, which is 239 times fewer than other global weather forecasting models like GraphCast. Transfer learning improved accuracy by approximately 5.2% compared to training separate models for each city.

Conclusion: Green-NAS demonstrates that sustainable, efficient weather forecasting models can achieve competitive accuracy with dramatically reduced computational resources, enabling deployment in low-resource environments while minimizing environmental impact.

Abstract: We introduce Green-NAS, a multi-objective NAS (neural architecture search) framework designed for low-resource environments using weather forecasting as a case study. By adhering to ‘Green AI’ principles, the framework explicitly minimizes computational energy costs and carbon footprints, prioritizing sustainable deployment over raw computational scale. The Green-NAS architecture search method is optimized for both model accuracy and efficiency to find lightweight models with high accuracy and very few model parameters; this is accomplished through an optimization process that simultaneously optimizes multiple objectives. Our best-performing model, Green-NAS-A, achieved an RMSE of 0.0988 (i.e., within 1.4% of our manually tuned baseline) using only 153k model parameters, which is 239 times fewer than other globally applied weather forecasting models, such as GraphCast. In addition, we also describe how the use of transfer learning will improve the weather forecasting accuracy by approximately 5.2%, in comparison to a naive approach of training a new model for each city, when there is limited historical weather data available for that city.

[994] TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models

Shreshth Saini, Avinab Saha, Balu Adsumilli, Neil Birkbeck, Yilin Wang, Alan C. Bovik

Main category: cs.LG

TL;DR: BoE Steering uses gradient-guided inference with Token Influence Scores to improve masked diffusion model sampling by approximating infinite-horizon lookahead via single backward pass, avoiding trajectory lock-in while maintaining efficiency.

Details

Motivation: Current masked diffusion model sampling methods use simple confidence-based heuristics that ignore long-term impacts of local decisions, causing trajectory lock-in where early hallucinations lead to global incoherence. Search-based methods help but are computationally expensive (O(K) forward passes per step).

Method: Proposes Backward-on-Entropy (BoE) Steering framework that derives Token Influence Score (TIS) from first-order expansion of trajectory cost functional. Uses gradient of future entropy with respect to input embeddings as optimal control signal. Introduces ActiveQueryAttention sparse adjoint primitive to reduce backward pass complexity for scalability.

Result: BoE achieves superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating gradient-guided steering offers mathematically principled and efficient path to robust non-autoregressive generation.

Conclusion: Gradient-guided inference with BoE Steering provides efficient solution to trajectory lock-in in masked diffusion models, balancing computational efficiency with generation quality through principled mathematical framework.

Abstract: Masked Diffusion Models (MDMs) have emerged as a promising non-autoregressive paradigm for generative tasks, offering parallel decoding and bidirectional context utilization. However, current sampling methods rely on simple confidence-based heuristics that ignore the long-term impact of local decisions, leading to trajectory lock-in where early hallucinations cascade into global incoherence. While search-based methods mitigate this, they incur prohibitive computational costs ($O(K)$ forward passes per step). In this work, we propose Backward-on-Entropy (BoE) Steering, a gradient-guided inference framework that approximates infinite-horizon lookahead via a single backward pass. We formally derive the Token Influence Score (TIS) from a first-order expansion of the trajectory cost functional, proving that the gradient of future entropy with respect to input embeddings serves as an optimal control signal for minimizing uncertainty. To ensure scalability, we introduce \texttt{ActiveQueryAttention}, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity. BoE achieves a superior Pareto frontier for inference-time scaling compared to existing unmasking methods, demonstrating that gradient-guided steering offers a mathematically principled and efficient path to robust non-autoregressive generation. We will release the code.

[995] Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning

Naman Saxena, Vaneet Aggarwal

Main category: cs.LG

TL;DR: Theoretical analysis of bilevel RL algorithms with constrained optimization, focusing on sample complexity analysis for policy gradient methods with non-smooth objectives.

Details

Motivation: Bilevel RL problems appear in meta-learning, hierarchical learning, and RL from human feedback, but theoretical analysis of bilevel RL algorithms has been understudied despite empirical progress.

Method: Proposes Constrained Bilevel Subgradient Optimization (CBSO) algorithm using penalty-based objective function to handle constraints, avoiding primal-dual gap and hyper-gradient issues. Analyzes non-smooth optimization using Moreau envelope for parameterized policy gradient RL.

Result: Achieves iteration complexity of O(ε⁻²) and sample complexity of Õ(ε⁻⁴) for the proposed algorithm.

Conclusion: First theoretical analysis of parameterized policy gradient RL with non-smooth objectives using Moreau envelope, providing sample complexity bounds for constrained bilevel RL.

Abstract: Several important problem settings within the literature of reinforcement learning (RL), such as meta-learning, hierarchical learning, and RL from human feedback (RL-HF), can be modelled as bilevel RL problems. A lot has been achieved in these domains empirically; however, the theoretical analysis of bilevel RL algorithms hasn’t received a lot of attention. In this work, we analyse the sample complexity of a constrained bilevel RL algorithm, building on the progress in the unconstrained setting. We obtain an iteration complexity of $O(ε^{-2})$ and sample complexity of $\tilde{O}(ε^{-4})$ for our proposed algorithm, Constrained Bilevel Subgradient Optimization (CBSO). We use a penalty-based objective function to avoid the issue of primal-dual gap and hyper-gradient in the context of a constrained bilevel problem setting. The penalty-based formulation to handle constraints requires analysis of non-smooth optimization. We are the first ones to analyse the generally parameterized policy gradient-based RL algorithm with a non-smooth objective function using the Moreau envelope.

[996] Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective

Shaorong Zhang, Longxuan Yu, Rob Brekelmans, Luhan Tang, Salman Asif, Greg Ver Steeg

Main category: cs.LG

TL;DR: The paper provides an information-theoretic framework to analyze failure modes in Masked Diffusion Models, focusing on order sensitivity and parallelization bias, with insights on Easy-First decoding benefits, sampling errors in parallel decoding, and verification costs.

Details

Motivation: Masked Diffusion Models accelerate inference but their theoretical mechanisms for generation order and parallelization risks are under-explored. The paper aims to understand the fundamental sources of failure in these models.

Method: The authors develop a unified information-theoretic framework to decouple and analyze two failure sources: order sensitivity and parallelization bias. They use theoretical analysis and validate with experiments on controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning.

Result: Three key insights: (1) Easy-First decoding benefits increase with model error; (2) factorized parallel decoding introduces sampling errors leading to large Reverse KL divergence; (3) verification eliminates sampling error but has exponential cost, while heuristics like remasking cannot guarantee correctness.

Conclusion: The paper provides a theoretical framework for understanding failure modes in Masked Diffusion Models, highlighting trade-offs between parallelization efficiency and distributional correctness, with implications for designing more robust generation systems.

Abstract: Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. However, the theoretical mechanisms governing generation order and the risks inherent in parallelization remain under-explored. In this work, we provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias. Our analysis yields three key insights: (1) The benefits of Easy-First decoding (prioritizing low-entropy tokens) are magnified as model error increases; (2) factorized parallel decoding introduces intrinsic sampling errors that can lead to arbitrary large Reverse KL divergence, capturing “incoherence” failures that standard Forward KL metrics overlook; and (3) while verification can eliminate sampling error, it incurs an exponential cost governed by the total correlation within a block. Conversely, heuristics like remasking, though computationally efficient, cannot guarantee distributional correctness. Experiments on a controlled Block-HMM and large-scale MDMs (LLaDA) for arithmetic reasoning validate our theoretical framework.

[997] Self-Attention at Constant Cost per Token via Symmetry-Aware Taylor Approximation

Franz A. Heinsen, Leo Kozachkov

Main category: cs.LG

TL;DR: Efficient self-attention computation with constant cost per token using symmetric tensor product chains and polynomial kernel features

Details

Motivation: Standard self-attention in Transformers has quadratic computational costs that scale with context length, creating unsustainable infrastructure and energy demands for large-scale models

Method: Decompose conventional self-attention’s Taylor expansion into symmetric tensor product chains, exploit symmetry to map queries/keys to minimal polynomial-kernel feature basis with feed-forward transformations

Result: Achieves constant cost per token (inverse to head size), orders-of-magnitude reductions in memory/computation, enables unbounded token generation at fixed cost

Conclusion: Enables efficient large-scale Transformer models with substantially reduced infrastructure/energy demands; mathematical techniques have independent interest

Abstract: The most widely used artificial intelligence (AI) models today are Transformers employing self-attention. In its standard form, self-attention incurs costs that increase with context length, driving demand for storage, compute, and energy that is now outstripping society’s ability to provide them. To help address this issue, we show that self-attention is efficiently computable to arbitrary precision with constant cost per token, achieving orders-of-magnitude reductions in memory use and computation. We derive our formulation by decomposing the conventional formulation’s Taylor expansion into expressions over symmetric chains of tensor products. We exploit their symmetry to obtain feed-forward transformations that efficiently map queries and keys to coordinates in a minimal polynomial-kernel feature basis. Notably, cost is fixed inversely in proportion to head size, enabling application over a greater number of heads per token than otherwise feasible. We implement our formulation and empirically validate its correctness. Our work enables unbounded token generation at modest fixed cost, substantially reducing the infrastructure and energy demands of large-scale Transformer models. The mathematical techniques we introduce are of independent interest.

[998] From Observations to States: Latent Time Series Forecasting

Jie Yang, Yifan Hu, Yuante Li, Kexin Zhang, Kaize Ding, Philip S. Yu

Main category: cs.LG

TL;DR: LatentTSF addresses latent chaos in time series forecasting by shifting from observation regression to latent state prediction using an AutoEncoder framework.

Details

Motivation: The paper identifies a "Latent Chaos" paradox where accurate time series forecasting models learn temporally disordered latent representations due to the dominant observation-space forecasting paradigm that minimizes point-wise errors on noisy data, encouraging shortcut solutions instead of recovering underlying system dynamics.

Method: Proposes Latent Time Series Forecasting (LatentTSF) paradigm using an AutoEncoder to project observations into higher-dimensional latent state space, then performing forecasting entirely in latent space to focus on learning structured temporal dynamics.

Result: Theoretical analysis shows latent objectives implicitly maximize mutual information between predicted latent states and ground-truth states/observations. Extensive experiments on benchmarks confirm LatentTSF mitigates latent chaos and achieves superior performance.

Conclusion: LatentTSF effectively addresses the representation paradox in time series forecasting by shifting to latent state prediction, enabling better recovery of underlying system dynamics and improved forecasting performance.

Abstract: Deep learning has achieved strong performance in Time Series Forecasting (TSF). However, we identify a critical representation paradox, termed Latent Chaos: models with accurate predictions often learn latent representations that are temporally disordered and lack continuity. We attribute this phenomenon to the dominant observation-space forecasting paradigm. Most TSF models minimize point-wise errors on noisy and partially observed data, which encourages shortcut solutions instead of the recovery of underlying system dynamics. To address this issue, we propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction. Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher-dimensional latent state space. This expanded representation aims to capture underlying system variables and impose a smoother temporal structure. Forecasting is then performed entirely in the latent space, allowing the model to focus on learning structured temporal dynamics. Theoretical analysis demonstrates that our proposed latent objectives implicitly maximize mutual information between predicted latent states and ground-truth states and observations. Extensive experiments on widely-used benchmarks confirm that LatentTSF effectively mitigates latent chaos, achieving superior performance. Our code is available in https://github.com/Muyiiiii/LatentTSF.

[999] Agentic Framework for Epidemiological Modeling

Rituparna Datta, Zihan Guan, Baltazar Espinoza, Yiqi Su, Priya Pitre, Srini Venkatramanan, Naren Ramakrishnan, Anil Vullikanti

Main category: cs.LG

TL;DR: EPIAGENT is an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators through iterative program synthesis and explicit epidemiological flow graphs.

Details

Motivation: Traditional epidemic modeling approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve, creating inefficiencies and limiting adaptability to changing public health needs.

Method: Models disease progression as iterative program synthesis problem with explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure, enabling modular correctness checks before code generation. Verified flow graphs are compiled into mechanistic models supporting interpretable parameter learning under constraints.

Result: EPIAGENT captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. The agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models.

Conclusion: EPIAGENT demonstrates that agentic frameworks can automate epidemiological simulator development while maintaining interpretability and correctness, mimicking professional expert workflows to accelerate model convergence.

Abstract: Epidemic modeling is essential for public health planning, yet traditional approaches rely on fixed model classes that require manual redesign as pathogens, policies, and scenario assumptions evolve. We introduce EPIAGENT, an agentic framework that automatically synthesizes, calibrates, verifies, and refines epidemiological simulators by modeling disease progression as an iterative program synthesis problem. A central design choice is an explicit epidemiological flow graph intermediate representation that links scenario specifications to model structure and enables strong, modular correctness checks before code is generated. Verified flow graphs are then compiled into mechanistic models supporting interpretable parameter learning under physical and epidemiological constraints. Evaluation on epidemiological scenario case studies demonstrates that EPIAGENT captures complex growth dynamics and produces epidemiologically consistent counterfactual projections across varying vaccination and immune escape assumptions. Our results show that the agentic feedback loop prevents degeneration and significantly accelerates convergence toward valid models by mimicking professional expert workflows.

[1000] Neural Ising Machines via Unrolling and Zeroth-Order Training

Sam Reifenstein, Timothee Leleu

Main category: cs.LG

TL;DR: NPIM: Neural network parameterized Ising machine that learns node-wise update rules for NP-hard Ising and Max-Cut optimization through compact MLP parameterization and zeroth-order optimization.

Details

Motivation: To develop a data-driven approach for solving NP-hard optimization problems (Ising and Max-Cut) that can learn effective algorithmic structure from data rather than relying on hand-designed heuristics, overcoming gradient instability issues in recurrent Ising-machine dynamics.

Method: Proposes NPIM which learns a shared, node-wise update rule mapping local interaction fields to spin updates using a compact multilayer perceptron with few parameters. Training uses zeroth-order optimization to avoid unstable gradients from backpropagation through long recurrent dynamics.

Result: NPIM recovers effective algorithmic structure including momentum-like behavior and time-varying schedules, achieving competitive solution quality and time-to-solution on standard Ising and neural combinatorial optimization benchmarks compared to learning-based methods and classical Ising-machine heuristics.

Conclusion: NPIM demonstrates that learned dynamics with low parameter counts can effectively solve complex optimization problems, recovering sophisticated algorithmic behaviors and achieving competitive performance against established methods.

Abstract: We propose a data-driven heuristic for NP-hard Ising and Max-Cut optimization that learns the update rule of an iterative dynamical system. The method learns a shared, node-wise update rule that maps local interaction fields to spin updates, parameterized by a compact multilayer perceptron with a small number of parameters. Training is performed using a zeroth-order optimizer, since backpropagation through long, recurrent Ising-machine dynamics leads to unstable and poorly informative gradients. We call this approach a neural network parameterized Ising machine (NPIM). Despite its low parameter count, the learned dynamics recover effective algorithmic structure, including momentum-like behavior and time-varying schedules, enabling efficient search in highly non-convex energy landscapes. Across standard Ising and neural combinatorial optimization benchmarks, NPIM achieves competitive solution quality and time-to-solution relative to recent learning-based methods and strong classical Ising-machine heuristics.

[1001] Beyond the Loss Curve: Scaling Laws, Active Learning, and the Limits of Learning from Exact Posteriors

Arian Khorasani, Nathaniel Chen, Yug D Oswal, Akshat Santhana Gopalan, Egemen Kolemen, Ravid Shwartz-Ziv

Main category: cs.LG

TL;DR: Normalizing flows as oracles enable exact posterior computation for investigating scaling laws, learning limits, soft labels, distribution shift, and active learning in neural networks.

Details

Motivation: Standard benchmarks cannot measure how close neural networks are to optimal performance because they lack access to the true posterior distribution p(y|x). The paper aims to establish a framework using class-conditional normalizing flows as oracles to compute exact posteriors on realistic images.

Method: Use class-conditional normalizing flows as oracles to make exact posterior distributions tractable on realistic image datasets (AFHQ, ImageNet). This enables five investigations: 1) Scaling laws analysis decomposing prediction error, 2) Measuring limits of learning via aleatoric floors, 3) Training with exact posteriors as soft labels, 4) Computing exact KL divergence for distribution shift analysis, and 5) Using exact epistemic uncertainty for active learning.

Result: The framework reveals that: 1) Epistemic error follows power laws in dataset size even when total loss plateaus; 2) Architectures differ in approaching aleatoric floors (ResNets show clean scaling while Vision Transformers stall in low-data regimes); 3) Training with exact posteriors outperforms hard labels and yields near-perfect calibration; 4) Shift type matters more than magnitude (class imbalance barely affects accuracy where input noise causes catastrophic degradation); 5) Exact epistemic uncertainty improves active learning sample efficiency.

Conclusion: Standard evaluation metrics hide ongoing learning, mask architectural differences, and cannot properly diagnose distribution shift. The oracle framework provides exact posterior access enabling deeper understanding of neural network performance limits and learning dynamics.

Abstract: How close are neural networks to the best they could possibly do? Standard benchmarks cannot answer this because they lack access to the true posterior p(y|x). We use class-conditional normalizing flows as oracles that make exact posteriors tractable on realistic images (AFHQ, ImageNet). This enables five lines of investigation. Scaling laws: Prediction error decomposes into irreducible aleatoric uncertainty and reducible epistemic error; the epistemic component follows a power law in dataset size, continuing to shrink even when total loss plateaus. Limits of learning: The aleatoric floor is exactly measurable, and architectures differ markedly in how they approach it: ResNets exhibit clean power-law scaling while Vision Transformers stall in low-data regimes. Soft labels: Oracle posteriors contain learnable structure beyond class labels: training with exact posteriors outperforms hard labels and yields near-perfect calibration. Distribution shift: The oracle computes exact KL divergence of controlled perturbations, revealing that shift type matters more than shift magnitude: class imbalance barely affects accuracy at divergence values where input noise causes catastrophic degradation. Active learning: Exact epistemic uncertainty distinguishes genuinely informative samples from inherently ambiguous ones, improving sample efficiency. Our framework reveals that standard metrics hide ongoing learning, mask architectural differences, and cannot diagnose the nature of distribution shift.

[1002] Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection

Kunal Mukherjee, Zulfikar Alom, Tran Gia Bao Ngo, Cuneyt Gurcan Akcora, Murat Kantarcioglu

Main category: cs.LG

TL;DR: BOCLOAK is a framework for evaluating robustness of GNN-based social bot detectors using optimal transport to generate realistic adversarial attacks under spatio-temporal constraints.

Details

Motivation: Existing GNN-based bot detectors lack evaluation under realistic attack scenarios with real-world constraints like temporal and domain-specific limitations, creating a need for robust evaluation methods.

Method: BOCLOAK constructs probability measures over spatio-temporal neighbor features, learns optimal transport geometry to separate human/bot behaviors, and decodes transport plans into sparse, plausible edge edits that obey real-world constraints.

Result: Achieves up to 80.13% higher attack success rates while using 99.80% less GPU memory compared to baselines across three datasets, five bot detectors, and three defenses.

Conclusion: Optimal transport provides a lightweight, principled framework for bridging the gap between adversarial attacks and real-world bot detection under realistic constraints.

Abstract: The rise of bot accounts on social media poses significant risks to public discourse. To address this threat, modern bot detectors increasingly rely on Graph Neural Networks (GNNs). However, the effectiveness of these GNN-based detectors in real-world settings remains poorly understood. In practice, attackers continuously adapt their strategies as well as must operate under domain-specific and temporal constraints, which can fundamentally limit the applicability of existing attack methods. As a result, there is a critical need for robust GNN-based bot detection methods under realistic, constraint-aware attack scenarios. To address this gap, we introduce BOCLOAK to systematically evaluate the robustness of GNN-based social bot detection via both edge editing and node injection adversarial attacks under realistic constraints. BOCLOAK constructs a probability measure over spatio-temporal neighbor features and learns an optimal transport geometry that separates human and bot behaviors. It then decodes transport plans into sparse, plausible edge edits that evade detection while obeying real-world constraints. We evaluate BOCLOAK across three social bot datasets, five state-of-the-art bot detectors, three adversarial defenses, and compare it against four leading graph adversarial attack baselines. BOCLOAK achieves up to 80.13% higher attack success rates while using 99.80% less GPU memory under realistic real-world constraints. Most importantly, BOCLOAK shows that optimal transport provides a lightweight, principled framework for bridging the gap between adversarial attacks and real-world bot detection.

[1003] Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference

Nikhil Gopal, Kostis Kaffes

Main category: cs.LG

TL;DR: Harvest is a GPU cache management framework that uses peer-to-peer GPU interconnects to dynamically place model weights and KV cache in unused GPU memory, reducing memory pressure and improving inference throughput.

Details

Motivation: LLM inference is increasingly memory-bound due to growing model sizes and KV cache growth during autoregressive decoding. Existing offloading solutions to host memory suffer from PCIe bandwidth limitations, creating a need for more efficient memory management.

Method: Harvest opportunistically uses high-bandwidth peer-to-peer GPU interconnects to create a transient cache tier in unused GPU memory across multiple GPUs. It dynamically places model weights and KV cache entries in available GPU memory while preserving correctness.

Result: Harvest achieves more than 2x throughput speedup by accelerating retrieval of expert layer weights and KV cache entries through reduced data movement overhead.

Conclusion: The Harvest framework demonstrates that exploiting peer-to-peer GPU interconnects for dynamic memory management can significantly improve LLM inference performance by better utilizing available GPU memory resources.

Abstract: Large Language Model (LLM) inference is increasingly constrained by GPU memory capacity rather than compute throughput, driven by growing model sizes and the linear growth of the key-value (KV) cache during autoregressive decoding. Existing approaches mitigate memory pressure by offloading model state and KV tensors to host memory, but incur substantial latency due to limited PCIe bandwidth. We present Harvest, an opportunistic GPU cache management framework that exploits high-bandwidth peer-to-peer GPU interconnects to dynamically place model weights and KV cache in unused GPU memory. Harvest treats peer GPU memory as a transient cache tier, preserving correctness while reducing data movement overhead under dynamic memory availability. We demonstrate significant throughput speedup of more than 2 times by using Harvest to accelerate the retrieval of two widely-used inference components: expert layer weights and KV cache entries.

[1004] In-Run Data Shapley for Adam Optimizer

Meng Ding, Zeqing Zhang, Di Wang, Lijie Hu

Main category: cs.LG

TL;DR: Proposes Adam-Aware In-Run Data Shapley to address optimizer-dependent data attribution, showing SGD-based methods fail for Adam and introducing scalable computation with near-perfect fidelity.

Details

Motivation: Current data attribution methods rely on SGD's linear structure and fail for adaptive optimizers like Adam, which are widely used in modern ML pipelines. SGD-based proxies diverge significantly from true contributions under Adam, making them ineffective for practical applications.

Method: Proposes Adam-Aware In-Run Data Shapley with: 1) closed-form approximation restoring additivity by redefining utility under fixed-state assumption, and 2) Linearized Ghost Approximation that linearizes variance-dependent scaling term to compute pairwise gradient dot-products without materializing per-sample gradients.

Result: Achieves near-perfect fidelity to ground-truth marginal contributions (Pearson R > 0.99) while retaining ~95% of standard training throughput. Significantly outperforms SGD-based baselines in data attribution downstream tasks.

Conclusion: Data attribution is optimizer-dependent, and the proposed method provides accurate, scalable attribution for Adam optimizer, bridging the gap for modern training pipelines.

Abstract: Reliable data attribution is essential for mitigating bias and reducing computational waste in modern machine learning, with the Shapley value serving as the theoretical gold standard. While recent “In-Run” methods bypass the prohibitive cost of retraining by estimating contributions dynamically, they heavily rely on the linear structure of Stochastic Gradient Descent (SGD) and fail to capture the complex dynamics of adaptive optimizers like Adam. In this work, we demonstrate that data attribution is inherently optimizer-dependent: we show that SGD-based proxies diverge significantly from true contributions under Adam (Pearson $R \approx 0.11$), rendering them ineffective for modern training pipelines. To bridge this gap, we propose Adam-Aware In-Run Data Shapley. We derive a closed-form approximation that restores additivity by redefining utility under a fixed-state assumption and enable scalable computation via a novel Linearized Ghost Approximation. This technique linearizes the variance-dependent scaling term, allowing us to compute pairwise gradient dot-products without materializing per-sample gradients. Extensive experiments show that our method achieves near-perfect fidelity to ground-truth marginal contributions ($R > 0.99$) while retaining $\sim$95% of standard training throughput. Furthermore, our Adam-aware attribution significantly outperforms SGD-based baselines in data attribution downstream tasks.

[1005] Prototype-based Explainable Neural Networks with Channel-specific Reasoning for Geospatial Learning Tasks

Anushka Narayanan, Karianne J. Bergen

Main category: cs.LG

TL;DR: A prototype-based explainable AI approach specifically designed for multi-channel geospatial data, enabling channel-specific prototypes to enhance interpretability while maintaining performance comparable to standard neural networks.

Details

Motivation: Existing prototype-based XAI methods are designed for standard RGB images and not optimized for geoscientific data with distinct variable-specific channels. There's a need for interpretable models that can handle multi-channel geospatial data while providing transparent explanations.

Method: Developed a prototype-based XAI approach tailored for multi-channel geospatial data where each channel represents distinct physical variables. The model identifies separate channel-specific prototypical characteristics from multiple training examples, allowing examination of how individual and combined features influence predictions.

Result: Demonstrated through two geoscientific case studies: (1) classification of Madden Julian Oscillation phases using multi-variable climate data, and (2) land-use classification from multispectral satellite imagery. The approach produces both local (instance-level) and global (model-level) explanations while achieving comparable performance to standard neural networks.

Conclusion: The channel-specific prototype approach enhances transparency and trustworthiness of ML models for geoscientific tasks by explicitly incorporating channel-prototypes into the prediction process, providing insights into feature-relevance across different data channels.

Abstract: Explainable AI (XAI) is essential for understanding machine learning (ML) decision-making and ensuring model trustworthiness in scientific applications. Prototype-based XAI methods offer an intrinsically interpretable alternative to post-hoc approaches which often yield inconsistent explanations. Prototype-based XAI methods make predictions based on the similarity between inputs and learned prototypes that represent typical characteristics of target classes. However, existing prototype-based models are primarily designed for standard RGB image data and are not optimized for the distinct, variable-specific channels commonly found in geoscientific image and raster datasets. In this study, we develop a prototype-based XAI approach tailored for multi-channel geospatial data, where each channel represents a distinct physical environmental variable or spectral channel. Our approach enables the model to identify separate, channel-specific prototypical characteristics sourced from multiple distinct training examples that inform how these features individually and in combination influence model prediction while achieving comparable performance to standard neural networks. We demonstrate this method through two geoscientific case studies: (1) classification of Madden Julian Oscillation phases using multi-variable climate data and (2) land-use classification from multispectral satellite imagery. This approach produces both local (instance-level) and global (model-level) explanations for providing insights into feature-relevance across channels. By explicitly incorporating channel-prototypes into the prediction process, we discuss how this approach enhances the transparency and trustworthiness of ML models for geoscientific learning tasks.

[1006] Efficient and accurate steering of Large Language Models through attention-guided feature learning

Parmida Davarmanesh, Ashia Wilson, Adityanarayanan Radhakrishnan

Main category: cs.LG

TL;DR: Attention-guided steering framework improves LLM concept manipulation by automatically selecting relevant tokens, handling feature heterogeneity, and identifying optimal layers, nearly doubling steerable concepts across models up to 70B parameters.

Details

Motivation: Existing steering methods for manipulating LLM internal activations are brittle and sensitive to algorithmic choices, with inconsistent steerability of semantic concepts. There's a need for more robust steering frameworks that can reliably guide LLM responses toward specific concepts.

Method: Proposes an attention-guided steering framework that addresses three core challenges: (1) automatic selection of relevant token embeddings for extracting concept-related features, (2) accounting for heterogeneity of concept-related features across LLM activations, and (3) identification of layers most relevant for steering.

Result: Across a benchmark of 512 semantic concepts, the framework substantially improved steering over previous state-of-the-art, nearly doubling the number of successfully steered concepts across various model architectures and sizes (up to 70 billion parameter models).

Conclusion: The attention-guided steering framework enables more effective manipulation of LLM internal representations and provides insights into concept-specific feature distribution across layers, opening avenues for efficient fine-tuning algorithms for industry-scale LLMs.

Abstract: Steering, or direct manipulation of internal activations to guide LLM responses toward specific semantic concepts, is emerging as a promising avenue for both understanding how semantic concepts are stored within LLMs and advancing LLM capabilities. Yet, existing steering methods are remarkably brittle, with seemingly non-steerable concepts becoming completely steerable based on subtle algorithmic choices in how concept-related features are extracted. In this work, we introduce an attention-guided steering framework that overcomes three core challenges associated with steering: (1) automatic selection of relevant token embeddings for extracting concept-related features; (2) accounting for heterogeneity of concept-related features across LLM activations; and (3) identification of layers most relevant for steering. Across a steering benchmark of 512 semantic concepts, our framework substantially improved steering over previous state-of-the-art (nearly doubling the number of successfully steered concepts) across model architectures and sizes (up to 70 billion parameter models). Furthermore, we use our framework to shed light on the distribution of concept-specific features across LLM layers. Overall, our framework opens further avenues for developing efficient, highly-scalable fine-tuning algorithms for industry-scale LLMs.

[1007] Adaptive Momentum and Nonlinear Damping for Neural Network Training

Aikaterini Karoni, Rajit Rajpal, Benedict Leimkuhler, Gabriel Stoltz

Main category: cs.LG

TL;DR: Continuous-time optimization scheme with adaptive momentum coefficients regulated by kinetic energy, introducing cubic damping to mSGD and Adam for improved stability and convergence.

Details

Motivation: To address optimization challenges in large-scale models where standard methods like mSGD struggle, particularly with stability-convergence trade-offs in training complex architectures like ViT, BERT, and GPT2.

Method: Proposes adaptive friction mechanism based on kinetic energy of each parameter, relating to cubic damping from structural dynamics. Augments continuous dynamics of mSGD and Adam with cubic damping term to create two specific optimization schemes.

Result: Methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. Theoretical results establish exponential convergence of proposed schemes.

Conclusion: The adaptive friction approach with cubic damping provides effective optimization for large-scale models, offering improved stability without sacrificing convergence speed.

Abstract: We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape curvature to maintain stability without sacrificing convergence speed. We demonstrate that our adaptive friction can be related to cubic damping, a suppression mechanism from structural dynamics. Furthermore, we introduce two specific optimization schemes by augmenting the continuous dynamics of mSGD and Adam with a cubic damping term. Empirically, our methods demonstrate robustness and match or outperform Adam on training ViT, BERT, and GPT2 tasks where mSGD typically struggles. We further provide theoretical results establishing the exponential convergence of the proposed schemes.

[1008] Planning with Language and Generative Models: Toward General Reward-Guided Wireless Network Design

Chenyang Yuan, Xiaoyuan Cheng

Main category: cs.LG

TL;DR: Diffusion-based generative inference with unified reward function outperforms LLMs for scalable indoor AP deployment planning across diverse floorplans.

Details

Motivation: Current AP deployment in wireless networks faces challenges with complex indoor geometries and signal propagation. LLMs show strong wireless domain knowledge but have high computational costs and limited scalability due to dependence on external verifiers.

Method: Proposes generative inference models guided by unified reward function capturing AP deployment objectives. Uses diffusion samplers that progressively improve sampling by smoothing and sharpening reward landscape, rather than iterative refinement.

Result: Diffusion samplers consistently outperform alternative generative approaches. Introduces large-scale real-world dataset for indoor AP deployment requiring over 50k CPU hours to train general reward functions. Shows effective in- and out-of-distribution generalization and robustness.

Conclusion: Diffusion-based generative inference with unified reward function provides scalable and domain-agnostic foundation for indoor AP deployment planning, overcoming limitations of LLM-based approaches.

Abstract: Intelligent access point (AP) deployment remains challenging in next-generation wireless networks due to complex indoor geometries and signal propagation. We firstly benchmark general-purpose large language models (LLMs) as agentic optimizers for AP planning and find that, despite strong wireless domain knowledge, their dependence on external verifiers results in high computational costs and limited scalability. Motivated by these limitations, we study generative inference models guided by a unified reward function capturing core AP deployment objectives across diverse floorplans. We show that diffusion samplers consistently outperform alternative generative approaches. The diffusion process progressively improves sampling by smoothing and sharpening the reward landscape, rather than relying on iterative refinement, which is effective for non-convex and fragmented objectives. Finally, we introduce a large-scale real-world dataset for indoor AP deployment, requiring over $50k$ CPU hours to train general reward functions, and evaluate in- and out-of-distribution generalization and robustness. Our results suggest that diffusion-based generative inference with a unified reward function provides a scalable and domain-agnostic foundation for indoor AP deployment planning.

[1009] Leveraging Textual-Cues for Enhancing Multimodal Sentiment Analysis by Object Recognition

Sumana Biswas, Karen Young, Josephine Griffith

Main category: cs.LG

TL;DR: TEMSA framework improves multimodal sentiment analysis by extracting object names from images and combining them with text data, showing better performance than individual modality analysis.

Details

Motivation: Multimodal sentiment analysis faces challenges due to dissimilarities between text and image modalities, sentiment ambiguity, and contextual complexity. The paper aims to address these difficulties by better integrating visual and textual information.

Method: Introduces TEMSA (Textual-Cues for Enhancing Multimodal Sentiment Analysis) based on object recognition. Extracts all object names detected in images and combines them with associated text data (called TEMS). Experiments with individual and combined analysis on two datasets.

Result: Only TEMS (combined text and extracted object names) improves results compared to individual analysis of text or images alone. The approach demonstrates effectiveness in enhancing multimodal sentiment analysis performance.

Conclusion: TEMSA contributes to advancing multimodal sentiment analysis by effectively combining image and text data through object recognition, offering insights into multimodal integration strategies.

Abstract: Multimodal sentiment analysis, which includes both image and text data, presents several challenges due to the dissimilarities in the modalities of text and image, the ambiguity of sentiment, and the complexities of contextual meaning. In this work, we experiment with finding the sentiments of image and text data, individually and in combination, on two datasets. Part of the approach introduces the novel `Textual-Cues for Enhancing Multimodal Sentiment Analysis’ (TEMSA) based on object recognition methods to address the difficulties in multimodal sentiment analysis. Specifically, we extract the names of all objects detected in an image and combine them with associated text; we call this combination of text and image data TEMS. Our results demonstrate that only TEMS improves the results when considering all the object names for the overall sentiment of multimodal data compared to individual analysis. This research contributes to advancing multimodal sentiment analysis and offers insights into the efficacy of TEMSA in combining image and text data for multimodal sentiment analysis.

[1010] Quantum Generator Kernels

Philipp Altmann, Maximilian Mansky, Maximilian Zorn, Jonas Stein, Claudia Linnhoff-Popien

Main category: cs.LG

TL;DR: Quantum Generator Kernels (QGKs) use variational generator groups to create parameterizable quantum kernels that can embed large-scale real-world data like images into constrained quantum devices, outperforming existing quantum and classical kernel methods.

Details

Motivation: Quantum kernel methods theoretically offer advantages by making classically inseparable features separable in quantum space, but practical application is limited by Noisy Intermediate-Scale Quantum (NISQ) hardware constraints. Current hybrid architectures with fixed intermediate embeddings may not fully exploit quantum computing's potential, necessitating better strategies to compress and embed large-scale data like images into quantum devices.

Method: Proposes Quantum Generator Kernels (QGKs) using Variational Generator Groups (VGGs) that merge universal generators into a parameterizable operator for scalable coverage of quantum space. Includes training a weight vector to optimize kernel alignment to the target domain by parameterizing the projection of VGGs in the current data context.

Result: Empirical results demonstrate superior projection and classification capabilities compared to state-of-the-art quantum and classical kernel approaches, showing potential as a versatile framework for various QML applications.

Conclusion: QGKs address limitations of current quantum machine learning approaches by providing a scalable, parameterizable quantum kernel method that can effectively embed large-scale real-world data like images into constrained quantum hardware, outperforming existing methods.

Abstract: Quantum kernel methods offer significant theoretical benefits by rendering classically inseparable features separable in quantum space. Yet, the practical application of Quantum Machine Learning (QML), currently constrained by the limitations of Noisy Intermediate-Scale Quantum (NISQ) hardware, necessitates effective strategies to compress and embed large-scale real-world data like images into the constrained capacities of existing quantum devices or simulators. To this end, we propose Quantum Generator Kernels (QGKs), a generator-based approach to quantum kernels, comprising a set of Variational Generator Groups (VGGs) that merge universal generators into a parameterizable operator, ensuring scalable coverage of the available quantum space. Thereby, we address shortcomings of current leading strategies employing hybrid architectures, which might prevent exploiting quantum computing’s full potential due to fixed intermediate embedding processes. To optimize the kernel alignment to the target domain, we train a weight vector to parameterize the projection of the VGGs in the current data context. Our empirical results demonstrate superior projection and classification capabilities of the QGK compared to state-of-the-art quantum and classical kernel approaches and show its potential to serve as a versatile framework for various QML applications.

[1011] Post-Training Probability Manifold Correction via Structured SVD Pruning and Self-Referential Distillation

Aaron R. Flouro, Shawn P. Chadwick

Main category: cs.LG

TL;DR: SparseKD is a post-training compression method that combines structured SVD pruning with self-referential knowledge distillation, where models teach themselves by matching their own pre-compression probability distributions, achieving 15-65% parameter reduction with acceptable quality trade-offs.

Details

Motivation: Large language models are expensive to deploy, creating a need for efficient compression methods that reduce computational costs while maintaining model quality.

Method: Combines structured SVD pruning with self-referential knowledge distillation where the model teaches itself by matching its own probability distribution from before compression, requiring no external teacher or architectural changes.

Result: Self-referential distillation alone improves model quality by 39% relative to original checkpoints; combined with pruning achieves 15-65% parameter reduction with acceptable quality trade-offs; speedups come from reduced dense matrix multiplication in feed-forward layers.

Conclusion: SparseKD provides an effective, immediately deployable compression method that requires no external super-teacher, architectural changes, or custom inference kernels, making it complementary to attention optimizations.

Abstract: Large language models are expensive to deploy. We introduce Sparse Knowledge Distillation (SparseKD), a post-training method that compresses transformer models by combining structured SVD pruning with self-referential knowledge distillation. The key insight is simple: instead of using an external teacher, the model teaches itself by matching its own probability distribution from before compression. This self-referential setup enables surprisingly strong quality recovery after aggressive pruning. Our experiments reveal an unexpected finding: self-referential distillation alone, applied post-training under an identical objective and fixed calibration dataset, improves model quality by 39% relative to the original converged checkpoint. When combined with structured pruning, SparseKD achieves 15-65% parameter reduction with acceptable quality trade-offs. Kernel profiling shows that speedups arise entirely from reduced dense matrix multiplication in feed-forward layers while attention remains unchanged, making this approach complementary to attention optimizations. We validate across two model families (0.6B and 3.8B parameters) with multi-seed experiments confirming high reproducibility. SparseKD requires no external super-teacher, no architectural changes, and no custom inference kernels, making it immediately deployable with existing infrastructure.

[1012] MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science

Delia McGrath, Curtis Chong, Rohil Kulkarni, Gerbrand Ceder, Adeesh Kolluru

Main category: cs.LG

TL;DR: Multimodal post-training with experimental images improves materials science reasoning beyond text-only supervision, showing cross-modal transfer benefits.

Details

Motivation: Need to assess whether incorporating visual experimental data during post-training improves mechanism-grounded explanation reasoning beyond text-only supervision in materials science.

Method: Introduced MATRIX benchmark for materials science reasoning, compared post-training on structured text alone vs. with paired experimental images, isolating visual grounding effects.

Result: Visual supervision improved experimental interpretation by 10-25% and yielded 5-16% gains on text-only scientific reasoning tasks; improvements rely on correct image-text alignment.

Conclusion: Multimodal post-training with visual data enhances scientific reasoning, showing cross-modal representational transfer that extends beyond materials science to other domains.

Abstract: Scientific reasoning in materials science requires integrating multimodal experimental evidence with underlying physical theory. Existing benchmarks make it difficult to assess whether incorporating visual experimental data during post-training improves mechanism-grounded explanation reasoning beyond text-only supervision. We introduce MATRIX, a multimodal benchmark for materials science reasoning that evaluates foundational theory, research-level reasoning, and the interpretation of real experimental artifacts across multiple characterization modalities. Using MATRIX as a controlled diagnostic, we isolate the effect of visual grounding by comparing post-training on structured materials science text alone with post-training that incorporates paired experimental images. Despite using relatively small amounts of multimodal data, visual supervision improves experimental interpretation by 10-25% and yields 5-16% gains on text-only scientific reasoning tasks. Our results demonstrate that these improvements rely on correct image-text alignment during post-training, highlighting cross-modal representational transfer. We also observe consistent improvements on ScienceQA and PubMedQA, demonstrating that the benefits of structured multimodal post-training extend beyond materials science. The MATRIX dataset is available at https://huggingface.co/datasets/radical-ai/MATRIX and the model at https://huggingface.co/radical-ai/MATRIX-PT.

[1013] RePaint-Enhanced Conditional Diffusion Model for Parametric Engineering Designs under Performance and Parameter Constraints

Ke Wang, Nguyen Gia Hien Vu, Yifan Tang, Mostafa Rahmani Dehaghani, G. Gary Wang

Main category: cs.LG

TL;DR: A RePaint-enhanced diffusion framework for engineering design generation that creates missing components from partial references while satisfying performance constraints, without retraining.

Details

Motivation: To enable generative design in engineering applications where partial reference designs exist and performance constraints must be satisfied, without requiring model retraining for each new constraint.

Method: Uses a pre-trained performance-guided DDPM with RePaint enhancement, applying mask-based resampling during inference to generate missing design components from partial references while respecting performance and parameter constraints.

Result: The method achieves accuracy comparable to or better than pre-trained models on parametric ship hull and airfoil design problems, enabling controlled novelty through partial design fixing.

Conclusion: Provides an efficient, training-free solution for parameter-constraint-aware generative design in engineering applications, extending diffusion models to constrained design generation.

Abstract: This paper presents a RePaint-enhanced framework that integrates a pre-trained performance-guided denoising diffusion probabilistic model (DDPM) for performance- and parameter-constraint engineering design generation. The proposed method enables the generation of missing design components based on a partial reference design while satisfying performance constraints, without retraining the underlying model. By applying mask-based resampling during inference process, RePaint allows efficient and controllable repainting of partial designs under both performance and parameter constraints, which is not supported by conventional DDPM-base methods. The framework is evaluated on two representative design problems, parametric ship hull design and airfoil design, demonstrating its ability to generate novel designs with expected performance based on a partial reference design. Results show that the method achieves accuracy comparable to or better than pre-trained models while enabling controlled novelty through fixing partial designs. Overall, the proposed approach provides an efficient, training-free solution for parameter-constraint-aware generative design in engineering applications.

[1014] A Fragile Guardrail: Diffusion LLM’s Safety Blessing and Its Failure Mode

Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu

Main category: cs.LG

TL;DR: Diffusion LLMs show intrinsic safety against jailbreak attacks but are vulnerable to context nesting attacks that embed harmful requests in benign structures.

Details

Motivation: To investigate the safety properties of Diffusion LLMs compared to autoregressive LLMs, particularly their robustness against jailbreak attacks and potential vulnerabilities.

Method: Analyzed the diffusion trajectory mechanism that provides stepwise reduction of unsafe generations, then identified and tested context nesting attacks that bypass this protection by embedding harmful requests within structured benign contexts.

Result: D-LLMs demonstrate intrinsic safety against traditional jailbreak attacks but are vulnerable to context nesting attacks, achieving state-of-the-art attack success rates and enabling the first successful jailbreak of Gemini Diffusion.

Conclusion: D-LLMs have inherent safety advantages but are not foolproof; context nesting exposes critical vulnerabilities, requiring new safety measures for commercial D-LLMs.

Abstract: Diffusion large language models (D-LLMs) offer an alternative to autoregressive LLMs (AR-LLMs) and have demonstrated advantages in generation efficiency. Beyond the utility benefits, we argue that D-LLMs exhibit a previously underexplored safety blessing: their diffusion-style generation confers intrinsic robustness against jailbreak attacks originally designed for AR-LLMs. In this work, we provide an initial analysis of the underlying mechanism, showing that the diffusion trajectory induces a stepwise reduction effect that progressively suppresses unsafe generations. This robustness, however, is not absolute. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts, effectively bypassing the stepwise reduction mechanism. Empirically, we show that this simple strategy is sufficient to bypass D-LLMs’ safety blessing, achieving state-of-the-art attack success rates across models and benchmarks. Most notably, it enables the first successful jailbreak of Gemini Diffusion, to our knowledge, exposing a critical vulnerability in commercial D-LLMs. Together, our results characterize both the origins and the limits of D-LLMs’ safety blessing, constituting an early-stage red-teaming of D-LLMs.

[1015] Localized, High-resolution Geographic Representations with Slepian Functions

Arjun Rao, Ruth Crasto, Tessa Ooms, David Rolnick, Konstantin Klemmer, Marc Rußwurm

Main category: cs.LG

TL;DR: Slepian function-based geographic location encoder that concentrates representational capacity in regions-of-interest and scales to high resolutions without heavy computational demands.

Details

Motivation: Geographic data is fundamentally local (disease outbreaks, ecological patterns, economic activity), but current ML models distribute representational capacity uniformly across the globe, struggling at fine-grained resolutions needed for localized applications.

Method: Proposes a geographic location encoder built from spherical Slepian functions that concentrate representational capacity inside a region-of-interest. Also presents a hybrid Slepian-Spherical Harmonic encoder for settings requiring global context, which efficiently bridges local-global performance tradeoff while retaining pole-safety and spherical-surface-distance preservation.

Result: Across five tasks spanning classification, regression, and image-augmented prediction, Slepian encodings outperform baselines and retain performance advantages across a wide range of neural network architectures.

Conclusion: The proposed Slepian-based geographic encoders provide efficient, high-resolution representation for localized geographic applications while maintaining global context when needed, outperforming existing approaches across diverse tasks and architectures.

Abstract: Geographic data is fundamentally local. Disease outbreaks cluster in population centers, ecological patterns emerge along coastlines, and economic activity concentrates within country borders. Machine learning models that encode geographic location, however, distribute representational capacity uniformly across the globe, struggling at the fine-grained resolutions that localized applications require. We propose a geographic location encoder built from spherical Slepian functions that concentrate representational capacity inside a region-of-interest and scale to high resolutions without extensive computational demands. For settings requiring global context, we present a hybrid Slepian-Spherical Harmonic encoder that efficiently bridges the tradeoff between local-global performance, while retaining desirable properties such as pole-safety and spherical-surface-distance preservation. Across five tasks spanning classification, regression, and image-augmented prediction, Slepian encodings outperform baselines and retain performance advantages across a wide range of neural network architectures.

[1016] Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity

Aayush Gautam, Mukul Gagrani, Junyoung Park, Mingu Lee, Chiris Lott, Narasimha Reddy

Main category: cs.LG

TL;DR: FastForward accelerates LLM prefill stage by predicting and sparsifying FFN computations with minimal accuracy loss

Details

Motivation: The prefill stage of LLM inference is computationally expensive for long contexts, with FFNs dominating costs. Existing sparsification methods designed for autoregressive decoding don't exploit prefill parallelism and often degrade accuracy.

Method: Combines lightweight expert predictor to select high-importance neurons per block, error compensation network to correct sparsity-induced errors, and layer-wise sparsity scheduler to allocate compute based on token-mixing importance.

Result: Achieves up to 1.45× compute-bound speedup at 50% FFN sparsity with <6% accuracy loss on LongBench across LLaMA and Qwen models up to 8B parameters, reducing Time-to-First-Token for efficient long-context inference.

Conclusion: FastForward enables efficient long-context LLM inference on constrained hardware by accelerating the prefill stage through predictive FFN sparsity while maintaining accuracy.

Abstract: The prefill stage of large language model (LLM) inference is a key computational bottleneck for long-context workloads. At short-to-moderate context lengths (1K–16K tokens), Feed-Forward Networks (FFNs) dominate this cost, accounting for most of the total FLOPs. Existing FFN sparsification methods, designed for autoregressive decoding, fail to exploit the prefill stage’s parallelism and often degrade accuracy. To address this, we introduce FastForward, a predictive sparsity framework that accelerates LLM prefill through block-wise, context-aware FFN sparsity. FastForward combines (1) a lightweight expert predictor to select high-importance neurons per block, (2) an error compensation network to correct sparsity-induced errors, and (3) a layer-wise sparsity scheduler to allocate compute based on token-mixing importance. Across LLaMA and Qwen models up to 8B parameters, FastForward delivers up to 1.45$\times$ compute-bound speedup at 50% FFN sparsity with $<$ 6% accuracy loss compared to the dense baseline on LongBench, substantially reducing Time-to-First-Token (TTFT) for efficient, long-context LLM inference on constrained hardware.

[1017] MemoryLLM: Plug-n-Play Interpretable Feed-Forward Memory for Transformers

Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Arnav Kundu, Mehrdad Farajtabar, Minsik Cho

Main category: cs.LG

TL;DR: MemoryLLM decouples feed-forward networks from self-attention to study FFNs as context-free token-wise neural retrieval memory, enabling pre-computed token lookups for improved inference efficiency.

Details

Motivation: To better understand transformer components in LLMs, particularly feed-forward modules (FFNs), by decoupling them from self-attention to study FFNs as context-free token-wise neural retrieval memory.

Method: MemoryLLM trains FFNs in isolation from self-attention using token embeddings, enabling FFNs to be pre-computed as token-wise lookups (ToLs). Also introduces Flex-MemoryLLM as an intermediate architecture between conventional transformers and MemoryLLM.

Result: Enables on-demand transfer between VRAM and storage, enhancing inference efficiency. Investigates how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks.

Conclusion: MemoryLLM provides a framework for studying FFNs as context-free token-wise neural retrieval memory, offering both interpretability benefits and practical efficiency improvements for transformer architectures.

Abstract: Understanding how transformer components operate in LLMs is important, as it is at the core of recent technological advances in artificial intelligence. In this work, we revisit the challenges associated with interpretability of feed-forward modules (FFNs) and propose MemoryLLM, which aims to decouple FFNs from self-attention and enables us to study the decoupled FFNs as context-free token-wise neural retrieval memory. In detail, we investigate how input tokens access memory locations within FFN parameters and the importance of FFN memory across different downstream tasks. MemoryLLM achieves context-free FFNs by training them in isolation from self-attention directly using the token embeddings. This approach allows FFNs to be pre-computed as token-wise lookups (ToLs), enabling on-demand transfer between VRAM and storage, additionally enhancing inference efficiency. We also introduce Flex-MemoryLLM, positioning it between a conventional transformer design and MemoryLLM. This architecture bridges the performance gap caused by training FFNs with context-free token-wise embeddings.

[1018] DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

Hon Tik Tse, Marlos C. Machado

Main category: cs.LG

TL;DR: Direct approximation of the principal eigenvector of the default representation for efficient reward shaping in RL

Details

Motivation: The default representation and its principal eigenvector are useful for various RL applications, but current methods require expensive matrix approximation and eigendecomposition that don't scale to high-dimensional spaces.

Method: Derived an objective for directly approximating the principal eigenvector of the default representation using a neural network, bypassing the need for explicit matrix computation.

Result: Empirically demonstrated effectiveness in multiple environments and successfully applied the learned eigenvectors for reward shaping.

Conclusion: The direct approximation method provides a scalable approach to compute the principal eigenvector of the default representation for RL applications.

Abstract: In computational reinforcement learning, the default representation (DR) and its principal eigenvector have been shown to be effective for a wide variety of applications, including reward shaping, count-based exploration, option discovery, and transfer. However, in prior investigations, the eigenvectors of the DR were computed by first approximating the DR matrix, and then performing an eigendecomposition. This procedure is computationally expensive and does not scale to high-dimensional spaces. In this paper, we derive an objective for directly approximating the principal eigenvector of the DR with a neural network. We empirically demonstrate the effectiveness of the objective in a number of environments, and apply the learned eigenvectors for reward shaping.

[1019] Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

Suprim Nakarmi, Junggab Son, Yue Zhao, Zuobin Xiong

Main category: cs.LG

TL;DR: Fed-Listing is a gradient-based attack that infers private label statistics from federated graph neural networks by analyzing final-layer gradients, revealing class proportions without accessing raw data.

Details

Motivation: Federated Graph Neural Networks (FedGNNs) aim to preserve privacy in collaborative learning over decentralized graph data, but shared model updates (gradients) can still leak sensitive information. While various privacy inference attacks exist, label distribution inference in FedGNNs remains underexplored.

Method: Fed-Listing uses only final-layer gradients exchanged during training to infer private label statistics. It employs an auxiliary shadow dataset to generate diverse label partitioning strategies, simulating various client distributions, and trains an attack model on these simulated scenarios.

Result: Extensive experiments on four benchmark datasets and three GNN architectures show Fed-Listing significantly outperforms existing baselines (random guessing and Decaf), even under challenging non-i.i.d. scenarios. Defense mechanisms barely reduce attack performance unless model utility is severely degraded.

Conclusion: Fed-Listing demonstrates that even in federated settings with GNNs, gradient sharing can leak sensitive label distribution information, highlighting the need for more robust privacy-preserving techniques in federated graph learning.

Abstract: Graph Neural Networks (GNNs) have been intensively studied for their expressive representation and learning performance on graph-structured data, enabling effective modeling of complex relational dependencies among nodes and edges in various domains. However, the standalone GNNs can unleash threat surfaces and privacy implications, as some sensitive graph-structured data is collected and processed in a centralized setting. To solve this issue, Federated Graph Neural Networks (FedGNNs) are proposed to facilitate collaborative learning over decentralized local graph data, aiming to preserve user privacy. Yet, emerging research indicates that even in these settings, shared model updates, particularly gradients, can unintentionally leak sensitive information of local users. Numerous privacy inference attacks have been explored in traditional federated learning and extended to graph settings, but the problem of label distribution inference in FedGNNs remains largely underexplored. In this work, we introduce Fed-Listing (Federated Label Distribution Inference in GNNs), a novel gradient-based attack designed to infer the private label statistics of target clients in FedGNNs without access to raw data or node features. Fed-Listing only leverages the final-layer gradients exchanged during training to uncover statistical patterns that reveal class proportions in a stealthy manner. An auxiliary shadow dataset is used to generate diverse label partitioning strategies, simulating various client distributions, on which the attack model is obtained. Extensive experiments on four benchmark datasets and three GNN architectures show that Fed-Listing significantly outperforms existing baselines, including random guessing and Decaf, even under challenging non-i.i.d. scenarios. Moreover, applying defense mechanisms can barely reduce our attack performance, unless the model’s utility is severely degraded.

[1020] Variational Approach for Job Shop Scheduling

Seung Heon Oh, Jiwon Baek, Ki Young Cho, Hee Chang Yoon, Jong Hun Woo

Main category: cs.LG

TL;DR: VG2S framework uses variational inference and graph neural networks to solve Job Shop Scheduling Problems with better generalization and training stability than conventional DRL methods.

Details

Motivation: Job Shop Scheduling Problem (JSSP) is critical for manufacturing efficiency but conventional Deep Reinforcement Learning approaches suffer from non-stationarity during training and poor generalization to unseen problem instances due to simultaneous optimization of representation learning and policy execution.

Method: Proposes Variational Graph-to-Scheduler (VG2S) framework that introduces variational inference to JSSP for the first time, using ELBO with maximum entropy reinforcement learning. Decouples representation learning from policy optimization via variational graph encoder to learn robust structural representations of scheduling instances.

Result: VG2S demonstrates superior zero-shot generalization compared to state-of-the-art DRL baselines and traditional dispatching rules, especially on large-scale challenging benchmarks like DMU and SWV, with enhanced training stability and robustness against hyperparameter variations.

Conclusion: The VG2S framework effectively addresses JSSP challenges by decoupling representation learning from policy optimization through variational inference, leading to improved generalization and training stability in manufacturing scheduling problems.

Abstract: This paper proposes a novel Variational Graph-to-Scheduler (VG2S) framework for solving the Job Shop Scheduling Problem (JSSP), a critical task in manufacturing that directly impacts operational efficiency and resource utilization. Conventional Deep Reinforcement Learning (DRL) approaches often face challenges such as non-stationarity during training and limited generalization to unseen problem instances because they optimize representation learning and policy execution simultaneously. To address these issues, we introduce variational inference to the JSSP domain for the first time and derive a probabilistic objective based on the Evidence of Lower Bound (ELBO) with maximum entropy reinforcement learning. By mathematically decoupling representation learning from policy optimization, the VG2S framework enables the agent to learn robust structural representations of scheduling instances through a variational graph encoder. This approach significantly enhances training stability and robustness against hyperparameter variations. Extensive experiments demonstrate that the proposed method exhibits superior zero-shot generalization compared with state-of-the-art DRL baselines and traditional dispatching rules, particularly on large-scale and challenging benchmark instances such as DMU and SWV.

[1021] Robustness of AutoML on Dirty Categorical Data

Marcos L. P. Bueno, Joaquin Vanschoren

Main category: cs.LG

TL;DR: AutoML pipeline for handling dirty categorical data with morphological encoders to improve predictive performance on datasets with high-cardinality categorical features.

Details

Motivation: AutoML methods for classification handle data imperfections but their behavior on dirty categorical datasets with high-cardinality features is less known. Recent research shows morphological encoders improve ML performance on such data, but their effects in AutoML pipelines are unexplored.

Method: Propose a pipeline that transforms categorical data into numerical data using advanced encoding schemes, then benchmark AutoML methods on dirty datasets to compare performance and analyze the ML pipelines built by AutoMLs.

Result: Benchmarking reveals differences in predictive performance between standard AutoML methods and the proposed pipeline with morphological encoders, providing insights into ML pipeline construction beyond just the best model.

Conclusion: The proposed pipeline enables AutoML methods to better handle dirty categorical data through advanced encoding schemes, offering improved performance and deeper insights into pipeline construction for categorical data challenges.

Abstract: The goal of automated machine learning (AutoML) is to reduce trial and error when doing machine learning (ML). Although AutoML methods for classification are able to deal with data imperfections, such as outliers, multiple scales and missing data, their behavior is less known on dirty categorical datasets. These datasets often have several categorical features with high cardinality arising from issues such as lack of curation and automated collection. Recent research has shown that ML models can benefit from morphological encoders for dirty categorical data, leading to significantly superior predictive performance. However the effects of using such encoders in AutoML methods are not known at the moment. In this paper, we propose a pipeline that transforms categorical data into numerical data so that an AutoML can handle categorical data transformed by more advanced encoding schemes. We benchmark the current robustness of AutoML methods on a set of dirty datasets and compare it with the proposed pipeline. This allows us to get insight on differences in predictive performance. We also look at the ML pipelines built by AutoMLs in order to gain insight beyond the best model as typically returned by these methods.

[1022] Federated-inspired Single-cell Batch Integration in Latent Space

Quang-Huy Nguyen, Zongliang Yue, Hao Chen, Wei-Shinn Ku, Jiaqi Wang

Main category: cs.LG

TL;DR: scBatchProx: A federated learning-inspired post-hoc optimization method for batch correction in single-cell RNA sequencing data that refines cell embeddings without requiring raw data or centralized training.

Details

Motivation: Single-cell RNA sequencing generates massive datasets with batch effects that obscure biological signals. Existing batch correction methods either insufficiently correct batch effects or require centralized retraining on complete datasets, limiting applicability in distributed and evolving single-cell data settings.

Method: scBatchProx treats each batch as a client in a federated learning framework, learning batch-conditioned adapters under proximal regularization to correct batch structure directly in latent space without requiring raw expression data or centralized optimization. It’s a lightweight post-hoc method that only optimizes batch-specific adapter parameters.

Result: Extensive experiments show scBatchProx consistently yields relative gains of approximately 3-8% in overall embedding quality, with batch correction and biological conservation improving in 90% and 85% of data-method pairs respectively.

Conclusion: scBatchProx represents a step toward practical refinement of learned representations in dynamic single-cell data systems, offering a deployable solution for batch correction in distributed settings.

Abstract: Advances in single-cell RNA sequencing enable the rapid generation of massive, high-dimensional datasets, yet the accumulation of data across experiments introduces batch effects that obscure true biological signals. Existing batch correction approaches either insufficiently correct batch effects or require centralized retraining on the complete dataset, limiting their applicability in distributed and continually evolving single-cell data settings. We introduce scBatchProx, a post-hoc optimization method inspired by federated learning principles for refining cell-level embeddings produced by arbitrary upstream methods. Treating each batch as a client, scBatchProx learns batch-conditioned adapters under proximal regularization, correcting batch structure directly in latent space without requiring raw expression data or centralized optimization. The method is lightweight and deployable, optimizing batch-specific adapter parameters only. Extensive experiments show that scBatchProx consistently yields relative gains of approximately 3-8% in overall embedding quality, with batch correction and biological conservation improving in 90% and 85% of data-method pairs, respectively. We envision this work as a step toward the practical refinement of learned representations in dynamic single-cell data systems.

[1023] Open Materials Generation with Inference-Time Reinforcement Learning

Philipp Hoellmer, Stefano Martiniani

Main category: cs.LG

TL;DR: OMatG-IRL enables RL-based crystal structure generation using velocity fields instead of score functions, improving sampling efficiency by orders of magnitude.

Details

Motivation: Current continuous-time generative models for crystalline materials struggle to incorporate explicit target properties into generation. While RL could align models with objectives, it typically requires score functions that aren't available in flow-based models using velocity fields.

Method: OMatG-IRL is a policy-gradient RL framework that operates directly on learned velocity fields without needing explicit score computation. It uses stochastic perturbations of generation dynamics to enable exploration and policy-gradient estimation at inference time while preserving baseline performance.

Result: First application of RL to crystal structure prediction (CSP). Achieves competitive performance with score-based RL approaches while preserving diversity through composition conditioning. Enables time-dependent velocity-annealing schedules with order-of-magnitude improvements in sampling efficiency and generation time reduction.

Conclusion: OMatG-IRL successfully enables RL-based optimization for flow-based generative models in materials science, overcoming the score function limitation and achieving significant efficiency gains for crystal structure prediction.

Abstract: Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and, correspondingly, reduction in generation time.

[1024] LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

Vikram Krishnamurthy

Main category: cs.LG

TL;DR: Mathematical formulation of transformer-based LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies, covering training, alignment, and generation processes.

Details

Motivation: To provide a concise mathematical reference for researchers seeking explicit equation-level descriptions of LLM training, alignment, and generation, moving beyond vague architectural descriptions.

Method: Formulates LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies, where self-attention emerges as repeated bilinear-softmax-linear compositions. Covers pretraining via next-token prediction, alignment methods (RLHF, DPO, RSFT, RLVR), and autoregressive generation.

Result: Provides a unified mathematical framework enabling principled analysis of alignment-induced behaviors (sycophancy), inference-time phenomena (hallucination, in-context learning, chain-of-thought, retrieval-augmented generation), and extensions like continual learning.

Conclusion: The mathematical formulation serves as a concise reference for interpretation and theoretical development of LLMs, clarifying their computational structure beyond architectural descriptions.

Abstract: Large language models (LLMs) based on transformer architectures are typically described through collections of architectural components and training procedures, obscuring their underlying computational structure. This review article provides a concise mathematical reference for researchers seeking an explicit, equation-level description of LLM training, alignment, and generation. We formulate LLMs as high-dimensional nonlinear autoregressive models with attention-based dependencies. The framework encompasses pretraining via next-token prediction, alignment methods such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), rejection sampling fine-tuning (RSFT), and reinforcement learning from verifiable rewards (RLVR), as well as autoregressive generation during inference. Self-attention emerges naturally as a repeated bilinear–softmax–linear composition, yielding highly expressive sequence models. This formulation enables principled analysis of alignment-induced behaviors (including sycophancy), inference-time phenomena (such as hallucination, in-context learning, chain-of-thought prompting, and retrieval-augmented generation), and extensions like continual learning, while serving as a concise reference for interpretation and further theoretical development.

[1025] Towards Building Non-Fine-Tunable Foundation Models

Ziyao Wang, Nizhang Li, Pingzhi Li, Guoheng Sun, Tianlong Chen, Ang Li

Main category: cs.LG

TL;DR: PMP (Private Mask Pre-Training) creates non-fine-tunable foundation models by concentrating learning in a private sparse subnetwork mask, preventing effective unauthorized fine-tuning while maintaining base performance.

Details

Motivation: Open-sourcing foundation models exposes trainers to economic and safety risks from unrestricted downstream fine-tuning, creating a need for models that remain broadly usable but resist unauthorized adaptation.

Method: Private Mask Pre-Training (PMP) concentrates representation learning into a sparse subnetwork identified early in training, keeps the binary mask private, and releases only dense weights, creating misalignment between fine-tuning objectives and pre-training geometry.

Result: PMP preserves base model performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with non-fine-tunability strength controlled by mask ratio.

Conclusion: PMP provides a practical framework for creating non-fine-tunable foundation models that balance open access with protection against unauthorized adaptation, addressing key safety and economic concerns in model sharing.

Abstract: Open-sourcing foundation models (FMs) enables broad reuse but also exposes model trainers to economic and safety risks from unrestricted downstream fine-tuning. We address this problem by building non-fine-tunable foundation models: models that remain broadly usable in their released form while yielding limited adaptation gains under task-agnostic unauthorized fine-tuning. We propose Private Mask Pre-Training (PMP), a pre-training framework that concentrates representation learning into a sparse subnetwork identified early in training. The binary mask defining this subnetwork is kept private, and only the final dense weights are released. This forces unauthorized fine-tuning without access to the mask to update parameters misaligned with pretraining subspace, inducing an intrinsic mismatch between the fine-tuning objective and the pre-training geometry. We provide theoretical analysis showing that this mismatch destabilizes gradient-based adaptation and bounds fine-tuning gains. Empirical results on large language models demonstrating that PMP preserves base model performance while consistently degrading unauthorized fine-tuning across a wide range of downstream tasks, with the strength of non-fine-tunability controlled by the mask ratio.

[1026] Stabilizing Decentralized Federated Fine-Tuning via Topology-Aware Alternating LoRA

Xiaoyu Wang, Xiaotian Li, Zhixiang Zhou, Chen Li, Yong Liu

Main category: cs.LG

TL;DR: TAD-LoRA is a topology-aware decentralized federated learning framework that addresses challenges in parameter-efficient fine-tuning using LoRA under dynamic communication graphs by coordinating updates to control inter-client misalignment.

Details

Motivation: Decentralized federated learning (DFL) with low-rank adaptation (LoRA) faces unique challenges due to the factorized structure of LoRA parameters. Unlike linear parameters, decentralized aggregation of LoRA updates introduces topology-dependent cross terms that can destabilize training under dynamic communication graphs.

Method: Proposes TAD-LoRA, a Topology-Aware Decentralized Low-Rank Adaptation framework that coordinates the updates and mixing of LoRA factors to control inter-client misalignment. The method theoretically addresses convergence under non-convex objectives with explicit characterization of trade-offs between topology-induced cross-term error and block-coordinate representation bias.

Result: Experiments under various communication conditions validate the analysis, showing TAD-LoRA achieves robust performance across different communication scenarios. It remains competitive in strongly connected topologies and delivers clear gains under moderately and weakly connected topologies, with particularly strong results on the MNLI dataset.

Conclusion: TAD-LoRA provides an effective solution for parameter-efficient fine-tuning in decentralized federated learning settings, addressing the unique challenges posed by LoRA’s factorized structure in dynamic communication environments.

Abstract: Decentralized federated learning (DFL), a serverless variant of federated learning, poses unique challenges for parameter-efficient fine-tuning due to the factorized structure of low-rank adaptation (LoRA). Unlike linear parameters, decentralized aggregation of LoRA updates introduces topology-dependent cross terms that can destabilize training under dynamic communication graphs. We propose \texttt{TAD-LoRA}, a Topology-Aware Decentralized Low-Rank Adaptation framework that coordinates the updates and mixing of LoRA factors to control inter-client misalignment. We theoretically prove the convergence of \texttt{TAD-LoRA} under non-convex objectives, explicitly characterizing the trade-off between topology-induced cross-term error and block-coordinate representation bias governed by the switching interval of alternative training. Experiments under various communication conditions validate our analysis, showing that \texttt{TAD-LoRA} achieves robust performance across different communication scenarios, remaining competitive in strongly connected topologies and delivering clear gains under moderately and weakly connected topologies, with particularly strong results on the MNLI dataset.

[1027] FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

Ziyao Wang, Daeun Jung, Yexiao He, Guoheng Sun, Zheyu Shen, Myungjin Lee, Ang Li

Main category: cs.LG

TL;DR: FedMOA is a federated learning framework that extends Group Relative Policy Optimization (GRPO) for multi-objective alignment of large language models under heterogeneous reward definitions across devices, using adaptive weighting and quality-aware aggregation.

Details

Motivation: Traditional RL alignment is memory-prohibitive for on-device federated learning due to critic network overhead. GRPO's critic-free architecture enables on-device training, but federated settings introduce challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs.

Method: FedMOA stabilizes local training through an online adaptive weighting mechanism via hypergradient descent, prioritizing primary reasoning as auxiliary objectives saturate. On the server side, it uses task- and accuracy-aware aggregation to prioritize high-quality updates.

Result: Experiments on mathematical reasoning and code generation benchmarks show FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.

Conclusion: FedMOA provides an effective federated learning framework for multi-objective alignment of large language models that addresses the challenges of heterogeneous rewards and resource constraints in on-device training scenarios.

Abstract: Group Relative Policy Optimization (GRPO) has recently emerged as an effective approach for improving the reasoning capabilities of large language models through online multi-objective reinforcement learning. While personalization on private data is increasingly vital, traditional Reinforcement Learning (RL) alignment is often memory-prohibitive for on-device federated learning due to the overhead of maintaining a separate critic network. GRPO’s critic-free architecture enables feasible on-device training, yet transitioning to a federated setting introduces systemic challenges: heterogeneous reward definitions, imbalanced multi-objective optimization, and high training costs. We propose FedMOA, a federated GRPO framework for multi-objective alignment under heterogeneous rewards. FedMOA stabilizes local training through an online adaptive weighting mechanism via hypergradient descent, which prioritizes primary reasoning as auxiliary objectives saturate. On the server side, it utilizes a task- and accuracy-aware aggregation strategy to prioritize high-quality updates. Experiments on mathematical reasoning and code generation benchmarks demonstrate that FedMOA consistently outperforms federated averaging, achieving accuracy gains of up to 2.2% while improving global performance, personalization, and multi-objective balance.

[1028] LatentTrack: Sequential Weight Generation via Latent Filtering

Omer Haq

Main category: cs.LG

TL;DR: LatentTrack (LT) is a sequential neural architecture for online probabilistic prediction under nonstationary dynamics, performing Bayesian filtering in latent space with hypernetwork-generated model parameters for constant-time online adaptation.

Details

Motivation: The paper addresses the challenge of online probabilistic prediction under nonstationary dynamics, where traditional methods struggle with distribution shift and require computationally expensive per-step gradient updates.

Method: LT performs causal Bayesian filtering in low-dimensional latent space using a lightweight hypernetwork to generate predictive model parameters at each time step. It employs a predict-generate-update filtering framework in function space with Monte Carlo inference over latent trajectories.

Result: On the Jena Climate benchmark for long-horizon online regression, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration.

Conclusion: Latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift, enabling constant-time online adaptation without per-step gradient updates.

Abstract: We introduce LatentTrack (LT), a sequential neural architecture for online probabilistic prediction under nonstationary dynamics. LT performs causal Bayesian filtering in a low-dimensional latent space and uses a lightweight hypernetwork to generate predictive model parameters at each time step, enabling constant-time online adaptation without per-step gradient updates. At each time step, a learned latent model predicts the next latent distribution, which is updated via amortized inference using new observations, yielding a predict–generate–update filtering framework in function space. The formulation supports both structured (Markovian) and unstructured latent dynamics within a unified objective, while Monte Carlo inference over latent trajectories produces calibrated predictive mixtures with fixed per-step cost. Evaluated on long-horizon online regression using the Jena Climate benchmark, LT consistently achieves lower negative log-likelihood and mean squared error than stateful sequential and static uncertainty-aware baselines, with competitive calibration, demonstrating that latent-conditioned function evolution is an effective alternative to traditional latent-state modeling under distribution shift.

[1029] Search Inspired Exploration in Reinforcement Learning

Georgios Sotirchos, Zlatan Ajanović, Jens Kober

Main category: cs.LG

TL;DR: SIERL is a reinforcement learning method that uses search-inspired sub-goal selection from frontier states to improve exploration in sparse-reward environments.

Details

Motivation: Existing RL exploration methods for sparse-reward environments have limitations: curriculum learning and Go-Explore rely on hand-crafted heuristics, while curiosity-driven methods risk converging to suboptimal policies. There's a need for systematic exploration that actively guides the agent toward informative regions.

Method: SIERL selects sub-goals from the frontier (boundary of known state space) at the start of each episode. Sub-goals are prioritized based on cost-to-come and cost-to-go estimates, steering exploration toward the most informative regions. The method ensures sub-goals are neither overly familiar nor completely novel.

Result: SIERL outperforms dominant baselines in challenging sparse-reward environments, achieving better performance on main task goals and generalizing better to reach arbitrary states in the environment.

Conclusion: SIERL provides an effective search-inspired approach for exploration in sparse-reward RL that systematically expands the frontier and guides exploration toward informative regions without relying on hand-crafted heuristics.

Abstract: Exploration in environments with sparse rewards remains a fundamental challenge in reinforcement learning (RL). Existing approaches such as curriculum learning and Go-Explore often rely on hand-crafted heuristics, while curiosity-driven methods risk converging to suboptimal policies. We propose Search-Inspired Exploration in Reinforcement Learning (SIERL), a novel method that actively guides exploration by setting sub-goals based on the agent’s learning progress. At the beginning of each episode, SIERL chooses a sub-goal from the \textit{frontier} (the boundary of the agent’s known state space), before the agent continues exploring toward the main task objective. The key contribution of our method is the sub-goal selection mechanism, which provides state-action pairs that are neither overly familiar nor completely novel. Thus, it assures that the frontier is expanded systematically and that the agent is capable of reaching any state within it. Inspired by search, sub-goals are prioritized from the frontier based on estimates of cost-to-come and cost-to-go, effectively steering exploration towards the most informative regions. In experiments on challenging sparse-reward environments, SIERL outperforms dominant baselines in both achieving the main task goal and generalizing to reach arbitrary states in the environment.

[1030] PAIR-Former: Budgeted Relational MIL for miRNA Target Prediction

Jiaqi Yin, Baiming Chen, Jia Fei, Mingjun Yang

Main category: cs.LG

TL;DR: PAIR-Former is a Budgeted Relational Multi-Instance Learning approach for miRNA-mRNA targeting that performs cheap pool scanning, selects diverse candidate sites, and applies Set Transformer aggregation under compute constraints.

Details

Motivation: miRNA-mRNA targeting is a large-bag prediction problem where each transcript has many candidate target sites but only pair-level labels are observed, requiring efficient methods to handle computational constraints while maintaining accuracy.

Method: Proposes PAIR-Former with three stages: 1) cheap full-pool scan, 2) selection of up to K diverse candidate target sites on CPU, 3) permutation-invariant Set Transformer aggregator on selected tokens under compute budget constraints.

Result: On miRAW dataset, PAIR-Former outperforms strong pooling baselines at practical budget (K=64) and provides controllable accuracy-compute trade-off as K varies, with theoretical analysis linking budgeted selection to approximation error and generalization.

Conclusion: PAIR-Former effectively addresses the Budgeted Relational MIL problem for miRNA-mRNA targeting, offering practical efficiency-accuracy trade-offs with theoretical guarantees for large-bag prediction problems.

Abstract: Functional miRNA–mRNA targeting is a large-bag prediction problem: each transcript yields a heavy-tailed pool of candidate target sites (CTSs), yet only a pair-level label is observed. We formalize this regime as \emph{Budgeted Relational Multi-Instance Learning (BR-MIL)}, where at most $K$ instances per bag may receive expensive encoding and relational processing under a hard compute budget. We propose \textbf{PAIR-Former} (Pool-Aware Instance-Relational Transformer), a BR-MIL pipeline that performs a cheap full-pool scan, selects up to $K$ diverse CTSs on CPU, and applies a permutation-invariant Set Transformer aggregator on the selected tokens. On miRAW, PAIR-Former outperforms strong pooling baselines at a practical operating budget ($K^\star{=}64$) while providing a controllable accuracy–compute trade-off as $K$ varies. We further provide theory linking budgeted selection to (i) approximation error decreasing with $K$ and (ii) generalization terms governed by $K$ in the expensive relational component.

[1031] Parallel Stochastic Gradient-Based Planning for World Models

Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar

Main category: cs.LG

TL;DR: GRASP is a differentiable planner for visual world models that treats states as optimization variables with soft dynamics constraints and stochastic exploration to solve long-horizon control tasks from video input.

Details

Motivation: World models can simulate environment dynamics from visual inputs, but planning with them is challenging due to vast unstructured search spaces. Existing planning methods struggle with long-horizon tasks and local optima in vision-based world models.

Method: Proposes GRASP: treats states as “virtual states” optimization variables with soft dynamics constraints, introduces stochasticity for exploration, modifies gradient structure to handle sensitive gradients in vision-based world models, enabling parallel computation and easier optimization.

Result: Outperforms existing planning algorithms (CEM and vanilla GD) on long-horizon experiments in both success rate and convergence time when applied to video-based world models.

Conclusion: GRASP provides an effective differentiable planning approach for visual world models that handles long-horizon control tasks through stochastic optimization and gradient structure modifications.

Abstract: World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving long-horizon control tasks from visual input. Our method treats states as optimization variables (“virtual states”) with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high-dimensional vision-based world models, we modify the gradient structure to descend towards valid plans while only requiring action-input gradients. Our planner, which we call GRASP (Gradient RelAxed Stochastic Planner), can be viewed as a stochastic version of a non-condensed or collocation-based optimal controller. We provide theoretical justification and experiments on video-based world models, where our resulting planner outperforms existing planning algorithms like the cross-entropy method (CEM) and vanilla gradient-based optimization (GD) on long-horizon experiments, both in success rate and time to convergence.

[1032] Diffusion LMs Can Approximate Optimal Infilling Lengths Implicitly

Hengchang Liu, Zhao Yang, Bing Su

Main category: cs.LG

TL;DR: CAL is a training-free method that enables diffusion language models to automatically discover optimal infilling lengths by leveraging statistical patterns in first-step denoising confidence, improving performance on code and text infilling tasks.

Details

Motivation: Diffusion language models have inherent bidirectional generation capabilities suitable for infilling, but their performance is limited by pre-specified infilling lengths. The authors aim to unlock DLMs' latent ability to discover correct infilling lengths without requiring specialized training.

Method: CAL identifies two key statistical phenomena in first-step denoising confidence: 1) Oracle Peak near ground-truth length, and 2) systematic Length Bias that obscures this signal. The method calibrates this bias and uses an efficient search to approximate optimal length before formal decoding.

Result: CAL improves Pass@1 by up to 47.7% over fixed-length baselines and 40.5% over chat-based adaptive methods in code infilling. For text infilling, it boosts BLEU-2 and ROUGE-L by up to 8.5% and 9.9% respectively.

Conclusion: CAL demonstrates that diffusion language models possess inherent ability to discover correct infilling lengths, and provides a training-free method to leverage this capability for robust infilling performance across code and text domains.

Abstract: Diffusion language models (DLMs) provide a bidirectional generation framework naturally suited for infilling, yet their performance is constrained by the pre-specified infilling length. In this paper, we reveal that DLMs possess an inherent ability to discover the correct infilling length. We identify two key statistical phenomena in the first-step denoising confidence: a local \textit{Oracle Peak} that emerges near the ground-truth length and a systematic \textit{Length Bias} that often obscures this signal. By leveraging this signal and calibrating the bias, our training-free method \textbf{CAL} (\textbf{C}alibrated \textbf{A}daptive \textbf{L}ength) enables DLMs to approximate the optimal length through an efficient search before formal decoding. Empirical evaluations demonstrate that CAL improves Pass@1 by up to 47.7% over fixed-length baselines and 40.5% over chat-based adaptive methods in code infilling, while boosting BLEU-2 and ROUGE-L by up to 8.5% and 9.9% in text infilling. These results demonstrate that CAL paves the way for robust DLM infilling without requiring any specialized training. Code is available at https://github.com/NiuHechang/Calibrated_Adaptive_Length.

[1033] Quality-Diversity Optimization as Multi-Objective Optimization

Xi Lin, Ping Guo, Yilu Liu, Qingfu Zhang, Jianyong Sun

Main category: cs.LG

TL;DR: This paper reformulates Quality-Diversity optimization as a multi-objective optimization problem with many objectives, enabling the use of established MOO methods for QD problems.

Details

Motivation: QD optimization aims to find diverse high-performing solutions, but existing QD algorithms have distinct design principles. The authors want to leverage well-established multi-objective optimization methods to solve QD problems more effectively.

Method: Reformulates QD optimization as a multi-objective optimization problem with a large number of objectives, then applies set-based scalarization techniques from MOO to solve QD problems through collaborative search.

Result: Experimental studies across several QD applications show the method achieves performance competitive with state-of-the-art QD algorithms.

Conclusion: The connection between QD and MOO enables leveraging established MOO techniques for QD problems while maintaining theoretical guarantees and achieving competitive performance.

Abstract: The Quality-Diversity (QD) optimization aims to discover a collection of high-performing solutions that simultaneously exhibit diverse behaviors within a user-defined behavior space. This paradigm has stimulated significant research interest and demonstrated practical utility in domains including robot control, creative design, and adversarial sample generation. A variety of QD algorithms with distinct design principles have been proposed in recent years. Instead of proposing a new QD algorithm, this work introduces a novel reformulation by casting the QD optimization as a multi-objective optimization (MOO) problem with a huge number of optimization objectives. By establishing this connection, we enable the direct adoption of well-established MOO methods, particularly set-based scalarization techniques, to solve QD problems through a collaborative search process. We further provide a theoretical analysis demonstrating that our approach inherits theoretical guarantees from MOO while providing desirable properties for the QD optimization. Experimental studies across several QD applications confirm that our method achieves performance competitive with state-of-the-art QD algorithms.

[1034] AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models

Jiarui Zhang, Yuchen Yang, Ran Yan, Zhiyu Mei, Liyuan Zhang, Daifeng Li, Wei Fu, Jiaxuan Gao, Shusheng Xu, Yi Wu, Binhang Yuan

Main category: cs.LG

TL;DR: AREAL-DTA is an efficient RL training framework that exploits prefix sharing in LLM post-training using DFS-based execution and distributed batching to reduce computation and memory overhead.

Details

Motivation: RL-based post-training for LLMs is computationally expensive due to repeated recomputation of shared token prefixes in rollout sequences. Existing methods process sequences independently, wasting resources on identical prefix computations.

Method: AREAL-DTA uses depth-first-search (DFS)-based execution to dynamically traverse rollout prefix trees, materializing only one root-to-leaf path at a time. It also incorporates load-balanced distributed batching across multiple GPUs for scalability.

Result: Achieves up to 8.31× higher training throughput on RL post-training workloads compared to existing methods, as measured by τ²-bench.

Conclusion: AREAL-DTA provides an efficient solution for RL-based LLM post-training by exploiting prefix sharing through DFS traversal and distributed processing, significantly improving computational efficiency.

Abstract: Reinforcement learning (RL) based post-training for large language models (LLMs) is computationally expensive, as it generates many rollout sequences that could frequently share long token prefixes. Existing RL frameworks usually process these sequences independently, repeatedly recomputing identical prefixes during forward and backward passes during policy model training, leading to substantial inefficiencies in computation and memory usage. Although prefix sharing naturally induces a tree structure over rollouts, prior tree-attention-based solutions rely on fully materialized attention masks and scale poorly in RL settings. In this paper, we introduce AREAL-DTA to efficiently exploit prefix sharing in RL training. AREAL-DTA employs a depth-first-search (DFS)-based execution strategy that dynamically traverses the rollout prefix tree during both forward and backward computation, materializing only a single root-to-leaf path at a time. To further improve scalability, AREAL-DTA incorporates a load-balanced distributed batching mechanism that dynamically constructs and processes prefix trees across multiple GPUs. Across the popular RL post-training workload, AREAL-DTA achieves up to $8.31\times$ in $τ^2$-bench higher training throughput.

[1035] OD-DEAL: Dynamic Expert-Guided Adversarial Learning with Online Decomposition for Scalable Capacitated Vehicle Routing

Dongbin Jiao, Zisheng Chen, Xianyi Wang, Jintao Shi, Shengcai Liu, Shi Yan

Main category: cs.LG

TL;DR: OD-DEAL is an adversarial learning framework that combines hybrid genetic search with online barycenter clustering decomposition and knowledge distillation to solve large-scale capacitated vehicle routing problems efficiently.

Details

Motivation: Solving large-scale CVRP is challenging due to high complexity of traditional heuristics and poor generalization of neural solvers on massive graphs. There's a need for methods that can handle large instances with near-constant neural scaling.

Method: OD-DEAL integrates hybrid genetic search (HGS) with online barycenter clustering (BCC) decomposition and uses knowledge distillation to transfer expert heuristic behavior. It trains a GAT-based generative policy through a minimax game, distilling divide-and-conquer strategies into dense surrogate rewards for clustering-free inference.

Result: Achieves state-of-the-art real-time CVRP performance, solving 10,000-node instances with near-constant neural scaling, enabling sub-second, heuristic-quality inference for dynamic large-scale deployment.

Conclusion: OD-DEAL provides an effective framework for large-scale CVRP that combines the strengths of heuristic methods with neural approaches, enabling efficient real-time solutions for massive routing problems.

Abstract: Solving large-scale capacitated vehicle routing problems (CVRP) is hindered by the high complexity of heuristics and the limited generalization of neural solvers on massive graphs. We propose OD-DEAL, an adversarial learning framework that tightly integrates hybrid genetic search (HGS) and online barycenter clustering (BCC) decomposition, and leverages high-fidelity knowledge distillation to transfer expert heuristic behavior. OD-DEAL trains a graph attention network (GAT)-based generative policy through a minimax game, in which divide-and-conquer strategies from a hybrid expert are distilled into dense surrogate rewards. This enables high-quality, clustering-free inference on large-scale instances. Empirical results demonstrate that OD-DEAL achieves state-of-the-art (SOTA) real-time CVRP performance, solving 10000-node instances with near-constant neural scaling. This uniquely enables the sub-second, heuristic-quality inference required for dynamic large-scale deployment.

[1036] Partition of Unity Neural Networks for Interpretable Classification with Explicit Class Regions

Akram Aldroubi

Main category: cs.LG

TL;DR: PUNN replaces softmax with learned partition of unity functions that directly represent class probabilities, enabling explicit visualization of class regions while maintaining competitive accuracy.

Details

Motivation: Neural network classifiers are difficult to interpret because softmax-based models define class regions implicitly through coupled inequalities among logits, making them hard to extract and visualize. There's a need for interpretable-by-design architectures that maintain competitive performance.

Method: Introduces Partition of Unity Neural Networks (PUNN) that constructs k nonnegative functions h₁,…,hₖ satisfying Σhᵢ(x)=1, where each hᵢ(x) directly represents P(class i|x). Uses gate functions gᵢ with various activation functions (sigmoid, Gaussian, bump) and parameterizations ranging from flexible MLPs to shape-informed designs (spherical shells, ellipsoids, spherical harmonics).

Result: PUNN with MLP-based gates achieves accuracy within 0.3-0.6% of standard MLPs on synthetic data, UCI benchmarks, and MNIST. When geometric priors match data structure, shape-informed gates achieve comparable accuracy with up to 300× fewer parameters.

Conclusion: Interpretable-by-design architectures like PUNN can be competitive with black-box models while providing transparent class probability assignments and explicit visualization of class regions.

Abstract: Despite their empirical success, neural network classifiers remain difficult to interpret. In softmax-based models, class regions are defined implicitly as solutions to systems of inequalities among logits, making them difficult to extract and visualize. We introduce Partition of Unity Neural Networks (PUNN), an architecture in which class probabilities arise directly from a learned partition of unity, without requiring a softmax layer. PUNN constructs $k$ nonnegative functions $h_1, \ldots, h_k$ satisfying $\sum_i h_i(x) = 1$, where each $h_i(x)$ directly represents $P(\text{class } i \mid x)$. Unlike softmax, where class regions are defined implicitly through coupled inequalities among logits, each PUNN partition function $h_i$ directly defines the probability of class $i$ as a standalone function of $x$. We prove that PUNN is dense in the space of continuous probability maps on compact domains. The gate functions $g_i$ that define the partition can use various activation functions (sigmoid, Gaussian, bump) and parameterizations ranging from flexible MLPs to parameter-efficient shape-informed designs (spherical shells, ellipsoids, spherical harmonics). Experiments on synthetic data, UCI benchmarks, and MNIST show that PUNN with MLP-based gates achieves accuracy within 0.3–0.6% of standard multilayer perceptrons. When geometric priors match the data structure, shape-informed gates achieve comparable accuracy with up to 300$\times$ fewer parameters. These results demonstrate that interpretable-by-design architectures can be competitive with black-box models while providing transparent class probability assignments.

[1037] Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

Md Tanvirul Alam, Aritran Piplai, Ionut Cardei, Nidhi Rastogi, Peter J Worth

Main category: cs.LG

TL;DR: Minerva introduces reinforcement learning with verifiable rewards (RLVR) for CTI tasks, using task-specific verifiers and self-training to improve structured output generation from noisy security artifacts.

Details

Motivation: CTI analysts need to convert unstructured security artifacts into standardized representations, but existing LLM approaches are brittle for structured CTI outputs and rely heavily on supervised fine-tuning. CTI standards provide canonical identifiers and schemas that enable deterministic verification of model outputs.

Method: Proposes RLVR (reinforcement learning with verifiable rewards) using task-specific verifiers that score structured outputs and identifier predictions. Introduces Minerva dataset and training pipeline spanning multiple CTI subtasks. Uses lightweight self-training to generate additional verified trajectories to address reward sparsity.

Result: Experiments across LLM backbones show consistent improvements in accuracy and robustness over supervised fine-tuning across multiple benchmarks.

Conclusion: RLVR with verifiable rewards and self-training effectively improves structured output generation for CTI tasks, outperforming traditional supervised approaches.

Abstract: Cyber threat intelligence (CTI) analysts routinely convert noisy, unstructured security artifacts into standardized, automation-ready representations. Although large language models (LLMs) show promise for this task, existing approaches remain brittle when producing structured CTI outputs and have largely relied on supervised fine-tuning (SFT). In contrast, CTI standards and community-maintained resources define canonical identifiers and schemas that enable deterministic verification of model outputs. We leverage this structure to study reinforcement learning with verifiable rewards (RLVR) for CTI tasks. We introduce \textit{Minerva}, a unified dataset and training pipeline spanning multiple CTI subtasks, each paired with task-specific verifiers that score structured outputs and identifier predictions. To address reward sparsity during rollout, we propose a lightweight self-training mechanism that generates additional verified trajectories and distills them back into the model. Experiments across LLM backbones show consistent improvements in accuracy and robustness over SFT across multiple benchmarks.

[1038] Contrastive Learning for Privacy Enhancements in Industrial Internet of Things

Lin Liu, Rita Machacy, Simi Kuniyilh

Main category: cs.LG

TL;DR: A comprehensive review of contrastive learning-based privacy-preserving techniques specifically for Industrial Internet of Things (IIoT) systems, focusing on industrial data characteristics, architectures, and applications.

Details

Motivation: IIoT enables predictive maintenance and optimization in industrial environments but introduces significant privacy risks due to sensitive operational data. Contrastive learning offers promise for privacy-preserving analytics by reducing reliance on labeled data and raw data sharing.

Method: The paper conducts a comprehensive review of existing contrastive learning-based privacy-preserving techniques specifically tailored for IIoT systems, analyzing their application to industrial data characteristics, system architectures, and various industrial scenarios.

Result: The review identifies current solutions, open challenges, and outlines future research directions for privacy preservation in IIoT using contrastive learning techniques.

Conclusion: Contrastive learning shows promise for addressing privacy concerns in IIoT systems, but specific adaptations are needed for industrial contexts, and further research is required to overcome existing challenges.

Abstract: The Industrial Internet of Things (IIoT) integrates intelligent sensing, communication, and analytics into industrial environments, including manufacturing, energy, and critical infrastructure. While IIoT enables predictive maintenance and cross-site optimization of modern industrial control systems, such as those in manufacturing and energy, it also introduces significant privacy and confidentiality risks due to the sensitivity of operational data. Contrastive learning, a self-supervised representation learning paradigm, has recently emerged as a promising approach for privacy-preserving analytics by reducing reliance on labeled data and raw data sharing. Although contrastive learning-based privacy-preserving techniques have been explored in the Internet of Things (IoT) domain, this paper offers a comprehensive review of these techniques specifically for privacy preservation in Industrial Internet of Things (IIoT) systems. It emphasizes the unique characteristics of industrial data, system architectures, and various application scenarios. Additionally, the paper discusses solutions and open challenges and outlines future research directions.

[1039] NEST: Nested Event Stream Transformer for Sequences of Multisets

Minghui Sun, Haoyu Gong, Xingyu You, Jillian Hurst, Benjamin Goldstein, Matthew Engelhard

Main category: cs.LG

TL;DR: NEST is a foundation model for hierarchical event stream data that preserves multiset structure, improving computational efficiency and representation quality over flattened approaches.

Details

Motivation: Existing foundation models for event streams flatten hierarchical structure (sequences of multisets) into 1D sequences, causing computational inefficiency from dense attention, learning spurious within-set relationships, and poor set-level representations from heuristic pooling.

Method: Introduces Nested Event Stream Transformer (NEST) that preserves original hierarchy in architecture. Formulates Masked Set Modeling (MSM) for efficient pretraining that promotes improved set-level representation learning.

Result: Experiments on real-world multiset sequence data show NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.

Conclusion: Preserving hierarchical structure in foundation model architecture provides useful inductive bias that improves computational efficiency and representation quality for event stream data.

Abstract: Event stream data often exhibit hierarchical structure in which multiple events co-occur, resulting in a sequence of multisets (i.e., bags of events). In electronic health records (EHRs), for example, medical events are grouped into a sequence of clinical encounters with well-defined temporal structure, but the order and timing of events within each encounter may be unknown or unreliable. Most existing foundation models (FMs) for event stream data flatten this hierarchy into a one-dimensional sequence, leading to (i) computational inefficiency associated with dense attention and learning spurious within-set relationships, and (ii) lower-quality set-level representations from heuristic post-training pooling for downstream tasks. Here, we show that preserving the original hierarchy in the FM architecture provides a useful inductive bias that improves both computational efficiency and representation quality. We then introduce Nested Event Stream Transformer (NEST), a FM for event streams comprised of sequences of multisets. Building on this architecture, we formulate Masked Set Modeling (MSM), an efficient paradigm that promotes improved set-level representation learning. Experiments on real-world multiset sequence data show that NEST captures real-world dynamics while improving both pretraining efficiency and downstream performance.

[1040] Physiology as Language: Translating Respiration to Sleep EEG

Kaiwen Zha, Chao Li, Hao He, Peng Cao, Tianhong Li, Ali Mirzazadeh, Ellen Zhang, Jong Woo Lee, Yoon Kim, Dina Katabi

Main category: cs.LG

TL;DR: A cross-physiology translation framework that synthesizes sleep EEG from respiration signals using waveform-conditional generation with discrete tokenization, achieving accurate reconstruction and supporting downstream tasks comparable to ground truth EEG.

Details

Motivation: To enable non-invasive, remote neurological assessment during sleep by translating simpler physiological signals (respiration) into more complex brain activity (EEG), addressing the significant complexity gap between modalities.

Method: Proposes a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining EEG target space through discrete tokenization. Trained on over 28,000 individuals’ data.

Result: Achieves 7% Mean Absolute Error in EEG spectrogram reconstruction. Synthesized EEG supports downstream tasks with performance comparable to ground truth: age estimation (MAE 5.0 vs. 5.1 years), sex detection (AUROC 0.81 vs. 0.82), and sleep staging (Accuracy 0.84 vs. 0.88). Generalizes to contactless sensing using wireless radio-frequency reflections.

Conclusion: Demonstrates feasibility of cross-physiology translation from respiration to EEG, enabling remote, non-contact neurological assessment during sleep with performance comparable to ground truth measurements.

Abstract: This paper introduces a novel cross-physiology translation task: synthesizing sleep electroencephalography (EEG) from respiration signals. To address the significant complexity gap between the two modalities, we propose a waveform-conditional generative framework that preserves fine-grained respiratory dynamics while constraining the EEG target space through discrete tokenization. Trained on over 28,000 individuals, our model achieves a 7% Mean Absolute Error in EEG spectrogram reconstruction. Beyond reconstruction, the synthesized EEG supports downstream tasks with performance comparable to ground truth EEG on age estimation (MAE 5.0 vs. 5.1 years), sex detection (AUROC 0.81 vs. 0.82), and sleep staging (Accuracy 0.84 vs. 0.88), significantly outperforming baselines trained directly on breathing. Finally, we demonstrate that the framework generalizes to contactless sensing by synthesizing EEG from wireless radio-frequency reflections, highlighting the feasibility of remote, non-contact neurological assessment during sleep.

[1041] Convergent World Representations and Divergent Tasks

Core Francisco Park

Main category: cs.LG

TL;DR: Multi-task training on geometric tasks produces aligned world representations, but some divergent tasks harm new entity integration during fine-tuning.

Details

Motivation: To understand neural representation geometry and its role in downstream adaptability through controlled experiments with world representations and multi-task training.

Method: Created framework with 5,075 city coordinates as world representation, trained models on 7 geometric tasks using autoregressive training, studied multi-task convergence and fine-tuning adaptation.

Result: Multi-task training drives convergence of world representations, but divergent tasks actively harm representational integration of new entities during fine-tuning.

Conclusion: Training on multiple relational tasks produces convergent world representations, but divergent tasks can catastrophically harm new entity integration via fine-tuning.

Abstract: While neural representations are central to modern deep learning, the conditions governing their geometry and their roles in downstream adaptability remain poorly understood. We develop a framework clearly separating the underlying world, the data generation process and the resulting model representations to study these questions in a controlled setup. 5,075 city coordinates define the world and 7 geometric tasks generate the training data for autoregressive training. We find that different tasks give rise to qualitatively and quantitatively distinct world representation geometries. However, multi-task training drives convergence of world representations: models trained on non-overlapping tasks develop aligned geometric representations, providing controlled evidence for the Multitask Scaling Hypothesis of the Platonic Representation Hypothesis. To study adaptation, we pretrain models on all tasks, then test whether new entities (cities) can be consistently integrated into the representation space via fine-tuning. Surprisingly, we find that despite multi-task pretraining, some tasks, which we call divergent, actively harm the representational integration of new entities and harm generalization. Our results show that training on multiple relational tasks reliably produces convergent world representations, but lurking divergent tasks can catastrophically harm new entity integration via fine-tuning.

[1042] AIRE-Prune: Asymptotic Impulse-Response Energy for State Pruning in State Space Models

Apurba Prasad Padhy, Fernando Camacho, Saibal Mukhopadhyay

Main category: cs.LG

TL;DR: AIRE-Prune is a structured post-training pruning method for state space models that reduces state dimensions by minimizing long-run output-energy distortion using asymptotic impulse-response energy scores.

Details

Motivation: State space models often sacrifice capacity, search space, or stability to manage memory and compute costs of large state dimensions. There's a need for efficient pruning methods that can reduce computational overhead while maintaining model performance.

Method: AIRE-Prune assigns each state a closed-form asymptotic impulse-response energy-based score (total impulse-response energy over infinite horizon), normalizes these scores layer-wise for global cross-layer comparison, and prunes states with lowest energy contributions.

Result: Achieves average pruning of 60.8% across diverse sequence benchmarks with only 0.29% average accuracy drop without retraining, significantly lowering compute requirements.

Conclusion: AIRE-Prune effectively reduces redundancy in SISO and MIMO SSMs while maintaining performance, extending modal truncation principles from single systems to deep stacks and aligning pruning with asymptotic response energy rather than worst-case gain.

Abstract: State space models (SSMs) often sacrifice capacity, search space, or stability to offset the memory and compute costs of large state dimensions. We introduce a structured post-training pruning method for SSMs – AIRE-Prune (Asymptotic Impulse-Response Energy for State PRUN(E)) – that reduces each layer’s state dimension by directly minimizing long-run output-energy distortion. AIRE-Prune assigns every state a closed-form asymptotic impulse-response energy-based score, i.e., the total impulse-response energy it contributes over an infinite horizon (time), and normalizes these scores layer-wise to enable global cross-layer comparison and selection. This extends modal truncation from single systems to deep stacks and aligns pruning with asymptotic response energy rather than worst-case gain. Across diverse sequence benchmarks, AIRE-Prune reveals substantial redundancy in SISO and MIMO SSMs with average pruning of 60.8%, with average accuracy drop of 0.29% without retraining, while significantly lowering compute. Code: https://github.com/falcon-arrow/AIRE-Prune.

[1043] Invertible Memory Flow Networks

Liyu Zerihun, Alexandr Plashchinsky

Main category: cs.LG

TL;DR: IMFN addresses long sequence compression via binary tree factorization with sweeper modules for 2-to-1 compression, achieving O(log N) depth and sublinear error, with distillation to constant-cost recurrent models.

Details

Motivation: Long sequence neural memory is challenging due to vanishing gradients in RNNs and quadratic scaling in Transformers. Compressing long sequences into fixed representations has an intractable optimization landscape.

Method: Invertible Memory Flow Networks decompose long sequence compression into pairwise merges using a binary tree of “sweeper” modules. Each sweeper learns simpler 2-to-1 compression tasks, achieving O(log N) depth with sublinear error accumulation. For online inference, distillation into constant-cost recurrent student achieves O(1) sequential steps.

Result: Empirical validation on long MNIST sequences and UCF-101 videos demonstrates successful compression of high-dimensional data over long sequences.

Conclusion: IMFN makes long sequence compression tractable through factorization and pairwise compression, offering efficient alternatives to RNNs and Transformers for long sequence processing.

Abstract: Long sequence neural memory remains a challenging problem. RNNs and their variants suffer from vanishing gradients, and Transformers suffer from quadratic scaling. Furthermore, compressing long sequences into a finite fixed representation remains an intractable problem due to the difficult optimization landscape. Invertible Memory Flow Networks (IMFN) make long sequence compression tractable through factorization: instead of learning end-to-end compression, we decompose the problem into pairwise merges using a binary tree of “sweeper” modules. Rather than learning to compress long sequences, each sweeper learns a much simpler 2-to-1 compression task, achieving O(log N) depth with sublinear error accumulation in sequence length. For online inference, we distilled into a constant-cost recurrent student achieving O(1) sequential steps. Empirical results validate IMFN on long MNIST sequences and UCF-101 videos, demonstrating compression of high-dimensional data over long sequences.

[1044] OpenDDI: A Comprehensive Benchmark for DDI Prediction

Xinmo Jin, Bowen Fan, Xunkai Li, Henan Sun, YuXin Zeng, Zekai Chen, Yuxuan Sun, Jia Li, Qiangqiang Dai, Hongchao Qin, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: OpenDDI is a comprehensive benchmark for Drug-Drug Interaction prediction that addresses data quality and evaluation standardization challenges by unifying datasets, introducing multimodal drug representations, and providing standardized evaluation protocols.

Details

Motivation: Current DDI prediction faces two fundamental challenges: (1) lack of high-quality data with most studies relying on small-scale datasets and single-modal drug representations, and (2) lack of standardized evaluation with inconsistent scenarios, varied metrics, and diverse baselines.

Method: OpenDDI unifies 6 existing DDI datasets and 2 existing drug representations, contributes 3 new large-scale LLM-augmented datasets, and introduces a new multimodal drug representation covering 5 modalities. It also unifies 20 state-of-the-art model baselines across 3 downstream tasks with standardized evaluation protocols.

Result: The benchmark enables comprehensive evaluation of DDI prediction methods, leading to 10 valuable insights about current limitations and providing critical guidance for the field.

Conclusion: OpenDDI addresses key challenges in DDI prediction by providing standardized data, multimodal representations, and evaluation protocols, serving as a comprehensive benchmark to advance the field.

Abstract: Drug-Drug Interactions (DDIs) significantly influence therapeutic efficacy and patient safety. As experimental discovery is resource-intensive and time-consuming, efficient computational methodologies have become essential. The predominant paradigm formulates DDI prediction as a drug graph-based link prediction task. However, further progress is hindered by two fundamental challenges: (1) lack of high-quality data: most studies rely on small-scale DDI datasets and single-modal drug representations; (2) lack of standardized evaluation: inconsistent scenarios, varied metrics, and diverse baselines. To address the above issues, we propose OpenDDI, a comprehensive benchmark for DDI prediction. Specifically, (1) from the data perspective, OpenDDI unifies 6 widely used DDI datasets and 2 existing forms of drug representation, while additionally contributing 3 new large-scale LLM-augmented datasets and a new multimodal drug representation covering 5 modalities. (2) From the evaluation perspective, OpenDDI unifies 20 SOTA model baselines across 3 downstream tasks, with standardized protocols for data quality, effectiveness, generalization, robustness, and efficiency. Based on OpenDDI, we conduct a comprehensive evaluation and derive 10 valuable insights for DDI prediction while exposing current limitations to provide critical guidance for this rapidly evolving field. Our code is available at https://github.com/xiaoriwuguang/OpenDDI

[1045] One Loss to Rule Them All: Marked Time-to-Event for Structured EHR Foundation Models

Zilin Jing, Vincent Jeanselme, Yuta Kobayashi, Simon A. Lee, Chao Pang, Aparajita Kashyap, Yanwei Li, Xinzhuo Jiang, Shalmali Joshi

Main category: cs.LG

TL;DR: ORA is a new pretraining objective for EHR foundation models that jointly models event timing and continuous measurements using marked time-to-event approach, outperforming traditional next-token prediction methods.

Details

Motivation: Current EHR foundation models use next-token prediction similar to NLP, but this fails to capture the full structure of EHR data which includes irregular sampling, discrete events, and continuous numerical measurements like lab values and treatment dosages.

Method: Proposes ORA, a marked time-to-event pretraining objective that jointly models both event timing and associated continuous measurements, accounting for the temporal irregularity and mixed data types in EHR.

Result: Across multiple datasets, downstream tasks, and model architectures, ORA consistently yields more generalizable representations than next-token prediction and other pretraining losses that ignore continuous measurements, with improvements in classification, regression, and time-to-event prediction.

Conclusion: ORA introduces a new family of EHR foundation models and demonstrates that pretraining objectives accounting for EHR structure are critical for expanding downstream capabilities and generalizability beyond traditional classification tasks.

Abstract: Clinical events captured in Electronic Health Records (EHR) are irregularly sampled and may consist of a mixture of discrete events and numerical measurements, such as laboratory values or treatment dosages. The sequential nature of EHR, analogous to natural language, has motivated the use of next-token prediction to train prior EHR Foundation Models (FMs) over events. However, this training fails to capture the full structure of EHR. We propose ORA, a marked time-to-event pretraining objective that jointly models event timing and associated measurements. Across multiple datasets, downstream tasks, and model architectures, this objective consistently yields more generalizable representations than next-token prediction and pretraining losses that ignore continuous measurements. Importantly, the proposed objective yields improvements beyond traditional classification evaluation, including better regression and time-to-event prediction. Beyond introducing a new family of FMs, our results suggest a broader takeaway: pretraining objectives that account for EHR structure are critical for expanding downstream capabilities and generalizability

[1046] Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Yaoqing Yang

Main category: cs.LG

TL;DR: Deep linear network analysis shows Hessian spectral bifurcation (bulk-and-spike structure) arises from network architecture independent of data imbalance, with spectral gap scaling linearly with depth.

Details

Motivation: Challenge the prevailing view that the "bulk-and-spike" Hessian eigenvalue structure in deep networks is solely due to data covariance imbalance, and investigate whether network architecture itself can cause this spectral bifurcation.

Method: Analyze a deep linear network setup theoretically, proving that even with perfectly balanced data covariance, the Hessian exhibits bifurcation eigenvalue structure with dominant and bulk clusters. Establish scaling relationships between eigenvalues and network depth.

Result: Demonstrate that spectral bifurcation occurs purely from network architecture, independent of data imbalance. Show that the ratio between dominant and bulk eigenvalues scales linearly with network depth, revealing architecture’s strong influence on spectral gap.

Conclusion: Both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks, as architecture alone can create challenging optimization landscapes through Hessian spectral structure.

Abstract: The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented ``bulk-and-spike’’ spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral Bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a Bifurcation eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.

[1047] Contrastive Domain Generalization for Cross-Instrument Molecular Identification in Mass Spectrometry

Seunghyun Yoo, Sanghong Kim, Namkyung Yoon, Hwangnam Kim

Main category: cs.LG

TL;DR: A cross-modal alignment framework that maps mass spectra into molecular structure embedding space for improved generalization in molecular identification from MS data.

Details

Motivation: Existing deep learning approaches for molecular identification from mass spectrometry data treat spectral matching as closed-set recognition, limiting generalization to unseen molecular scaffolds.

Method: Proposes a cross-modal alignment framework that directly maps mass spectra into the chemically meaningful molecular structure embedding space of a pretrained chemical language model.

Result: Achieves 42.2% Top-1 accuracy in fixed 256-way zero-shot retrieval on scaffold-disjoint benchmark, strong generalization in global retrieval, and 95.4% accuracy in 5-way 5-shot molecular re-identification.

Conclusion: Explicitly integrating physical spectral resolution with molecular structure embedding is key to solving the generalization bottleneck in molecular identification from MS data.

Abstract: Identifying molecules from mass spectrometry (MS) data remains a fundamental challenge due to the semantic gap between physical spectral peaks and underlying chemical structures. Existing deep learning approaches often treat spectral matching as a closed-set recognition task, limiting their ability to generalize to unseen molecular scaffolds. To overcome this limitation, we propose a cross-modal alignment framework that directly maps mass spectra into the chemically meaningful molecular structure embedding space of a pretrained chemical language model. On a strict scaffold-disjoint benchmark, our model achieves a Top-1 accuracy of 42.2% in fixed 256-way zero-shot retrieval and demonstrates strong generalization under a global retrieval setting. Moreover, the learned embedding space demonstrates strong chemical coherence, reaching 95.4% accuracy in 5-way 5-shot molecular re-identification. These results suggest that explicitly integrating physical spectral resolution with molecular structure embedding is key to solving the generalization bottleneck in molecular identification from MS data.

[1048] Beyond the Node: Clade-level Selection for Efficient MCTS in Automatic Heuristic Design

Kezhao Lai, Yutao Lai, Hai-Lin Liu

Main category: cs.LG

TL;DR: Clade-AHD improves LLM-based heuristic design by replacing point estimates with Bayesian beliefs to address MCTS over-exploitation under limited computational budgets

Details

Motivation: MCTS in LLM-based Automatic Heuristic Design suffers from over-exploitation tendency under limited computational budgets needed for heuristic evaluation, leading to unreliable decision-making

Method: Replaces node-level point estimates with clade-level Bayesian beliefs, aggregates descendant evaluations into Beta distributions, and performs Thompson Sampling over these beliefs to explicitly model uncertainty

Result: Extensive experiments on complex combinatorial optimization problems show Clade-AHD consistently outperforms state-of-the-art methods while significantly reducing computational cost

Conclusion: Clade-AHD provides an efficient framework for LLM-based heuristic design that better handles uncertainty and computational constraints through Bayesian modeling

Abstract: While Monte Carlo Tree Search (MCTS) shows promise in Large Language Model (LLM) based Automatic Heuristic Design (AHD), it suffers from a critical over-exploitation tendency under the limited computational budgets required for heuristic evaluation. To address this limitation, we propose Clade-AHD, an efficient framework that replaces node-level point estimates with clade-level Bayesian beliefs. By aggregating descendant evaluations into Beta distributions and performing Thompson Sampling over these beliefs, Clade-AHD explicitly models uncertainty to guide exploration, enabling more reliable decision-making under sparse and noisy evaluations. Extensive experiments on complex combinatorial optimization problems demonstrate that Clade-AHD consistently outperforms state-of-the-art methods while significantly reducing computational cost. The source code is publicly available at: https://github.com/Mriya0306/Clade-AHD.

[1049] Forget by Uncertainty: Orthogonal Entropy Unlearning for Quantized Neural Networks

Tian Zhang, Yujia Tong, Junhao Dong, Ke Xu, Yuze Wang, Jingling Yuan

Main category: cs.LG

TL;DR: OEU is an orthogonal entropy unlearning framework for quantized neural networks that achieves genuine forgetting through entropy maximization and gradient orthogonal projection.

Details

Motivation: The need for machine unlearning in quantized models deployed on edge devices under privacy regulations like GDPR, addressing limitations of existing methods that conflate forgetting with misremembering and use ineffective gradient reweighting.

Method: Two key innovations: 1) Entropy-guided unlearning that maximizes prediction uncertainty on forgotten data, and 2) Gradient orthogonal projection that projects forgetting gradients onto the orthogonal complement of retain gradients to eliminate interference.

Result: Extensive experiments show OEU outperforms existing methods in both forgetting effectiveness and retain accuracy, with theoretical guarantees for utility preservation.

Conclusion: OEU provides an effective solution for machine unlearning in quantized models, achieving genuine forgetting while preserving model utility through principled gradient orthogonalization.

Abstract: The deployment of quantized neural networks on edge devices, combined with privacy regulations like GDPR, creates an urgent need for machine unlearning in quantized models. However, existing methods face critical challenges: they induce forgetting by training models to memorize incorrect labels, conflating forgetting with misremembering, and employ scalar gradient reweighting that cannot resolve directional conflicts between gradients. We propose OEU, a novel Orthogonal Entropy Unlearning framework with two key innovations: 1) Entropy-guided unlearning maximizes prediction uncertainty on forgotten data, achieving genuine forgetting rather than confident misprediction, and 2) Gradient orthogonal projection eliminates interference by projecting forgetting gradients onto the orthogonal complement of retain gradients, providing theoretical guarantees for utility preservation under first-order approximation. Extensive experiments demonstrate that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.

[1050] When Classes Evolve: A Benchmark and Framework for Stage-Aware Class-Incremental Learning

Zheng Zhang, Tao Hu, Xueheng Li, Yang Wang, Rui Li, Jie Zhang, Chengjun Xie

Main category: cs.LG

TL;DR: This paper introduces Stage-CIL, a new paradigm for class-incremental learning that addresses intra-class morphological evolution (like a larva turning into a butterfly), and proposes STAGE method with Stage-Bench dataset for evaluation.

Details

Motivation: Traditional class-incremental learning assumes classes are morphologically static, but real-world classes can evolve significantly (e.g., biological metamorphosis). Current methods fail to handle both inter-class discrimination and intra-class morphological adaptation.

Method: Proposes Stage-CIL paradigm and STAGE method that explicitly learns abstract evolution patterns within a fixed-size memory pool, decoupling semantic identity from transformation dynamics to predict future morphologies from earlier representations.

Result: STAGE consistently and substantially outperforms existing state-of-the-art approaches on the proposed Stage-Bench dataset across 10 domains, effectively addressing both inter-class discrimination and intra-class morphological adaptation.

Conclusion: The Stage-CIL paradigm and STAGE method successfully address the challenge of learning evolving classes in incremental learning scenarios, providing a framework for handling both inter-class forgetting and intra-class evolution.

Abstract: Class-Incremental Learning (CIL) aims to sequentially learn new classes while mitigating catastrophic forgetting of previously learned knowledge. Conventional CIL approaches implicitly assume that classes are morphologically static, focusing primarily on preserving previously learned representations as new classes are introduced. However, this assumption neglects intra-class evolution: a phenomenon wherein instances of the same semantic class undergo significant morphological transformations, such as a larva turning into a butterfly. Consequently, a model must both discriminate between classes and adapt to evolving appearances within a single class. To systematically address this challenge, we formalize Stage-Aware CIL (Stage-CIL), a paradigm in which each class is learned progressively through distinct morphological stages. To facilitate rigorous evaluation within this paradigm, we introduce the Stage-Bench, a 10-domain, 2-stages dataset and protocol that jointly measure inter- and intra-class forgetting. We further propose STAGE, a novel method that explicitly learns abstract and transferable evolution patterns within a fixed-size memory pool. By decoupling semantic identity from transformation dynamics, STAGE enables accurate prediction of future morphologies based on earlier representations. Extensive empirical evaluation demonstrates that STAGE consistently and substantially outperforms existing state-of-the-art approaches, highlighting its effectiveness in simultaneously addressing inter-class discrimination and intra-class morphological adaptation.

[1051] Data Distribution as a Lever for Guiding Optimizers Toward Superior Generalization in LLMs

Tushaar Gangavarapu, Jiping Li, Christopher Vattheuer, Zhangyang Wang, Baharan Mirzasoleiman

Main category: cs.LG

TL;DR: Theoretical analysis shows SAM optimizer reduces simplicity bias in LLMs, leading to better generalization; modifying training data distribution to upsample later-learned examples achieves similar benefits more efficiently.

Details

Motivation: To understand why SAM optimizer improves generalization in LLMs and find more efficient alternatives to SAM's computational expense by modifying training data distribution.

Method: Theoretical analysis of in-context linear regression with multi-head linear self-attention, comparing GD and SAM training dynamics; empirical validation by upsampling/augmenting examples learned later in training to reduce simplicity bias.

Result: SAM reduces simplicity bias, explaining its generalization benefits; modifying training data distribution achieves similar generalization improvements with 18% relative accuracy gains on mathematical reasoning tasks across multiple LLMs.

Conclusion: Training data distribution modification can efficiently achieve generalization benefits similar to SAM by reducing simplicity bias, offering practical alternative for LLM training.

Abstract: Can modifying the training data distribution guide optimizers toward solutions with improved generalization when training large language models (LLMs)? In this work, we theoretically analyze an in-context linear regression model with multi-head linear self-attention, and compare the training dynamics of two gradient based optimizers, namely gradient descent (GD) and sharpness-aware minimization (SAM), the latter exhibiting superior generalization properties but is prohibitively expensive for training even medium-sized LLMs. We show, for the first time, that SAM induces a lower simplicity bias (SB)-the tendency of an optimizer to preferentially learn simpler features earlier in training-and identify this reduction as a key factor underlying its improved generalization performance. Motivated by this insight, we demonstrate that altering the training data distribution by upsampling or augmenting examples learned later in training similarly reduces SB and leads to improved generalization. Our extensive experiments show that our strategy improves the performance of multiple LLMs-including Phi2-2.7B , Llama3.2-1B, Gemma3-1B-PT, and Qwen3-0.6B-Base-achieving relative accuracy gains up to 18% when fine-tuned with AdamW and Muon on mathematical reasoning tasks.

[1052] Sparsity-Aware Unlearning for Large Language Models

Yuze Wang, Yujia Tong, Ke Xu, Jingling Yuan, Jiawei Jiang, Chuang Hu

Main category: cs.LG

TL;DR: SAU is a sparsity-aware unlearning method for LLMs that addresses privacy risks by enabling effective forgetting in sparse models through gradient masking and importance-aware redistribution.

Details

Motivation: LLMs memorize sensitive information during training, creating privacy risks. Existing unlearning methods work poorly on sparse models (essential for efficient deployment), as they require updating all parameters but sparsification prunes many weights to zero, limiting forgetting capacity.

Method: Proposes Sparsity-Aware Unlearning (SAU) with two key components: 1) Gradient masking that redirects updates only to surviving (non-zero) weights, decoupling unlearning from sparsification objectives; 2) Importance-aware redistribution to compensate for pruned parameters.

Result: Extensive experiments show SAU significantly outperforms existing unlearning methods on sparse LLMs, achieving effective forgetting while preserving model utility.

Conclusion: SAU successfully addresses the challenge of machine unlearning in sparse LLMs, enabling privacy protection without sacrificing efficiency, which is crucial for practical deployment of large language models.

Abstract: Large Language Models (LLMs) inevitably memorize sensitive information during training, posing significant privacy risks. Machine unlearning has emerged as a promising solution to selectively remove such information without full retraining. However, existing methods are designed for dense models and overlook model sparsification-an essential technique for efficient LLM deployment. We find that unlearning effectiveness degrades substantially on sparse models. Through empirical analysis, we reveal that this degradation occurs because existing unlearning methods require updating all parameters, yet sparsification prunes substantial weights to zero, fundamentally limiting the model’s forgetting capacity. To address this challenge, we propose Sparsity-Aware Unlearning (SAU), which decouples unlearning from sparsification objectives through gradient masking that redirects updates to surviving weights, combined with importance-aware redistribution to compensate for pruned parameters. Extensive experiments demonstrate that SAU significantly outperforms existing methods on sparse LLMs, achieving effective forgetting while preserving model utility.

[1053] Bridging Time and Frequency: A Joint Modeling Framework for Irregular Multivariate Time Series Forecasting

Xiangfei Qiu, Kangjia Yan, Xvyuan Liu, Xingjian Wu, Jilin Hu

Main category: cs.LG

TL;DR: TFMixer: A joint time-frequency modeling framework for irregular multivariate time series forecasting using learnable NUDFT for global frequency analysis and patch mixing for local temporal modeling.

Details

Motivation: Irregular multivariate time series forecasting is challenging due to non-uniform sampling and variable asynchronicity, which violate standard models' equidistant assumptions and hinder both local temporal modeling and global periodic structure capture.

Method: TFMixer combines a Global Frequency Module with learnable Non-Uniform Discrete Fourier Transform (NUDFT) to extract spectral representations from irregular timestamps, and a Local Time Module with query-based patch mixing to adaptively aggregate temporal patches. The framework fuses time-domain and frequency-domain representations and uses inverse NUDFT for seasonal extrapolation.

Result: Extensive experiments on real-world datasets demonstrate state-of-the-art performance for irregular multivariate time series forecasting.

Conclusion: TFMixer effectively addresses irregular multivariate time series forecasting challenges through joint time-frequency modeling, achieving superior performance by combining global frequency analysis with local temporal patch mixing.

Abstract: Irregular multivariate time series forecasting (IMTSF) is challenging due to non-uniform sampling and variable asynchronicity. These irregularities violate the equidistant assumptions of standard models, hindering local temporal modeling and rendering classical frequency-domain methods ineffective for capturing global periodic structures. To address this challenge, we propose TFMixer, a joint time-frequency modeling framework for IMTS forecasting. Specifically, TFMixer incorporates a Global Frequency Module that employs a learnable Non-Uniform Discrete Fourier Transform (NUDFT) to directly extract spectral representations from irregular timestamps. In parallel, the Local Time Module introduces a query-based patch mixing mechanism to adaptively aggregate informative temporal patches and alleviate information density imbalance. Finally, TFMixer fuses the time-domain and frequency-domain representations to generate forecasts and further leverages inverse NUDFT for explicit seasonal extrapolation. Extensive experiments on real-world datasets demonstrate the state–of-the-art performance of TFMixer.

[1054] Safe Langevin Soft Actor Critic

Mahesh Keswani, Samyak Jain, Raunak P. Bhattacharyya

Main category: cs.LG

TL;DR: SL-SAC is a constrained RL algorithm that uses parameter-space exploration with Langevin dynamics and distributional risk control via CVaR optimization to balance reward and safety, achieving lower costs on safety benchmarks.

Details

Motivation: Constrained RL faces challenges with poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distributions, which limits safety performance.

Method: Combines three mechanisms: 1) Adaptive Stochastic Gradient Langevin Dynamics for reward critics to escape poor optima, 2) distributional cost estimation via Implicit Quantile Networks with CVaR optimization for tail-risk mitigation, and 3) reactive Lagrangian relaxation based on empirical CVaR of episodic costs.

Result: Achieves lowest cost in 7 out of 10 Safety-Gymnasium tasks while maintaining competitive returns, with 19-63% cost reductions in velocity tasks compared to state-of-the-art baselines.

Conclusion: SL-SAC effectively addresses constrained RL challenges through parameter-space exploration and distributional risk control, providing theoretical guarantees and practical improvements in safety-critical tasks.

Abstract: Balancing reward and safety in constrained reinforcement learning remains challenging due to poor generalization from sharp value minima and inadequate handling of heavy-tailed risk distribution. We introduce Safe Langevin Soft Actor-Critic (SL-SAC), a principled algorithm that addresses both issues through parameter-space exploration and distributional risk control. Our approach combines three key mechanisms: (1) Adaptive Stochastic Gradient Langevin Dynamics (aSGLD) for reward critics, promoting ensemble diversity and escape from poor optima; (2) distributional cost estimation via Implicit Quantile Networks (IQN) with Conditional Value-at-Risk (CVaR) optimization for tail-risk mitigation; and (3) a reactive Lagrangian relaxation scheme that adapts constraint enforcement based on the empirical CVaR of episodic costs. We provide theoretical guarantees on CVaR estimation error and demonstrate that CVaR-based Lagrange updates yield stronger constraint violation signals than expected-cost updates. On Safety-Gymnasium benchmarks, SL-SAC achieves the lowest cost in 7 out of 10 tasks while maintaining competitive returns, with cost reductions of 19-63% in velocity tasks compared to state-of-the-art baselines.

[1055] SEER: Transformer-based Robust Time Series Forecasting via Automated Patch Enhancement and Replacement

Xiangfei Qiu, Xvyuan Liu, Tianen Shen, Xingjian Wu, Hanyin Cheng, Bin Yang, Jilin Hu

Main category: cs.LG

TL;DR: SEER is a robust time series forecasting framework that addresses low-quality data issues through dynamic patch selection and replacement mechanisms.

Details

Motivation: Existing patch-based time series methods use all patches indiscriminately, but real-world data often has quality issues (missing values, noise, anomalies) that make some patches contain low-quality information, negatively impacting predictions.

Method: Two main components: 1) Augmented Embedding Module using Mixture-of-Experts for patch representations and channel-adaptive perception for series-wise tokens; 2) Learnable Patch Replacement Module with dynamic filtering to eliminate negative patches and replaced attention to substitute low-quality patches with global series-wise tokens.

Result: Comprehensive experiments demonstrate state-of-the-art (SOTA) performance in time series forecasting.

Conclusion: SEER effectively addresses data quality issues in time series forecasting through dynamic patch selection and replacement, improving robustness and accuracy.

Abstract: Time series forecasting is important in many fields that require accurate predictions for decision-making. Patching techniques, commonly used and effective in time series modeling, help capture temporal dependencies by dividing the data into patches. However, existing patch-based methods fail to dynamically select patches and typically use all patches during the prediction process. In real-world time series, there are often low-quality issues during data collection, such as missing values, distribution shifts, anomalies and white noise, which may cause some patches to contain low-quality information, negatively impacting the prediction results. To address this issue, this study proposes a robust time series forecasting framework called SEER. Firstly, we propose an Augmented Embedding Module, which improves patch-wise representations using a Mixture-of-Experts (MoE) architecture and obtains series-wise token representations through a channel-adaptive perception mechanism. Secondly, we introduce a Learnable Patch Replacement Module, which enhances forecasting robustness and model accuracy through a two-stage process: 1) a dynamic filtering mechanism eliminates negative patch-wise tokens; 2) a replaced attention module substitutes the identified low-quality patches with global series-wise token, further refining their representations through a causal attention mechanism. Comprehensive experimental results demonstrate the SOTA performance of SEER.

[1056] Kernelized Edge Attention: Addressing Semantic Attention Blurring in Temporal Graph Neural Networks

Govind Waghmare, Srini Rohan Gujulla Leel, Nikhil Tumbde, Sumedh B G, Sonia Gupta, Srikanta Bedathur

Main category: cs.LG

TL;DR: KEAT introduces kernelized edge attention for temporal graphs to address semantic attention blurring by modulating edge features with continuous-time kernels, preserving distinct temporal behaviors of nodes and edges.

Details

Motivation: Current TGNNs fail to distinguish between slowly evolving node embeddings and rapidly changing edge features, leading to semantic attention blurring where attention weights cannot capture distinct temporal behaviors, limiting fine-grained temporal dependency modeling and interpretability.

Method: KEAT uses kernelized edge attention that modulates edge features with continuous-time kernels (Laplacian, RBF, and learnable MLP variants) to preserve distinct roles of nodes and edges, compatible with both Transformer-style and message-passing architectures.

Result: Achieves up to 18% MRR improvement over DyGFormer and 7% over TGN on link prediction tasks, enabling more accurate, interpretable, and temporally aware message passing in TGNNs.

Conclusion: KEAT successfully addresses semantic attention blurring in temporal graphs through kernelized edge attention, improving both performance and interpretability while maintaining compatibility with existing TGNN architectures.

Abstract: Temporal Graph Neural Networks (TGNNs) aim to capture the evolving structure and timing of interactions in dynamic graphs. Although many models incorporate time through encodings or architectural design, they often compute attention over entangled node and edge representations, failing to reflect their distinct temporal behaviors. Node embeddings evolve slowly as they aggregate long-term structural context, while edge features reflect transient, timestamped interactions (e.g. messages, trades, or transactions). This mismatch results in semantic attention blurring, where attention weights cannot distinguish between slowly drifting node states and rapidly changing, information-rich edge interactions. As a result, models struggle to capture fine-grained temporal dependencies and provide limited transparency into how temporal relevance is computed. This paper introduces KEAT (Kernelized Edge Attention for Temporal Graphs), a novel attention formulation that modulates edge features using a family of continuous-time kernels, including Laplacian, RBF, and learnable MLP variant. KEAT preserves the distinct roles of nodes and edges, and integrates seamlessly with both Transformer-style (e.g., DyGFormer) and message-passing (e.g., TGN) architectures. It achieves up to 18% MRR improvement over the recent DyGFormer and 7% over TGN on link prediction tasks, enabling more accurate, interpretable and temporally aware message passing in TGNNs.

[1057] Direct Preference Optimization with Rating Information: Practical Algorithms and Provable Gains

Luca Viano, Ruida Zhou, Yifan Sun, Mahdi Namazifar, Volkan Cevher, Shoham Sabach, Mohammad Ghavamzadeh

Main category: cs.LG

TL;DR: Proposes improved preference optimization algorithms that leverage rating gap information alongside pairwise preferences to achieve faster statistical rates and robust performance across LLMs.

Details

Motivation: DPO algorithms use limited pairwise preference feedback, but this ambiguity in response quality measurement can be problematic. The authors aim to leverage additional rating gap information to improve alignment while maintaining robustness to inaccuracies.

Method: Develop new algorithms that incorporate rating gap information (how much better chosen response is than rejected one) alongside pairwise preferences. Theoretically analyze statistical rates and robustness to inaccurate rating gaps.

Result: Algorithms achieve faster statistical rates than DPO when accurate rating gap information is available. Performance remains robust to rating gap inaccuracies. Outperforms DPO-style algorithms across various LLMs and evaluation benchmarks.

Conclusion: Rating gap information can significantly improve preference optimization for foundation model alignment while maintaining robustness, offering practical advantages over standard DPO approaches.

Abstract: The class of direct preference optimization (DPO) algorithms has emerged as a promising approach for solving the alignment problem in foundation models. These algorithms work with very limited feedback in the form of pairwise preferences and fine-tune models to align with these preferences without explicitly learning a reward model. While the form of feedback used by these algorithms makes the data collection process easy and relatively more accurate, its ambiguity in terms of the quality of responses could have negative implications. For example, it is not clear if a decrease (increase) in the likelihood of preferred (dispreferred) responses during the execution of these algorithms could be interpreted as a positive or negative phenomenon. In this paper, we study how to design algorithms that can leverage additional information in the form of rating gap, which informs the learner how much the chosen response is better than the rejected one. We present new algorithms that can achieve faster statistical rates than DPO in presence of accurate rating gap information. Moreover, we theoretically prove and empirically show that the performance of our algorithms is robust to inaccuracy in rating gaps. Finally, we demonstrate the solid performance of our methods in comparison to a number of DPO-style algorithms across a wide range of LLMs and evaluation benchmarks.

[1058] Actor-Dual-Critic Dynamics for Zero-sum and Identical-Interest Stochastic Games

Ahmed Said Donmez, Yuksel Arslantas, Muhammed O. Sayin

Main category: cs.LG

TL;DR: A novel independent, payoff-based learning framework for stochastic games that is model-free, game-agnostic, and gradient-free, using a best-response-type actor-critic architecture with fast and slow critics.

Details

Motivation: To develop a decentralized learning algorithm for stochastic games that doesn't require model knowledge, is applicable to different game types, and provides theoretical convergence guarantees while being practical for real-world applications.

Method: Uses an actor-critic architecture where agents update strategies based on two critics: a fast critic that responds intuitively to observed payoffs with limited information, and a slow critic that deliberatively approximates the dynamic programming solution. Learning relies on non-equilibrium adaptation through smoothed best responses to observed payoffs.

Result: Establishes convergence to (approximate) equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games over infinite horizons. Empirical results validate robustness and effectiveness across both game classes.

Conclusion: Provides one of the first payoff-based, fully decentralized learning algorithms with theoretical guarantees for both zero-sum and identical-interest stochastic games, offering a practical model-free approach.

Abstract: We propose a novel independent and payoff-based learning framework for stochastic games that is model-free, game-agnostic, and gradient-free. The learning dynamics follow a best-response-type actor-critic architecture, where agents update their strategies (actors) using feedback from two distinct critics: a fast critic that intuitively responds to observed payoffs under limited information, and a slow critic that deliberatively approximates the solution to the underlying dynamic programming problem. Crucially, the learning process relies on non-equilibrium adaptation through smoothed best responses to observed payoffs. We establish convergence to (approximate) equilibria in two-agent zero-sum and multi-agent identical-interest stochastic games over an infinite horizon. This provides one of the first payoff-based and fully decentralized learning algorithms with theoretical guarantees in both settings. Empirical results further validate the robustness and effectiveness of the proposed approach across both classes of games.

[1059] Rethinking Zero-Shot Time Series Classification: From Task-specific Classifiers to In-Context Inference

Juntao Fang, Shifeng Xie, Shengbin Nie, Yuhui Ling, Yuming Liu, Zijian Li, Keli Zhang, Lujia Pan, Themis Palpanas, Ruichu Cai

Main category: cs.LG

TL;DR: TIC-FM: A training-free in-context learning framework for zero-shot time series classification that uses labeled data as context without parameter updates, outperforming frozen encoder + classifier approaches.

Details

Motivation: Current zero-shot evaluation of time series foundation models violates the training-free premise by using frozen encoders with task-specific classifiers, introducing evaluation bias from classifier-dependent training choices.

Method: Proposes TIC-FM with time series encoder + lightweight projection adapter + split-masked latent memory Transformer; treats labeled training set as context and predicts all test labels in single forward pass without parameter updates.

Result: Experiments on 128 UCR datasets show strong accuracy with consistent gains in extreme low-label situations, demonstrating effective training-free transfer.

Conclusion: In-context learning can subsume trained classifiers and emulate gradient-based training within single forward pass, providing truly training-free zero-shot evaluation for time series foundation models.

Abstract: The zero-shot evaluation of time series foundation models (TSFMs) for classification typically uses a frozen encoder followed by a task-specific classifier. However, this practice violates the training-free premise of zero-shot deployment and introduces evaluation bias due to classifier-dependent training choices. To address this issue, we propose TIC-FM, an in-context learning framework that treats the labeled training set as context and predicts labels for all test instances in a single forward pass, without parameter updates. TIC-FM pairs a time series encoder and a lightweight projection adapter with a split-masked latent memory Transformer. We further provide theoretical justification that in-context inference can subsume trained classifiers and can emulate gradient-based classifier training within a single forward pass. Experiments on 128 UCR datasets show strong accuracy, with consistent gains in the extreme low-label situation, highlighting training-free transfer

[1060] MoDEx: Mixture of Depth-specific Experts for Multivariate Long-term Time Series Forecasting

Hyekyung Yoon, Minhyuk Lee, Imseung Park, Myungjoo Kang

Main category: cs.LG

TL;DR: MoDEx: A lightweight Mixture of Depth-specific Experts for multivariate long-term time series forecasting that replaces complex backbones with specialized MLP experts based on layer sensitivity analysis.

Details

Motivation: Existing LTSF paradigms use a three-stage pipeline but the behaviors of individual backbone layers remain underexplored. The authors aim to understand depth-specific specialization in modeling temporal dynamics and leverage this insight to create more efficient forecasting models.

Method: Introduces layer sensitivity, a gradient-based metric inspired by GradCAM and effective receptive field theory to quantify contributions of time points to layer features. Uses this analysis to propose MoDEx - a Mixture of Depth-specific Experts that replaces complex backbones with depth-specific MLP experts.

Result: Achieves state-of-the-art accuracy on seven real-world benchmarks, ranking first in 78% of cases, while using significantly fewer parameters and computational resources. Also integrates seamlessly into transformer variants, consistently boosting their performance.

Conclusion: MoDEx demonstrates robust generalizability as an efficient and high-performance LTSF framework that can enhance existing transformer-based models while being computationally efficient.

Abstract: Multivariate long-term time series forecasting (LTSF) supports critical applications such as traffic-flow management, solar-power scheduling, and electricity-transformer monitoring. The existing LTSF paradigms follow a three-stage pipeline of embedding, backbone refinement, and long-horizon prediction. However, the behaviors of individual backbone layers remain underexplored. We introduce layer sensitivity, a gradient-based metric inspired by GradCAM and effective receptive field theory, which quantifies both positive and negative contributions of each time point to a layer’s latent features. Applying this metric to a three-layer MLP backbone reveals depth-specific specialization in modeling temporal dynamics in the input sequence. Motivated by these insights, we propose MoDEx, a lightweight Mixture of Depth-specific Experts, which replaces complex backbones with depth-specific MLP experts. MoDEx achieves state-of-the-art accuracy on seven real-world benchmarks, ranking first in 78 percent of cases, while using significantly fewer parameters and computational resources. It also integrates seamlessly into transformer variants, consistently boosting their performance and demonstrating robust generalizability as an efficient and high-performance LTSF framework.

[1061] From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs

Louis Schiekiera, Max Zimmer, Christophe Roux, Sebastian Pokutta, Fritz Günther

Main category: cs.LG

TL;DR: Behavioral experiments (forced-choice and free association) can recover information about LLM hidden-state geometry, with forced-choice tasks showing stronger alignment than free association.

Details

Motivation: To understand whether psycholinguistic behavioral experiments can reveal information about the internal semantic geometry of LLMs, specifically whether behavior-based similarity measurements can recover hidden-state representations.

Method: Used eight instruction-tuned transformer models, ran two experimental paradigms (similarity-based forced choice and free association) over 5,000-word vocabulary, collected 17.5M+ trials to build behavior-based similarity matrices, and used representational similarity analysis to compare behavioral geometries to layerwise hidden-state similarity.

Result: Forced-choice behavior aligns substantially more with hidden-state geometry than free association. Behavioral similarity (especially forced choice) predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus, indicating behavior-only measurements retain recoverable information about internal semantic geometry.

Conclusion: Behavioral tasks can uncover information about LLMs’ internal cognitive states, with forced-choice paradigms being particularly effective at recovering hidden-state geometry, suggesting psycholinguistic experiments provide meaningful insights into model representations.

Abstract: We investigate the extent to which an LLM’s hidden-state geometry can be recovered from its behavior in psycholinguistic experiments. Across eight instruction-tuned transformer models, we run two experimental paradigms – similarity-based forced choice and free association – over a shared 5,000-word vocabulary, collecting 17.5M+ trials to build behavior-based similarity matrices. Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus. We find that forced-choice behavior aligns substantially more with hidden-state geometry than free association. In a held-out-words regression, behavioral similarity (especially forced choice) predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus, indicating that behavior-only measurements retain recoverable information about internal semantic geometry. Finally, we discuss implications for the ability of behavioral tasks to uncover hidden cognitive states.

[1062] Equilibrium of Feasible Zone and Uncertain Model in Safe Exploration

Yujie Yang, Zhilong Zheng, Shengbo Eben Li

Main category: cs.LG

TL;DR: The paper proposes SEE, a safe equilibrium exploration framework for RL that finds the maximum feasible zone by balancing exploration with environment model accuracy through iterative refinement.

Details

Motivation: Current safe RL approaches limit exploration to feasible zones, but lack understanding of what the maximum feasible zone is and how to identify it. The paper aims to establish that safe exploration's goal is finding equilibrium between feasible zone size and environment model accuracy.

Method: Proposes SEE (safe equilibrium exploration) framework that alternates between finding maximum feasible zone and least uncertain model. Uses graph formulation of uncertain model, with theoretical guarantees of monotonic refinement and convergence to equilibrium.

Result: Experiments on classic control tasks show SEE successfully expands feasible zones with zero constraint violation, achieving safe exploration equilibrium within few iterations.

Conclusion: The paper establishes equilibrium as fundamental goal of safe exploration, provides first framework to achieve it, and demonstrates practical effectiveness in control tasks.

Abstract: Ensuring the safety of environmental exploration is a critical problem in reinforcement learning (RL). While limiting exploration to a feasible zone has become widely accepted as a way to ensure safety, key questions remain unresolved: what is the maximum feasible zone achievable through exploration, and how can it be identified? This paper, for the first time, answers these questions by revealing that the goal of safe exploration is to find the equilibrium between the feasible zone and the environment model. This conclusion is based on the understanding that these two components are interdependent: a larger feasible zone leads to a more accurate environment model, and a more accurate model, in turn, enables exploring a larger zone. We propose the first equilibrium-oriented safe exploration framework called safe equilibrium exploration (SEE), which alternates between finding the maximum feasible zone and the least uncertain model. Using a graph formulation of the uncertain model, we prove that the uncertain model obtained by SEE is monotonically refined, the feasible zones monotonically expand, and both converge to the equilibrium of safe exploration. Experiments on classic control tasks show that our algorithm successfully expands the feasible zones with zero constraint violation, and achieves the equilibrium of safe exploration within a few iterations.

[1063] Combinatorial Bandit Bayesian Optimization for Tensor Outputs

Jingru Huang, Haijie Xu, Jie Guo, Manrui Jiang, Chen Zhang

Main category: cs.LG

TL;DR: A novel Bayesian optimization method for tensor-output functions with theoretical guarantees and extensions to combinatorial bandit settings.

Details

Motivation: Existing Bayesian optimization methods only handle scalar or vector outputs, but many real-world problems involve tensor outputs (e.g., multi-dimensional data, multi-task learning). There's also a need to handle combinatorial settings where only subsets of tensor outputs contribute to the objective.

Method: Proposes tensor-output Gaussian process (TOGP) with tensor-output kernels to model structural dependencies in tensor outputs. Develops UCB acquisition function for standard BO. Extends to combinatorial bandit BO (CBBO) with partially observed outputs using CMAB-UCB2 criterion for joint selection of queried points and optimal output subsets.

Result: Establishes theoretical regret bounds ensuring sublinear performance. Extensive experiments on synthetic and real-world datasets demonstrate superiority over existing methods.

Conclusion: The proposed tensor-output BO methods effectively handle tensor-valued functions and combinatorial settings with theoretical guarantees and practical advantages.

Abstract: Bayesian optimization (BO) has been widely used to optimize expensive and black-box functions across various domains. Existing BO methods have not addressed tensor-output functions. To fill this gap, we propose a novel tensor-output BO method. Specifically, we first introduce a tensor-output Gaussian process (TOGP) with two classes of tensor-output kernels as a surrogate model of the tensor-output function, which can effectively capture the structural dependencies within the tensor. Based on it, we develop an upper confidence bound (UCB) acquisition function to select the queried points. Furthermore, we introduce a more complex and practical problem setting, named combinatorial bandit Bayesian optimization (CBBO), where only a subset of the outputs can be selected to contribute to the objective function. To tackle this, we propose a tensor-output CBBO method, which extends TOGP to handle partially observed outputs, and accordingly design a novel combinatorial multi-arm bandit-UCB2 (CMAB-UCB2) criterion to sequentially select both the queried points and the optimal output subset. Theoretical regret bounds for the two methods are established, ensuring their sublinear performance. Extensive synthetic and real-world experiments demonstrate their superiority.

[1064] CoRe-Fed: Bridging Collaborative and Representation Fairness via Federated Embedding Distillation

Noorain Mukhtiar, Adnan Mahmood, Quan Z. Sheng

Main category: cs.LG

TL;DR: CoRe-Fed is a federated learning framework that addresses fairness issues through embedding alignment and contribution-aware aggregation to reduce representation and collaborative biases.

Details

Motivation: Federated learning suffers from performance disparities due to heterogeneous data distributions and unequal client participation, leading to unfair outcomes through representation bias (misaligned client representations) and collaborative bias (inequitable contribution during aggregation).

Method: Proposes CoRe-Fed with two key components: 1) alignment-driven mechanism for semantic consistency between local and global embeddings to reduce representational divergence, and 2) dynamic reward-penalty-based aggregation strategy that adjusts client weights based on participation history and embedding alignment.

Result: Extensive experiments across diverse models and datasets demonstrate that CoRe-Fed improves both fairness and model performance over state-of-the-art baseline algorithms.

Conclusion: CoRe-Fed provides a unified optimization framework that effectively bridges collaborative and representation fairness in federated learning through embedding-level regularization and fairness-aware aggregation.

Abstract: With the proliferation of distributed data sources, Federated Learning (FL) has emerged as a key approach to enable collaborative intelligence through decentralized model training while preserving data privacy. However, conventional FL algorithms often suffer from performance disparities across clients caused by heterogeneous data distributions and unequal participation, which leads to unfair outcomes. Specifically, we focus on two core fairness challenges, i.e., representation bias, arising from misaligned client representations, and collaborative bias, stemming from inequitable contribution during aggregation, both of which degrade model performance and generalizability. To mitigate these disparities, we propose CoRe-Fed, a unified optimization framework that bridges collaborative and representation fairness via embedding-level regularization and fairness-aware aggregation. Initially, an alignment-driven mechanism promotes semantic consistency between local and global embeddings to reduce representational divergence. Subsequently, a dynamic reward-penalty-based aggregation strategy adjusts each client’s weight based on participation history and embedding alignment to ensure contribution-aware aggregation. Extensive experiments across diverse models and datasets demonstrate that CoRe-Fed improves both fairness and model performance over the state-of-the-art baseline algorithms.

[1065] PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting

Jiaming Ma, Guanjun Wang, Qihe Huang, Sheng Huang, Haofeng Ma, Zhengyang Zhou, Pengkun Wang, Binwu Wang, Yang Wang

Main category: cs.LG

TL;DR: PHAT is a transformer-based model for multivariate time series forecasting that addresses periodic heterogeneity by organizing data into periodic buckets and using positive-negative attention mechanisms.

Details

Motivation: Existing time series forecasting models fail to capture periodic heterogeneity where different variables exhibit distinct and dynamically changing periods, which is common in real-world data.

Method: PHAT organizes multivariate inputs into 3D periodic buckets (variate groups by periodicity, time steps by phase, offsets within period), restricts interactions within buckets, and uses positive-negative attention with periodic alignment and deviation perspectives.

Result: PHAT significantly outperforms 18 baselines on 14 real-world datasets, achieving highly competitive forecasting performance.

Conclusion: PHAT effectively captures periodic heterogeneity in multivariate time series through its bucket organization and attention mechanisms, leading to superior forecasting performance.

Abstract: While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variates exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional “periodic bucket” tensor, where the dimensions correspond to variate group characteristics with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends. A mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our sources is available at GitHub.

[1066] Riemannian Flow Matching for Disentangled Graph Domain Adaptation

Yingxu Wang, Xinwang Liu, Mengzhu Wang, Siyang Gao, Nan Yin

Main category: cs.LG

TL;DR: DisRFM is a geometry-aware Graph Domain Adaptation framework that uses Riemannian manifolds and flow matching to address structural degeneration and optimization instability in adversarial graph alignment.

Details

Motivation: Traditional Graph Domain Adaptation using adversarial learning in Euclidean space suffers from structural degeneration (entangled hierarchical and semantic representations) and optimization instability from minimax adversarial training dynamics.

Method: Embeds graphs into Riemannian manifold using polar coordinates to disentangle structure (radius) from semantics (angle). Uses radial Wasserstein alignment for topology preservation and angular clustering for semantic discrimination. Employs Riemannian flow matching for stable feature transport along geodesic paths.

Result: Extensive experiments show DisRFM consistently outperforms state-of-the-art methods. Theoretically proven asymptotic stability of flow matching and tighter bound for target risk.

Conclusion: DisRFM effectively addresses key challenges in GDA through geometric disentanglement and stable flow-based transport, providing a robust framework for graph domain adaptation.

Abstract: Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. However, this paradigm suffers from two critical challenges: Structural Degeneration, where hierarchical and semantic representations are entangled, and Optimization Instability, which arises from oscillatory dynamics of minimax adversarial training. To tackle these issues, we propose DisRFM, a geometry-aware GDA framework that unifies Riemannian embedding and flow-based transport. First, to overcome structural degeneration, we embed graphs into a Riemannian manifold. By adopting polar coordinates, we explicitly disentangle structure (radius) from semantics (angle). Then, we enforce topology preservation through radial Wasserstein alignment and semantic discrimination via angular clustering, thereby preventing feature entanglement and collapse. Second, we address the instability of adversarial alignment by using Riemannian flow matching. This method learns a smooth vector field to guide source features toward the target along geodesic paths, guaranteeing stable convergence. The geometric constraints further guide the flow to maintain the disentangled structure during transport. Theoretically, we prove the asymptotic stability of the flow matching and derive a tighter bound for the target risk. Extensive experiments demonstrate that DisRFM consistently outperforms state-of-the-art methods.

[1067] Three-Way Emotion Classification of EEG-based Signals using Machine Learning

Ashna Purwar, Gaurav Simkar, Madhumita, Sachin Kadam

Main category: cs.LG

TL;DR: EEG-based emotion classification using ML models (LR, SVM, RF) on three emotion classes (Negative, Neutral, Positive), with RF achieving best performance.

Details

Motivation: EEG signals directly reflect brain activity and can reveal emotional states, making EEG-based emotion recognition valuable for emotion-aware systems. The paper aims to determine which ML model works best for three-way emotion classification of EEG signals.

Method: Complete workflow including data preprocessing and comparison of three ML models: logistic regression (LR), support vector machine (SVM), and random forest (RF). Models are trained and tested on a limited EEG dataset with three emotion classes. Performance evaluated using accuracy and F1-score.

Result: ML models can effectively classify EEG signals into three emotion categories. Random forest (RF) gave the best results with higher accuracy and F1-score than LR and SVM. RF also outperformed existing state-of-the-art classification models in terms of accuracy.

Conclusion: RF model captures emotional patterns more accurately and effectively than LR and SVM for EEG-based emotion classification. ML models are viable for three-way emotion classification of EEG signals.

Abstract: Electroencephalography (EEG) is a widely used technique for measuring brain activity. EEG-based signals can reveal a persons emotional state, as they directly reflect activity in different brain regions. Emotion-aware systems and EEG-based emotion recognition are a growing research area. This paper presents how machine learning (ML) models categorize a limited dataset of EEG signals into three different classes, namely Negative, Neutral, or Positive. It also presents the complete workflow, including data preprocessing and comparison of ML models. To understand which ML classification model works best for this kind of problem, we train and test the following three commonly used models: logistic regression (LR), support vector machine (SVM), and random forest (RF). The performance of each is evaluated with respect to accuracy and F1-score. The results indicate that ML models can be effectively utilized for three-way emotion classification of EEG signals. Among the three ML models trained on the available dataset, the RF model gave the best results. Its higher accuracy and F1-score suggest that it is able to capture the emotional patterns more accurately and effectively than the other two models. The RF model also outperformed the existing state-of-the-art classification models in terms of the accuracy parameter.

[1068] Strong Linear Baselines Strike Back: Closed-Form Linear Models as Gaussian Process Conditional Density Estimators for TSAD

Aleksandr Yugay, Hang Cui, Changhua Pei, Alexey Zaytsev

Main category: cs.LG

TL;DR: Simple linear autoregressive model with OLS regression matches or outperforms complex deep learning methods for time series anomaly detection while being computationally efficient.

Details

Motivation: Current TSAD research focuses on increasingly complex neural architectures that are hard to train and expensive to infer, but simple linear models may be surprisingly effective and efficient.

Method: Proposes using linear autoregressive anomaly score with closed-form ordinary least squares (OLS) regression solution, which estimates a finite-history Gaussian process conditional density.

Result: Across extensive univariate and multivariate benchmarks, the linear approach achieves superior accuracy while requiring orders of magnitude fewer computational resources compared to state-of-the-art deep detectors.

Conclusion: Future TSAD research should include strong linear baselines and develop new benchmarks with richer temporal structures to better demonstrate advantages of deep learning models.

Abstract: Research in time series anomaly detection (TSAD) has largely focused on developing increasingly sophisticated, hard-to-train, and expensive-to-infer neural architectures. We revisit this paradigm and show that a simple linear autoregressive anomaly score with the closed-form solution provided by ordinary least squares (OLS) regression consistently matches or outperforms state-of-the-art deep detectors. From a theoretical perspective, we show that linear models capture a broad class of anomaly types, estimating a finite-history Gaussian process conditional density. From a practical side, across extensive univariate and multivariate benchmarks, the proposed approach achieves superior accuracy while requiring orders of magnitude fewer computational resources. Thus, future research should consistently include strong linear baselines and, more importantly, develop new benchmarks with richer temporal structures pinpointing the advantages of deep learning models.

[1069] Provably Protecting Fine-Tuned LLMs from Training Data Extraction

Tom Segal, Asaf Shabtai, Yuval Elovici

Main category: cs.LG

TL;DR: SCP-Δr is a privacy-preserving fine-tuning method for LLMs that selectively smooths low-impact token probabilities while preserving influential deviations, achieving strong protection against training data extraction attacks with minimal utility loss.

Details

Motivation: Fine-tuning LLMs on sensitive data creates privacy risks from training data extraction attacks, but existing defenses either lack formal privacy guarantees or cause significant utility degradation.

Method: SCP-Δr uses Near Access Freeness (NAF) principles, operating on relative probabilities to identify and preserve only influential token-level deviations while aggressively smoothing low-impact tokens using a base model.

Result: The method achieves orders-of-magnitude better theoretical bounds than existing NAF approaches and provides strong empirical protection against training data extraction attacks with minimal performance loss.

Conclusion: Selective smoothing of token-level probability shifts enables effective privacy protection during LLM fine-tuning while maintaining utility, addressing a critical gap in privacy-preserving machine learning.

Abstract: Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$Δ_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$Δ_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

[1070] Topology and Geometry of the Learning Space of ReLU Networks: Connectivity and Singularities

Marco Nurisso, Pierrick Leroy, Giovanni Petri, Francesco Vaccarino

Main category: cs.LG

TL;DR: The paper studies the parameter space geometry of feed-forward ReLU networks on DAG architectures, focusing on connectedness and singularities, linking them to network topology and differentiable pruning.

Details

Motivation: Understanding the geometric properties of ReLU network parameter spaces is crucial for analyzing training dynamics, as gradient flow restricts parameters to algebraic varieties due to ReLU's homogeneity.

Method: Extends previous results by characterizing connectedness through bottleneck nodes and balance conditions, analyzes singularities in relation to DAG topology and induced sub-networks, and connects reachability to differentiable pruning.

Result: Provides thorough characterization showing singularities are intricately connected to DAG topology, establishes principled connection with differentiable pruning, and validates with numerical experiments.

Conclusion: The parameter space geometry of ReLU networks is fundamentally shaped by network architecture, with singularities tied to DAG topology, offering insights for training analysis and pruning techniques.

Abstract: Understanding the properties of the parameter space in feed-forward ReLU networks is critical for effectively analyzing and guiding training dynamics. After initialization, training under gradient flow decisively restricts the parameter space to an algebraic variety that emerges from the homogeneous nature of the ReLU activation function. In this study, we examine two key challenges associated with feed-forward ReLU networks built on general directed acyclic graph (DAG) architectures: the (dis)connectedness of the parameter space and the existence of singularities within it. We extend previous results by providing a thorough characterization of connectedness, highlighting the roles of bottleneck nodes and balance conditions associated with specific subsets of the network. Our findings clearly demonstrate that singularities are intricately connected to the topology of the underlying DAG and its induced sub-networks. We discuss the reachability of these singularities and establish a principled connection with differentiable pruning. We validate our theory with simple numerical experiments.

[1071] Forecasting Energy Availability in Local Energy Communities via LSTM Federated Learning

Fabio Turazza, Marcello Pietri, Natalia Selini Hadjidimitriou, Marco Mamei

Main category: cs.LG

TL;DR: Federated Learning with LSTM networks for energy consumption forecasting in Local Energy Communities while preserving user privacy.

Details

Motivation: Local Energy Communities need accurate energy forecasting for self-sufficiency, but face privacy constraints that prevent sharing consumption data among users.

Method: Use Federated Learning (FL) with Long Short-Term Memory (LSTM) networks to create forecasting models without sharing sensitive user data.

Result: Demonstrates a viable solution for energy forecasting that balances data sharing constraints with forecasting accuracy.

Conclusion: FL with LSTM networks provides a privacy-preserving approach for energy forecasting in Local Energy Communities, addressing the trade-off between data sharing and accuracy.

Abstract: Local Energy Communities are emerging as crucial players in the landscape of sustainable development. A significant challenge for these communities is achieving self-sufficiency through effective management of the balance between energy production and consumption. To meet this challenge, it is essential to develop and implement forecasting models that deliver accurate predictions, which can then be utilized by optimization and planning algorithms. However, the application of forecasting solutions is often hindered by privacy constrains and regulations as the users participating in the Local Energy Community can be (rightfully) reluctant sharing their consumption patterns with others. In this context, the use of Federated Learning (FL) can be a viable solution as it allows to create a forecasting model without the need to share privacy sensitive information among the users. In this study, we demonstrate how FL and long short-term memory (LSTM) networks can be employed to achieve this objective, highlighting the trade-off between data sharing and forecasting accuracy.

[1072] LocalV: Exploiting Information Locality for IP-level Verilog Generation

Hanqi Lyu, Di Huang, Yaoyu Zhu, Kangcheng Liu, Bohan Dou, Chongxiao Li, Pengwei Jin, Shuyao Cheng, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

Main category: cs.LG

TL;DR: LocalV is a multi-agent framework for generating Register-Transfer Level (RTL) code from specifications, addressing challenges in handling long documents, generating long code, and debugging through hierarchical decomposition and locality-aware techniques.

Details

Motivation: RTL code generation is labor-intensive and existing LLM approaches struggle with industrial IP-level design tasks due to challenges with long documents, long code generation, and complex debugging cycles.

Method: LocalV uses a multi-agent framework with hierarchical document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided locality-aware debugging to decompose long-document to long-code problems into short-document, short-code tasks.

Result: On RealBench (IP-level Verilog generation benchmark), LocalV achieves 45.0% pass rate, substantially outperforming state-of-the-art LLMs and agents (21.6%).

Conclusion: LocalV demonstrates that leveraging information locality through multi-agent decomposition enables scalable RTL code generation for industrial hardware design tasks.

Abstract: The generation of Register-Transfer Level (RTL) code is a crucial yet labor-intensive step in digital hardware design, traditionally requiring engineers to manually translate complex specifications into thousands of lines of synthesizable Hardware Description Language (HDL) code. While Large Language Models (LLMs) have shown promise in automating this process, existing approaches-including fine-tuned domain-specific models and advanced agent-based systems-struggle to scale to industrial IP-level design tasks. We identify three key challenges: (1) handling long, highly detailed documents, where critical interface constraints become buried in unrelated submodule descriptions; (2) generating long RTL code, where both syntactic and semantic correctness degrade sharply with increasing output length; and (3) navigating the complex debugging cycles required for functional verification through simulation and waveform analysis. To overcome these challenges, we propose LocalV, a multi-agent framework that leverages information locality in modular hardware design. LocalV decomposes the long-document to long-code generation problem into a set of short-document, short-code tasks, enabling scalable generation and debugging. Specifically, LocalV integrates hierarchical document partitioning, task planning, localized code generation, interface-consistent merging, and AST-guided locality-aware debugging. Experiments on RealBench, an IP-level Verilog generation benchmark, demonstrate that LocalV substantially outperforms state-of-the-art (SOTA) LLMs and agents, achieving a pass rate of 45.0% compared to 21.6%.

[1073] Deep Time-series Forecasting Needs Kernelized Moment Balancing

Licheng Pan, Hao Wang, Haocheng Yang, Yuqi Li, Qingsong Wen, Xiaoxi Li, Zhichao Chen, Haoxuan Li, Zhixuan Chu, Yuan Lu

Main category: cs.LG

TL;DR: Kernelized Moment Balancing for Direct Forecasting (KMB-DF) improves time-series forecasting by adaptively selecting informative balancing functions from RKHS to achieve full distribution alignment between forecasts and ground truths.

Details

Motivation: Existing time-series forecasting objectives fail to achieve true distribution balance between forecasts and ground truths because they only enforce moment matching for one or two predefined balancing functions, violating Imbens' criterion for full distribution alignment.

Method: Proposes KMB-DF which adaptively selects the most informative balancing functions from a reproducing kernel Hilbert space (RKHS) to enforce sufficient distribution balancing. Derives a tractable, differentiable objective that enables efficient estimation from empirical samples and integration into gradient-based training pipelines.

Result: Extensive experiments across multiple models and datasets show that KMB-DF consistently improves forecasting accuracy and achieves state-of-the-art performance.

Conclusion: KMB-DF effectively addresses the distribution balancing problem in time-series forecasting by adaptively selecting informative balancing functions, leading to improved forecasting performance compared to existing methods.

Abstract: Deep time-series forecasting can be formulated as a distribution balancing problem aimed at aligning the distribution of the forecasts and ground truths. According to Imbens’ criterion, true distribution balance requires matching the first moments with respect to any balancing function. We demonstrate that existing objectives fail to meet this criterion, as they enforce moment matching only for one or two predefined balancing functions, thus failing to achieve full distribution balance. To address this limitation, we propose direct forecasting with kernelized moment balancing (KMB-DF). Unlike existing objectives, KMB-DF adaptively selects the most informative balancing functions from a reproducing kernel hilbert space (RKHS) to enforce sufficient distribution balancing. We derive a tractable and differentiable objective that enables efficient estimation from empirical samples and seamless integration into gradient-based training pipelines. Extensive experiments across multiple models and datasets show that KMB-DF consistently improves forecasting accuracy and achieves state-of-the-art performance. Code is available at https://anonymous.4open.science/r/KMB-DF-403C.

[1074] Federated Learning at the Forefront of Fairness: A Multifaceted Perspective

Noorain Mukhtiar, Adnan Mahmood, Yipeng Zhou, Jian Yang, Jing Teng, Quan Z. Sheng

Main category: cs.LG

TL;DR: Survey paper on fairness in federated learning, categorizing approaches and evaluation metrics for equitable model performance across heterogeneous clients.

Details

Motivation: Fairness in federated learning is becoming critical due to heterogeneous client constraints and the need for balanced model performance across different scenarios and clients.

Method: Comprehensive survey approach with classification of fairness-aware methods from two perspectives: model performance-oriented and capability-oriented. Provides framework to categorize fairness concerns and technical aspects.

Result: Systematic categorization of state-of-the-art fairness approaches in FL, examination of effectiveness in balancing equity and performance, analysis of evaluation metrics for measuring fairness quantitatively.

Conclusion: Identifies open research directions and proposes prospective solutions for advancing fairness in federated learning, providing foundation for researchers in this area.

Abstract: Fairness in Federated Learning (FL) is emerging as a critical factor driven by heterogeneous clients’ constraints and balanced model performance across various scenarios. In this survey, we delineate a comprehensive classification of the state-of-the-art fairness-aware approaches from a multifaceted perspective, i.e., model performance-oriented and capability-oriented. Moreover, we provide a framework to categorize and address various fairness concerns and associated technical aspects, examining their effectiveness in balancing equity and performance within FL frameworks. We further examine several significant evaluation metrics leveraged to measure fairness quantitatively. Finally, we explore exciting open research directions and propose prospective solutions that could drive future advancements in this important area, laying a solid foundation for researchers working toward fairness in FL.

[1075] Spectral Imbalance Causes Forgetting in Low-Rank Continual Adaptation

Hao Gu, Mao-Lin Luo, Zi-Hao Zhou, Han-Chen Zhang, Min-Ling Zhang, Tong Wei

Main category: cs.LG

TL;DR: EBLoRA: A parameter-efficient continual learning method that balances singular value spectra in low-rank adaptations to mitigate forgetting in vision-language models.

Details

Motivation: Most continual learning approaches focus on avoiding interference with past updates rather than understanding what makes task-specific updates naturally preserve previously acquired knowledge. Low-rank adaptations exhibit imbalanced singular value spectra where dominant components disrupt previous knowledge and are vulnerable to future interference.

Method: Decouple magnitude of task update from directional structure and formulate as constrained optimization on restricted Stiefel manifold. Use projected first-order method compatible with standard deep-learning optimizers for vision-language models.

Result: Method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines.

Conclusion: EBLoRA enables explicit balance among components in low-rank adaptations, improving continual learning performance for vision-language models.

Abstract: Parameter-efficient continual learning aims to adapt pre-trained models to sequential tasks without forgetting previously acquired knowledge. Most existing approaches treat continual learning as avoiding interference with past updates, rather than considering what properties make the current task-specific update naturally preserve previously acquired knowledge. From a knowledge-decomposition perspective, we observe that low-rank adaptations exhibit highly imbalanced singular value spectra: a few dominant components absorb most of the adaptation energy, thereby (i) more likely to disrupt previously acquired knowledge and (ii) making the update more vulnerable to interference from subsequent tasks. To enable explicit balance among components, we decouple the magnitude of the task update from its directional structure and formulate it as a constrained optimization problem on a restricted Stiefel manifold. We address this problem using a projected first-order method compatible with standard deep-learning optimizers used in vision-language models. Our method mitigates both backward and forward forgetting, consistently outperforming continual learning baselines. The implementation code is available at https://github.com/haodotgu/EBLoRA.

[1076] Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity

Prakhar Ganesh, Reza Shokri, Golnoosh Farnadi

Main category: cs.LG

TL;DR: Paper introduces prompt multiplicity framework to evaluate LLM hallucination consistency beyond just correctness, revealing high inconsistency rates and limitations in current detection/mitigation methods.

Details

Motivation: Current hallucination evaluation focuses only on correctness and overlooks consistency, which is necessary to properly understand and address the harms of LLM hallucinations. Existing frameworks fail to distinguish between different types of hallucination-related harms.

Method: Introduces prompt multiplicity framework for quantifying consistency in LLM evaluations. Analyzes inconsistency in benchmarks like Med-HALT, studies role of consistency in hallucination detection and mitigation techniques.

Result: Reveals significant multiplicity (over 50% inconsistency in benchmarks), shows detection techniques detect consistency rather than correctness, and finds mitigation techniques like RAG can introduce additional inconsistencies.

Conclusion: Prompt multiplicity provides improved framework for understanding hallucination harms and uncovers critical limitations in current detection and mitigation strategies. Consistency is crucial for proper hallucination evaluation.

Abstract: Large language models (LLMs) are known to “hallucinate” by generating false or misleading outputs. Hallucinations pose various harms, from erosion of trust to widespread misinformation. Existing hallucination evaluation, however, focuses only on correctness and often overlooks consistency, necessary to distinguish and address these harms. To bridge this gap, we introduce prompt multiplicity, a framework for quantifying consistency in LLM evaluations. Our analysis reveals significant multiplicity (over 50% inconsistency in benchmarks like Med-HALT), suggesting that hallucination-related harms have been severely misunderstood. Furthermore, we study the role of consistency in hallucination detection and mitigation. We find that: (a) detection techniques detect consistency, not correctness, and (b) mitigation techniques like RAG, while beneficial, can introduce additional inconsistencies. By integrating prompt multiplicity into hallucination evaluation, we provide an improved framework of potential harms and uncover critical limitations in current detection and mitigation strategies.

[1077] Pareto-Conditioned Diffusion Models for Offline Multi-Objective Optimization

Jatan Shrestha, Santeri Heiskanen, Kari Hepola, Severi Rissanen, Pekka Jääskeläinen, Joni Pajarinen

Main category: cs.LG

TL;DR: Pareto-Conditioned Diffusion (PCD) formulates offline multi-objective optimization as a conditional sampling problem using diffusion models, avoiding explicit surrogate models and enabling exploration beyond observed data.

Details

Motivation: Offline multi-objective optimization faces challenges in generalizing beyond static datasets and exploring trade-offs between competing objectives without explicit surrogate models.

Method: PCD uses conditional diffusion models to sample Pareto-optimal solutions by conditioning directly on desired trade-offs, with reweighting for high-performing samples and reference-direction mechanisms for novel region exploration.

Result: PCD achieves competitive performance on standard offline MOO benchmarks and demonstrates greater consistency across diverse tasks compared to existing approaches.

Conclusion: Conditional diffusion models provide an effective framework for offline MOO that avoids explicit surrogate modeling and enables better generalization beyond training data.

Abstract: Multi-objective optimization (MOO) arises in many real-world applications where trade-offs between competing objectives must be carefully balanced. In the offline setting, where only a static dataset is available, the main challenge is generalizing beyond observed data. We introduce Pareto-Conditioned Diffusion (PCD), a novel framework that formulates offline MOO as a conditional sampling problem. By conditioning directly on desired trade-offs, PCD avoids the need for explicit surrogate models. To effectively explore the Pareto front, PCD employs a reweighting strategy that focuses on high-performing samples and a reference-direction mechanism to guide sampling towards novel, promising regions beyond the training data. Experiments on standard offline MOO benchmarks show that PCD achieves highly competitive performance and, importantly, demonstrates greater consistency across diverse tasks than existing offline MOO approaches.

[1078] GraphNNK – Graph Classification and Interpretability

Zeljko Bolevic, Milos Brajovic, Isidora Stankovic, Ljubisa Stankovic

Main category: cs.LG

TL;DR: NNK interpolation replaces parametric classifiers in GNNs with non-negative kernel regression for better interpretability and generalization

Details

Motivation: Current GNNs rely on parametric classifiers (linear softmax layers) which limit interpretability and sometimes hinder generalization. There's a need for more interpretable and potentially better generalizing approaches for graph-structured data.

Method: Proposes using Non-Negative Kernel regression (NNK) as an interpolation-based method to replace parametric classifiers. Predictions are expressed as convex combinations of similar training examples in the embedding space.

Result: The approach yields both theoretical results and interpretable explanations by expressing predictions as convex combinations of training examples, providing better transparency into model decisions.

Conclusion: NNK interpolation offers a promising alternative to parametric classifiers in GNNs, improving interpretability while maintaining or potentially enhancing generalization performance.

Abstract: Graph Neural Networks (GNNs) have become a standard approach for learning from graph-structured data. However, their reliance on parametric classifiers (most often linear softmax layers) limits interpretability and sometimes hinders generalization. Recent work on interpolation-based methods, particularly Non-Negative Kernel regression (NNK), has demonstrated that predictions can be expressed as convex combinations of similar training examples in the embedding space, yielding both theoretical results and interpretable explanations.

[1079] BLOCK-EM: Preventing Emergent Misalignment by Blocking Causal Features

Muhammed Ustaomeroglu, Guannan Qu

Main category: cs.LG

TL;DR: The paper investigates a mechanistic approach to prevent emergent misalignment in language models during fine-tuning by identifying and blocking internal features that cause undesirable out-of-domain behaviors while maintaining target-task performance.

Details

Motivation: When language models are fine-tuned on narrow supervised objectives, they can develop emergent misalignment - learning the target behavior but also acquiring undesirable out-of-domain behaviors. The authors aim to prevent this misalignment without degrading model quality or target-task performance.

Method: The approach identifies a small set of internal features that reliably control misaligned behavior, then discourages the model from strengthening these features during fine-tuning. The method involves feature identification, blocking/constraining these features, and extensive validation across six fine-tuning domains.

Result: Blocking a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. The study includes validation with disjoint splits, multiple judges, random seeds, quality metrics, and extensive ablations showing the reduction is specific to the identified mechanism.

Conclusion: Targeted training-time constraints on internal mechanisms can effectively mitigate emergent misalignment without degrading target-task performance, though limitations exist where misalignment can re-emerge under prolonged fine-tuning through alternative pathways.

Abstract: Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.

[1080] Provable Model Provenance Set for Large Language Models

Xiaoqi Qiu, Hao Zeng, Zhiyu Hou, Hongxin Wei

Main category: cs.LG

TL;DR: MPS is a provable method for model provenance analysis that guarantees coverage of all true sources at a specified confidence level using sequential testing.

Details

Motivation: The need for reliable model provenance analysis due to unauthorized model usage and misattribution, with existing methods lacking provable error control and overlooking multiple sources.

Method: Proposes Model Provenance Set (MPS) using a sequential test-and-exclusion procedure to adaptively construct a small set that guarantees coverage of all true provenances at a prescribed confidence level.

Result: MPS effectively achieves target provenance coverage while limiting inclusion of unrelated models, demonstrating practical utility for attribution and auditing tasks.

Conclusion: MPS provides a rigorous, provable framework for model provenance analysis with practical applications in model attribution and auditing.

Abstract: The growing prevalence of unauthorized model usage and misattribution has increased the need for reliable model provenance analysis. However, existing methods largely rely on heuristic fingerprint-matching rules that lack provable error control and often overlook the existence of multiple sources, leaving the reliability of their provenance claims unverified. In this work, we first formalize the model provenance problem with provable guarantees, requiring rigorous coverage of all true provenances at a prescribed confidence level. Then, we propose the Model Provenance Set (MPS), which employs a sequential test-and-exclusion procedure to adaptively construct a small set satisfying the guarantee. The key idea of MPS is to test the significance of provenance existence within a candidate pool, thereby establishing a provable asymptotic guarantee at a user-specific confidence level. Extensive experiments demonstrate that MPS effectively achieves target provenance coverage while strictly limiting the inclusion of unrelated models, and further reveal its potential for practical provenance analysis in attribution and auditing tasks.

[1081] A novel VAE-DML fusion framework for casual analysis of greenwashing in the mining industry

Yuxin Lu, Zhen Peng, Xiqiang Xia, Jie Wang

Main category: cs.LG

TL;DR: Study examines how equity balance in mining industry chain enterprises inhibits greenwashing behavior using VAE and DML models to establish causal relationships.

Details

Motivation: Mining enterprises are crucial for resource consumption and environmental impact in the context of global green transition and "dual carbon" goals. Ensuring authentic environmental disclosure is essential for sustainable development and national strategic objectives.

Method: Innovatively employs Variational Autoencoder (VAE) and Double Machine Learning (DML) model to construct counterfactual scenarios, mitigating endogeneity concerns and identifying causal relationships between equity balance and greenwashing.

Result: 1) Significant negative causal relationship between equity balance and corporate greenwashing; 2) Heterogeneous effects across regions, industrial segments, and environmental sensitivity; 3) Temporal dynamics with strongest current impact, diminishing lagged effect, and stable long-term influence; 4) Three mechanisms: alleviating management pressure, enhancing executive team stability, and intensifying media scrutiny.

Conclusion: Equity balance serves as an effective governance mechanism to curb greenwashing in mining industry chain enterprises through multiple pathways, with implications for sustainable development and environmental policy.

Abstract: Against the backdrop of the global green transition and “dual carbon” goals, mining industry chain enterprises are pivotal entities in terms of resource consumption and environmental impact. Their environmental performance directly affects regional ecological security and is closely tied to national resource strategies and green transformation outcomes. Ensuring the authenticity and reliability of their environmental disclosure is thus a core and urgent issue for sustainable development and national strategic objectives.From a corporate governance perspective, this study examines equity balance as a fundamental governance mechanism, investigating its inhibitory effect on greenwashing behavior among these enterprises and the underlying pathways involved. Methodologically, the paper innovatively employs a Variational Autoencoder (VAE) and a Double Machine Learning (DML) model to construct counterfactual scenarios, mitigating endogeneity concerns and precisely identifying the causal relationship between equity balance and greenwashing. The findings indicate, first, a significant negative causal relationship between equity balance and corporate greenwashing, confirming its substantive governance effect. Second, this inhibitory effect exhibits notable heterogeneity, manifesting more strongly in western regions, upstream segments of the industrial chain, and industries with high environmental sensitivity. Third, the governance effect demonstrates clear temporal dynamics, with the strongest impact occurring in the current period, followed by a diminishing yet statistically significant lagged effect, and ultimately a stable long-term cumulative influence. Finally, mechanism analysis reveals that equity balance operates through three distinct channels to curb greenwashing: alleviating management performance pressure, enhancing the stability of the executive team, and intensifying media scrutiny.

[1082] Stable Time Series Prediction of Enterprise Carbon Emissions Based on Causal Inference

Zitao Hong, Zhen Peng, Xueping Liu

Main category: cs.LG

TL;DR: A stable temporal prediction framework for enterprise carbon emissions using causal inference and stable learning to handle distribution shifts across regions, industries, and time.

Details

Motivation: Accurate prediction of enterprise carbon emissions is crucial for energy optimization and low-carbon transformation, but significant heterogeneity across regions, industries, and enterprises causes distribution shifts and non-stationarity in carbon emission data, compromising prediction accuracy and decision-making value.

Method: Integrates causal inference with stable learning and time-series modeling, incorporating enterprise-level energy inputs, capital investment, labor deployment, carbon pricing, and policy factors. Uses risk consistency-constrained stable learning to extract causal stable features from multi-environment samples, with adaptive normalization and sample reweighting to handle temporal non-stationarity.

Result: The approach enhances model generalization capability and explainability in complex environments by dynamically rectifying temporal non-stationarity induced by economic fluctuations and policy transitions.

Conclusion: Proposes a stable temporal prediction mechanism that addresses distribution shift challenges in carbon emission forecasting, improving accuracy for production planning and carbon quota trading decisions.

Abstract: Against the backdrop of ongoing carbon peaking and carbon neutrality goals, accurate prediction of enterprise carbon emission trends constitutes an essential foundation for energy structure optimization and low-carbon transformation decision-making. Nevertheless, significant heterogeneity persists across regions, industries and individual enterprises regarding energy structure, production scale, policy intensity and governance efficacy, resulting in pronounced distribution shifts and non-stationarity in carbon emission data across both temporal and spatial dimensions. Such cross-regional and cross-enterprise data drift not only compromises the accuracy of carbon emission reporting but substantially undermines the guidance value of predictive models for production planning and carbon quota trading decisions. To address this critical challenge, we integrate causal inference perspectives with stable learning methodologies and time-series modelling, proposing a stable temporal prediction mechanism tailored to distribution shift environments. This mechanism incorporates enterprise-level energy inputs, capital investment, labour deployment, carbon pricing, governmental interventions and policy implementation intensity, constructing a risk consistency-constrained stable learning framework that extracts causal stable features (robust against external perturbations yet demonstrating long-term stable effects on carbon dioxide emissions) from multi-environment samples across diverse policies, regions and industrial sectors. Furthermore, through adaptive normalization and sample reweighting strategies, the approach dynamically rectifies temporal non-stationarity induced by economic fluctuations and policy transitions, ultimately enhancing model generalization capability and explainability in complex environments.

[1083] Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

Jiamin Xu, Kyra Gan

Main category: cs.LG

TL;DR: A novel RL algorithm using K-step lookahead Q-functions with thresholding for non-episodic finite-horizon MDPs, achieving improved sample efficiency and theoretical guarantees.

Details

Motivation: Online reinforcement learning in non-episodic, finite-horizon MDPs is challenging due to the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods using discounted contraction don't naturally handle fixed-horizon structure.

Method: Introduces a modified Q-function that learns K-step lookahead (truncates planning to next K steps) with thresholding mechanism: actions selected only when estimated K-step lookahead value exceeds time-varying threshold. Provides efficient tabular learning algorithm with adaptive K increase over time.

Result: Achieves minimax optimal constant regret for K=1 and O(max((K-1),C_{K-1})√(SAT log(T))) regret for K≥2. Empirical evaluation shows superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments (JumpRiverswim, FrozenLake, AnyTrading).

Conclusion: The K-step lookahead approach with thresholding provides an effective solution for non-episodic finite-horizon MDPs, balancing lookahead depth against estimation variance and achieving strong theoretical and empirical performance.

Abstract: Online reinforcement learning in non-episodic, finite-horizon MDPs remains underexplored and is challenged by the need to estimate returns to a fixed terminal time. Existing infinite-horizon methods, which often rely on discounted contraction, do not naturally account for this fixed-horizon structure. We introduce a modified Q-function: rather than targeting the full-horizon, we learn a K-step lookahead Q-function that truncates planning to the next K steps. To further improve sample efficiency, we introduce a thresholding mechanism: actions are selected only when their estimated K-step lookahead value exceeds a time-varying threshold. We provide an efficient tabular learning algorithm for this novel objective, proving it achieves fast finite-sample convergence: it achieves minimax optimal constant regret for $K=1$ and $\mathcal{O}(\max((K-1),C_{K-1})\sqrt{SAT\log(T)})$ regret for any $K \geq 2$. We numerically evaluate the performance of our algorithm under the objective of maximizing reward. Our implementation adaptively increases K over time, balancing lookahead depth against estimation variance. Empirical results demonstrate superior cumulative rewards over state-of-the-art tabular RL methods across synthetic MDPs and RL environments: JumpRiverswim, FrozenLake and AnyTrading.

[1084] Multi-Objective Multi-Fidelity Bayesian Optimization with Causal Priors

Md Abir Hossen, Mohammad Ali Javidian, Vignesh Narayanan, Jason M. O’Kane, Pooyan Jamshidi

Main category: cs.LG

TL;DR: RESCUE is a multi-fidelity Bayesian optimization method that incorporates causal modeling to improve sample efficiency when low-fidelity proxies are poorly aligned with target fidelity.

Details

Motivation: Existing multi-fidelity Bayesian optimization methods rely on associational dependencies between inputs, fidelities, and objectives, which can perform poorly when lower-fidelity proxies are misaligned with the target fidelity. There's a need for methods that capture causal mechanisms rather than just correlations.

Method: RESCUE learns a structural causal model capturing causal relationships between inputs, fidelities, and objectives, then constructs a probabilistic multi-fidelity surrogate encoding intervention effects. It uses a causal hypervolume knowledge-gradient acquisition strategy to select input-fidelity pairs balancing expected multi-objective improvement and cost.

Result: RESCUE improves sample efficiency over state-of-the-art multi-fidelity optimization methods on synthetic and real-world problems in robotics, machine learning (AutoML), and healthcare.

Conclusion: Incorporating causal calculus into multi-fidelity Bayesian optimization addresses limitations of existing methods and improves performance when low-fidelity proxies are poorly aligned with target fidelity.

Abstract: Multi-fidelity Bayesian optimization (MFBO) accelerates the search for the global optimum of black-box functions by integrating inexpensive, low-fidelity approximations. The central task of an MFBO policy is to balance the cost-efficiency of low-fidelity proxies against their reduced accuracy to ensure effective progression toward the high-fidelity optimum. Existing MFBO methods primarily capture associational dependencies between inputs, fidelities, and objectives, rather than causal mechanisms, and can perform poorly when lower-fidelity proxies are poorly aligned with the target fidelity. We propose RESCUE (REducing Sampling cost with Causal Understanding and Estimation), a multi-objective MFBO method that incorporates causal calculus to systematically address this challenge. RESCUE learns a structural causal model capturing causal relationships between inputs, fidelities, and objectives, and uses it to construct a probabilistic multi-fidelity (MF) surrogate that encodes intervention effects. Exploiting the causal structure, we introduce a causal hypervolume knowledge-gradient acquisition strategy to select input-fidelity pairs that balance expected multi-objective improvement and cost. We show that RESCUE improves sample efficiency over state-of-the-art MF optimization methods on synthetic and real-world problems in robotics, machine learning (AutoML), and healthcare.

[1085] Sporadic Gradient Tracking over Directed Graphs: A Theoretical Perspective on Decentralized Federated Learning

Shahryar Zehtabi, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher Brinton

Main category: cs.LG

TL;DR: Spod-GT is a decentralized federated learning algorithm that combines gradient tracking for data heterogeneity with sporadic client participation to handle resource diversity over directed communication graphs.

Details

Motivation: Decentralized Federated Learning (DFL) faces challenges with data heterogeneity across clients and diverse resource availability. Previous work addressed these issues separately - gradient tracking for data heterogeneity and sporadic participation for resource constraints - but no unified solution exists for general directed graphs.

Method: Proposes Spod-GT algorithm that allows client-specific gradient computation frequencies and heterogeneous/asymmetric communication frequencies over directed graphs. Uses gradient tracking techniques while accommodating intermittent client participation with relaxed assumptions on gradient estimation variance and gradient diversity.

Result: Provides rigorous convergence analysis with consensus and optimality guarantees for gradient tracking over directed graphs despite intermittent participation. Numerical experiments on image classification datasets demonstrate efficacy compared to gradient tracking baselines.

Conclusion: Spod-GT successfully unifies gradient tracking for data heterogeneity with sporadic client participation for resource diversity, offering a comprehensive DFL solution for practical scenarios with heterogeneous clients and communication constraints.

Abstract: Decentralized Federated Learning (DFL) enables clients with local data to collaborate in a peer-to-peer manner to train a generalized model. In this paper, we unify two branches of work that have separately solved important challenges in DFL: (i) gradient tracking techniques for mitigating data heterogeneity and (ii) accounting for diverse availability of resources across clients. We propose $\textit{Sporadic Gradient Tracking}$ ($\texttt{Spod-GT}$), the first DFL algorithm that incorporates these factors over general directed graphs by allowing (i) client-specific gradient computation frequencies and (ii) heterogeneous and asymmetric communication frequencies. We conduct a rigorous convergence analysis of our methodology with relaxed assumptions on gradient estimation variance and gradient diversity of clients, providing consensus and optimality guarantees for GT over directed graphs despite intermittent client participation. Through numerical experiments on image classification datasets, we demonstrate the efficacy of $\texttt{Spod-GT}$ compared to well-known GT baselines.

[1086] Latent Shadows: The Gaussian-Discrete Duality in Masked Diffusion

Guinan Chen, Xunpeng Huang, Ying Sun, Shijin Wang, Yanyong Zhang, Chao Wang

Main category: cs.LG

TL;DR: Masked Consistency Distillation (MCD) enables deterministic sampling for masked discrete diffusion models via explicit duality with continuous Gaussian processes, achieving 16x inference speedup without quality loss.

Details

Motivation: Masked discrete diffusion models produce high-quality language generation but suffer from slow inference due to lack of deterministic sampling tools. Existing deterministic distillation methods underperform masked models, while masked domain methods rely on stochastic distillation, creating an efficiency-quality trade-off.

Method: Establishes explicit Masked Diffusion Duality showing masked processes arise from continuous Gaussian processes via maximum-value index preservation. Introduces Masked Consistency Distillation (MCD) that leverages this duality to analytically construct deterministic coupled trajectories for consistency distillation, bypassing numerical ODE solvers.

Result: Achieves 16x inference speedup compared to prior stochastic distillation methods without compromising generation quality. Provides theoretical foundation connecting masked and continuous diffusion.

Conclusion: MCD unlocks full potential of consistency distillation for high-performance discrete generation by enabling deterministic sampling in masked diffusion models, bridging theoretical gap between masked and continuous diffusion paradigms.

Abstract: Masked discrete diffusion is a dominant paradigm for high-quality language modeling where tokens are iteratively corrupted to a mask state, yet its inference efficiency is bottlenecked by the lack of deterministic sampling tools. While diffusion duality enables deterministic distillation for uniform models, these approaches generally underperform masked models and rely on complex integral operators. Conversely, in the masked domain, prior methods typically assume the absence of deterministic trajectories, forcing a reliance on stochastic distillation. To bridge this gap, we establish explicit Masked Diffusion Duality, proving that the masked process arises as the projection of a continuous Gaussian process via a novel maximum-value index preservation mechanism. Furthermore, we introduce Masked Consistency Distillation (MCD), a principled framework that leverages this duality to analytically construct the deterministic coupled trajectories required for consistency distillation, bypassing numerical ODE solvers. This result strictly improves upon prior stochastic distillation methods, achieving a 16$\times$ inference speedup without compromising generation quality. Our findings not only provide a solid theoretical foundation connecting masked and continuous diffusion, but also unlock the full potential of consistency distillation for high-performance discrete generation. Our code is available at https://anonymous.4open.science/r/MCD-70FD.

[1087] JTok: On Token Embedding as another Axis of Scaling Law via Joint Token Self-modulation

Yebin Yang, Huaijin Wu, Fu Guo, Lin Yao, Xiaohan Qin, Jingzhi Wang, Debing Zhang, Junchi Yan

Main category: cs.LG

TL;DR: Token-indexed parameters (JTok/JTok-M) as a new scaling axis that decouples model capacity from FLOPs by using auxiliary embedding tables to modulate Transformer layers with minimal computational overhead.

Details

Motivation: Traditional LLM scaling couples performance with computational cost increases. While MoE decouples capacity from compute, it introduces memory overhead and hardware efficiency challenges. The paper aims to find a new scaling dimension that avoids these issues.

Method: Proposes Joint-Token (JTok) and Mixture of Joint-Token (JTok-M) which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight element-wise operations with negligible FLOPs overhead.

Result: Experiments from 650M to 61B parameters show consistent validation loss reduction and significant downstream task improvements (+4.1 MMLU, +8.3 ARC, +8.9 CEval). JTok-M achieves comparable quality with 35% less compute than vanilla MoE, showing predictable power-law scaling behavior.

Conclusion: Token-indexed parameters represent a novel scaling axis that fundamentally shifts the quality-compute Pareto frontier, offering efficient capacity scaling without proportional FLOPs increases.

Abstract: LLMs have traditionally scaled along dense dimensions, where performance is coupled with near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouple model capacity from FLOPs. Specifically, we introduce Joint-Token (JTok) and Mixture of Joint-Token (JTok-M), which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning from 650M (190M + 460M embedding) to 61B (17B + 44B embedding) total parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +4.1 on MMLU, +8.3 on ARC, +8.9 on CEval). Rigorous isoFLOPs analysis further confirms that JTok-M fundamentally shifts the quality-compute Pareto frontier, achieving comparable model quality with 35% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit a predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by JTok and JTok-M remains marginal.

[1088] Mobile Exergames: Activity Recognition Based on Smartphone Sensors

David Craveiro, Hugo Silva

Main category: cs.LG

TL;DR: A 2D endless game called Duck Catch & Fit that uses smartphone sensors (accelerometer, gyroscope, magnetometer) for human activity recognition combined with voice recognition for gameplay.

Details

Motivation: To create an immersive gaming experience by integrating detailed human activity recognition using smartphone sensors with voice recognition, demonstrating practical applications of sensor-based activity detection in gaming.

Method: Developed a proof-of-concept game that extracts features from smartphone sensors (accelerometer, gyroscope, magnetometer) and applies machine learning to detect activities like staying, side movements, and fake side movements. Combined with voice recognition system to detect the word “fire”.

Result: The system successfully recognizes human activities with high accuracy using machine learning techniques. The combination of movement-based and voice-based integrations creates more immersive gameplay.

Conclusion: Smartphone sensors combined with machine learning can effectively recognize human activities for gaming applications, and integrating multiple modalities (movement + voice) enhances gameplay immersion.

Abstract: Smartphone sensors can be extremely useful in providing information on the activities and behaviors of persons. Human activity recognition is increasingly used for games, medical, or surveillance. In this paper, we propose a proof-of-concept 2D endless game called Duck Catch & Fit, which implements a detailed activity recognition system that uses a smartphone accelerometer, gyroscope, and magnetometer sensors. The system applies feature extraction and learning mechanism to detect human activities like staying, side movements, and fake side movements. In addition, a voice recognition system is combined to recognize the word “fire” and raise the game’s complexity. The results show that it is possible to use machine learning techniques to recognize human activity with high recognition levels. Also, the combination of movement-based and voice-based integrations contributes to a more immersive gameplay.

[1089] Over-Alignment vs Over-Fitting: The Role of Feature Learning Strength in Generalization

Taesun Yeom, Taehyeok Ha, Jaeho Lee

Main category: cs.LG

TL;DR: The paper investigates how feature learning strength (FLS) affects generalization in deep networks, finding an optimal FLS that balances over-alignment and over-fitting, contrary to intuition that stronger feature learning always improves generalization.

Details

Motivation: Existing theory on feature learning strength (FLS) focuses on asymptotic regimes but offers limited insight into how FLS affects generalization in practical settings where training stops upon reaching target training risk. The paper aims to understand this practical impact.

Method: Combines empirical studies on deep networks with theoretical analysis of gradient flow dynamics in two-layer ReLU networks trained with logistic loss, where FLS is controlled via initialization scale.

Result: Discovers emergence of an optimal FLS that yields substantial generalization gains, explained by a trade-off: excessive FLS causes over-alignment that degrades generalization, while overly small FLS leads to over-fitting.

Conclusion: There exists an optimal feature learning strength that balances competing effects, challenging the prevailing intuition that stronger feature learning universally improves generalization in practical training scenarios.

Abstract: Feature learning strength (FLS), i.e., the inverse of the effective output scaling of a model, plays a critical role in shaping the optimization dynamics of neural nets. While its impact has been extensively studied under the asymptotic regimes – both in training time and FLS – existing theory offers limited insight into how FLS affects generalization in practical settings, such as when training is stopped upon reaching a target training risk. In this work, we investigate the impact of FLS on generalization in deep networks under such practical conditions. Through empirical studies, we first uncover the emergence of an $\textit{optimal FLS}$ – neither too small nor too large – that yields substantial generalization gains. This finding runs counter to the prevailing intuition that stronger feature learning universally improves generalization. To explain this phenomenon, we develop a theoretical analysis of gradient flow dynamics in two-layer ReLU nets trained with logistic loss, where FLS is controlled via initialization scale. Our main theoretical result establishes the existence of an optimal FLS arising from a trade-off between two competing effects: An excessively large FLS induces an $\textit{over-alignment}$ phenomenon that degrades generalization, while an overly small FLS leads to $\textit{over-fitting}$.

[1090] Don’t Forget Its Variance! The Minimum Path Variance Principle for Accurate and Stable Score-Based Density Ratio Estimation

Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang, John Paisley, Delu Zeng

Main category: cs.LG

TL;DR: MinPV Principle addresses path variance in score-based density ratio estimation, proposing a closed-form solution and Kumaraswamy Mixture Model parameterization for stable, accurate estimators.

Details

Motivation: Score-based methods for density ratio estimation suffer from a paradox: while theoretically path-independent, their practical performance heavily depends on the chosen path schedule. This inconsistency between theory and practice needs resolution.

Method: Proposes MinPV (Minimum Path Variance) Principle to minimize the overlooked path variance term. Derives a closed-form expression for the variance and parameterizes the path with a flexible Kumaraswamy Mixture Model to learn data-adaptive, low-variance paths without heuristic selection.

Result: The method yields more accurate and stable density ratio estimators, establishing new state-of-the-art results on challenging benchmarks.

Conclusion: The MinPV Principle resolves the path variance paradox in score-based density ratio estimation, providing a principled optimization approach that bridges the gap between theoretical path-independence and practical performance dependence.

Abstract: Score-based methods have emerged as a powerful framework for density ratio estimation (DRE), but they face an important paradox in that, while theoretically path-independent, their practical performance depends critically on the chosen path schedule. We resolve this issue by proving that tractable training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the time score. To address this, we propose MinPV (\textbf{Min}imum \textbf{P}ath \textbf{V}ariance) Principle, which introduces a principled heuristic to minimize the overlooked path variance. Our key contribution is the derivation of a closed-form expression for the variance, turning an intractable problem into a tractable optimization. By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns a data-adaptive, low-variance path without heuristic selection. This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks.

[1091] RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

Yuhao Huang, Shih-Hsin Wang, Andrea L. Bertozzi, Bao Wang

Main category: cs.LG

TL;DR: RMFlow improves 1-NFE MeanFlow generation by adding noise-injection refinement and a new loss function for better multimodal generation across images, molecules, and time-series.

Details

Motivation: MeanFlow's single-function evaluation (1-NFE) generation often produces suboptimal results despite its efficiency. The authors aim to enhance 1-NFE generation quality while maintaining computational efficiency.

Method: RMFlow integrates coarse 1-NFE MeanFlow transport with tailored noise-injection refinement. It uses a neural network to approximate average velocity of flow paths, trained with a novel loss function balancing Wasserstein distance minimization and sample likelihood maximization.

Result: RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, with computational cost comparable to baseline MeanFlows.

Conclusion: RMFlow successfully addresses MeanFlow’s quality limitations in 1-NFE generation, providing efficient high-quality multimodal generation across different domains.

Abstract: Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.

[1092] Investigating the Robustness of Subtask Distillation under Spurious Correlation

Pattarawat Chormai, Klaus-Robert Müller, Grégoire Montavon

Main category: cs.LG

TL;DR: Study evaluates distillation methods’ robustness to spurious correlations in data, finding advanced methods like SubDistill remain stable while baselines degrade significantly.

Details

Motivation: Knowledge distillation often relies on limited datasets that may contain spurious correlations, raising concerns about the robustness of distilled models in real-world applications with imperfect data.

Method: Evaluates established distillation methods and the recent SubDistill method using data with varying strengths of spurious correlations, measuring performance degradation as correlation strength increases.

Result: Advanced methods like SubDistill remain fairly robust to spurious correlations, while baseline methods degrade to near-random performance as correlation strength increases, showing a widening performance gap.

Conclusion: Knowledge distillation faces significant challenges with imperfect real-world datasets containing spurious correlations, highlighting the need for robust distillation methods that can handle such data limitations.

Abstract: Subtask distillation is an emerging paradigm in which compact, specialized models are extracted from large, general-purpose ‘foundation models’ for deployment in environments with limited resources or in standalone computer systems. Although distillation uses a teacher model, it still relies on a dataset that is often limited in size and may lack representativeness or exhibit spurious correlations. In this paper, we evaluate established distillation methods, as well as the recent SubDistill method, when using data with spurious correlations for distillation. As the strength of the correlations increases, we observe a widening gap between advanced methods, such as SubDistill, which remain fairly robust, and some baseline methods, which degrade to near-random performance. Overall, our study underscores the challenges of knowledge distillation when applied to imperfect, real-world datasets, particularly those with spurious correlations.

[1093] Towards Multiscale Graph-based Protein Learning with Geometric Secondary Structural Motifs

Shih-Hsin Wang, Yuhao Huang, Taos Transue, Justin Baker, Jonathan Forstater, Thomas Strohmer, Bao Wang

Main category: cs.LG

TL;DR: Proposes a hierarchical graph neural network framework for protein structure learning that uses fine-grained subgraphs for secondary structure motifs and coarse-grained graphs for inter-motif relationships, improving accuracy and efficiency.

Details

Motivation: Existing GNN-based methods for protein structure learning struggle with multiscale representations and long-range dependencies. Proteins have hierarchical organization (secondary structures forming tertiary structures) that current approaches don't efficiently capture.

Method: 1) Constructs hierarchical graph representation: fine-grained subgraphs for secondary structure motifs (α-helices, β-strands, loops) and a coarse-grained graph connecting motifs based on spatial arrangement. 2) Uses two GNNs: one for local interactions within motifs, another for higher-level structural relationships across motifs. Modular design allows flexible GNN choice.

Result: Theoretically preserves maximal expressiveness without losing structural information. Empirically improves prediction accuracy and reduces computational cost across various benchmarks when integrating baseline GNNs into the multiscale framework.

Conclusion: The hierarchical multiscale graph framework effectively captures protein structural hierarchies, addressing limitations of existing GNN methods while maintaining theoretical guarantees and practical efficiency.

Abstract: Graph neural networks (GNNs) have emerged as powerful tools for learning protein structures by capturing spatial relationships at the residue level. However, existing GNN-based methods often face challenges in learning multiscale representations and modeling long-range dependencies efficiently. In this work, we propose an efficient multiscale graph-based learning framework tailored to proteins. Our proposed framework contains two crucial components: (1) It constructs a hierarchical graph representation comprising a collection of fine-grained subgraphs, each corresponding to a secondary structure motif (e.g., $α$-helices, $β$-strands, loops), and a single coarse-grained graph that connects these motifs based on their spatial arrangement and relative orientation. (2) It employs two GNNs for feature learning: the first operates within individual secondary motifs to capture local interactions, and the second models higher-level structural relationships across motifs. Our modular framework allows a flexible choice of GNN in each stage. Theoretically, we show that our hierarchical framework preserves the desired maximal expressiveness, ensuring no loss of critical structural information. Empirically, we demonstrate that integrating baseline GNNs into our multiscale framework remarkably improves prediction accuracy and reduces computational cost across various benchmarks.

[1094] Improving Flow Matching by Aligning Flow Divergence

Yuhao Huang, Taos Transue, Shih-Hsin Wang, William Feldman, Hong Zhang, Bao Wang

Main category: cs.LG

TL;DR: The paper introduces a new training objective for flow-based generative models that simultaneously matches both the flow and its divergence, improving accuracy over standard conditional flow matching.

Details

Motivation: Conditional flow matching (CFM) is efficient for training flow-based generative models but insufficient for ensuring accurate learning of probability paths, leading to performance limitations.

Method: Develops a PDE characterization of error between learned and exact probability paths, shows total variation gap is bounded by CFM loss plus divergence loss, and designs new objective that matches both flow and divergence.

Result: The new approach improves performance of flow-based generative models without sacrificing generation efficiency, demonstrated on benchmark tasks including dynamical systems, DNA sequences, and videos.

Conclusion: Simultaneous flow and divergence matching provides better theoretical guarantees and empirical performance than standard conditional flow matching for flow-based generative models.

Abstract: Conditional flow matching (CFM) stands out as an efficient, simulation-free approach for training flow-based generative models, achieving remarkable performance for data generation. However, CFM is insufficient to ensure accuracy in learning probability paths. In this paper, we introduce a new partial differential equation characterization for the error between the learned and exact probability paths, along with its solution. We show that the total variation gap between the two probability paths is bounded above by a combination of the CFM loss and an associated divergence loss. This theoretical insight leads to the design of a new objective function that simultaneously matches the flow and its divergence. Our new approach improves the performance of the flow-based generative model by a noticeable margin without sacrificing generation efficiency. We showcase the advantages of this enhanced training approach over CFM on several important benchmark tasks, including generative modeling for dynamical systems, DNA sequences, and videos. Code is available at \href{https://github.com/Utah-Math-Data-Science/Flow_Div_Matching}{Utah-Math-Data-Science}.

[1095] Learning Heat-based Equations in Self-similar variables

Shihao Wang, Qipeng Qian, Jingquan Wang

Main category: cs.LG

TL;DR: SSV training framework improves neural operator learning for heat-based equations by using self-similar coordinates for better long-term dynamics prediction

Details

Motivation: To improve neural operator learning for heat-based equations by leveraging mathematical structure through self-similar variables for better long-term dynamics prediction

Method: Developed SSV training framework compatible with standard neural operator training, tested on 2D incompressible Navier-Stokes and 1D viscous Burgers equations using MLPs and factorized fully connected networks

Result: SSV-trained networks consistently deliver substantially more accurate and stable extrapolation beyond training window and better capture qualitative long-time trends across both systems and architectures

Conclusion: Self-similar coordinates provide mathematically motivated inductive bias for learning long-time dynamics of heat-based equations

Abstract: We study solution learning for heat-based equations in self-similar variables (SSV). We develop an SSV training framework compatible with standard neural-operator training. We instantiate this framework on the two-dimensional incompressible Navier-Stokes equations and the one-dimensional viscous Burgers equation, and perform controlled comparisons between models trained in physical coordinates and in the corresponding self-similar coordinates using two simple fully connected architectures (standard multilayer perceptrons and a factorized fully connected network). Across both systems and both architectures, SSV-trained networks consistently deliver substantially more accurate and stable extrapolation beyond the training window and better capture qualitative long-time trends. These results suggest that self-similar coordinates provide a mathematically motivated inductive bias for learning the long-time dynamics of heat-based equations.

[1096] Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)

Lucas Lange, Maja Schneider, Peter Christen, Erhard Rahm

Main category: cs.LG

TL;DR: This paper explores differential privacy (DP) for COVID-19 image classification models, evaluating privacy-utility trade-offs and testing DP’s effectiveness against membership inference attacks (MIAs) in practice.

Details

Motivation: To address privacy concerns in COVID-19 medical imaging while enabling machine learning analysis, previous works have limitations including small datasets, unclear privacy guarantees, and lack of investigation into practical privacy effectiveness against real attacks.

Method: The authors implement differentially private ML models for COVID-19 classification, account for class imbalances, evaluate utility-privacy trade-offs over strict privacy budgets, and empirically test practical privacy through black-box membership inference attacks (MIAs).

Result: Results show that needed privacy levels differ based on task-dependent MIA threats, DP’s impact on practical MIA defense is limited (only marginal improvement in empirical privacy leakage with increasing DP guarantees), and better utility-privacy trade-offs are possible.

Conclusion: Empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy, and DP’s practical effectiveness against MIAs in COVID-19 classification appears limited despite theoretical guarantees.

Abstract: Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practical privacy. We suggest improvements to address these open gaps. We account for inherent class imbalances and evaluate the utility-privacy trade-off more extensively and over stricter privacy budgets. Our evaluation is supported by empirically estimating practical privacy through black-box Membership Inference Attacks (MIAs). The introduced DP should help limit leakage threats posed by MIAs, and our practical analysis is the first to test this hypothesis on the COVID-19 classification task. Our results indicate that needed privacy levels might differ based on the task-dependent practical threat from MIAs. The results further suggest that with increasing DP guarantees, empirical privacy leakage only improves marginally, and DP therefore appears to have a limited impact on practical MIA defense. Our findings identify possibilities for better utility-privacy trade-offs, and we believe that empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy.

Hao Mark Chen, Zhiwen Mo, Royson Lee, Qianzhou Wang, Da Li, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan

Main category: cs.LG

TL;DR: Dynamic Expert Sharing (DES) addresses expert explosion in MoE diffusion LLMs by enabling sequence-level expert reuse through coreset selection, reducing memory overhead while maintaining accuracy.

Details

Motivation: MoE architectures in diffusion LLMs suffer from expert explosion during parallel decoding - as tokens increase, distinct expert activations grow linearly, causing memory bottlenecks that negate efficiency gains of both MoE and parallel decoding.

Method: Proposes Dynamic Expert Sharing (DES) with two strategies: 1) DES-Seq adapts optimal allocation to sequence level, 2) DES-Vote uses saliency-aware voting where tokens collectively elect a coreset based on aggregated router weights.

Result: DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from parallelism degree.

Conclusion: DES successfully addresses expert explosion in MoE dLLMs through sequence-level coreset selection, enabling efficient parallel decoding without sacrificing quality.

Abstract: Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.

[1098] Generating Synthetic Health Sensor Data for Privacy-Preserving Wearable Stress Detection

Lucas Lange, Nils Wenzlitschke, Erhard Rahm

Main category: cs.LG

TL;DR: Privacy-aware synthesis of smartwatch health sensor data using GANs with Differential Privacy for stress detection applications

Details

Motivation: Smartwatch health sensor data contains sensitive personal information and is resource-intensive to acquire for research, creating a need for privacy-preserving synthetic data generation methods.

Method: Employ Generative Adversarial Networks (GANs) with Differential Privacy (DP) safeguards to generate synthetic multi-sensor smartwatch health readings related to stress moments, with testing of multiple GANs and data enhancement strategies.

Result: GAN-based augmentation methods significantly improve stress detection model performance: private DP training scenarios show 11.90-15.48% F1-score increase, while non-private training still achieves 0.45% boost. Synthetic data quality is confirmed but impacted by stronger privacy requirements.

Conclusion: Differentially private synthetic data effectively optimizes utility-privacy trade-offs for smartwatch health applications, especially when real training samples are limited, though increased privacy requirements impact data quality.

Abstract: Smartwatch health sensor data are increasingly utilized in smart health applications and patient monitoring, including stress detection. However, such medical data often comprise sensitive personal information and are resource-intensive to acquire for research purposes. In response to this challenge, we introduce the privacy-aware synthetization of multi-sensor smartwatch health readings related to moments of stress, employing Generative Adversarial Networks (GANs) and Differential Privacy (DP) safeguards. Our method not only protects patient information but also enhances data availability for research. To ensure its usefulness, we test synthetic data from multiple GANs and employ different data enhancement strategies on an actual stress detection task. Our GAN-based augmentation methods demonstrate significant improvements in model performance, with private DP training scenarios observing an 11.90-15.48% increase in F1-score, while non-private training scenarios still see a 0.45% boost. These results underline the potential of differentially private synthetic data in optimizing utility-privacy trade-offs, especially with the limited availability of real training samples. Through rigorous quality assessments, we confirm the integrity and plausibility of our synthetic data, which, however, are significantly impacted when increasing privacy requirements.

[1099] Test-time Generalization for Physics through Neural Operator Splitting

Louis Serrano, Jiequn Han, Edouard Oyallon, Shirley Ho, Rudy Morel

Main category: cs.LG

TL;DR: Neural operator splitting method for zero-shot generalization to unseen PDE dynamics by composing training operators at test time without weight updates.

Details

Motivation: Neural operators struggle to generalize to out-of-distribution test inputs like novel initial conditions, unseen PDE coefficients, or new physics. Existing methods require fine-tuning with examples from new dynamics, lacking true zero-shot generalization.

Method: Proposes neural operator splitting strategy that searches over compositions of training operators at test time to approximate unseen dynamics. Builds on DISCO’s dictionary of neural operators trained across different dynamics, enabling test-time computation without modifying pretrained weights.

Result: Achieves state-of-the-art zero-shot generalization on challenging out-of-distribution tasks including parameter extrapolation and novel combinations of physics phenomena. Can recover underlying PDE parameters.

Conclusion: Test-time computation is a key avenue for building flexible, compositional, and generalizable neural operators, enabling zero-shot adaptation to unseen dynamics.

Abstract: Neural operators have shown promise in learning solution maps of partial differential equations (PDEs), but they often struggle to generalize when test inputs lie outside the training distribution, such as novel initial conditions, unseen PDE coefficients or unseen physics. Prior works address this limitation with large-scale multiple physics pretraining followed by fine-tuning, but this still requires examples from the new dynamics, falling short of true zero-shot generalization. In this work, we propose a method to enhance generalization at test time, i.e., without modifying pretrained weights. Building on DISCO, which provides a dictionary of neural operators trained across different dynamics, we introduce a neural operator splitting strategy that, at test time, searches over compositions of training operators to approximate unseen dynamics. On challenging out-of-distribution tasks including parameter extrapolation and novel combinations of physics phenomena, our approach achieves state-of-the-art zero-shot generalization results, while being able to recover the underlying PDE parameters. These results underscore test-time computation as a key avenue for building flexible, compositional, and generalizable neural operators.

[1100] Assessing the Impact of Image Dataset Features on Privacy-Preserving Machine Learning

Lucas Lange, Maurice-Maximilian Heykeroth, Erhard Rahm

Main category: cs.LG

TL;DR: Analysis of how image dataset characteristics affect utility-privacy trade-off in differentially private CNNs, finding class imbalance increases vulnerability but DP helps, while fewer classes improves both utility and privacy.

Details

Motivation: ML models trained on sensitive data face security challenges and privacy risks. Need to understand how dataset characteristics affect the utility-privacy trade-off in differentially private computer vision models to guide practitioners in optimizing this balance.

Method: Analyzed multiple image datasets with different characteristics (class balance, number of classes, entropy, Fisher Discriminant Ratio) across various privacy budgets. Evaluated how these characteristics affect utility and vulnerability of both private and non-private CNN models.

Result: Imbalanced datasets increase vulnerability in minority classes, but differential privacy mitigates this issue. Datasets with fewer classes improve both model utility and privacy. High entropy or low FDR datasets deteriorate the utility-privacy trade-off.

Conclusion: Dataset characteristics significantly impact the utility-privacy trade-off in differentially private computer vision. These insights provide guidance for practitioners to estimate and optimize this trade-off based on dataset properties.

Abstract: Machine Learning (ML) is crucial in many sectors, including computer vision. However, ML models trained on sensitive data face security challenges, as they can be attacked and leak information. Privacy-Preserving Machine Learning (PPML) addresses this by using Differential Privacy (DP) to balance utility and privacy. This study identifies image dataset characteristics that affect the utility and vulnerability of private and non-private Convolutional Neural Network (CNN) models. Through analyzing multiple datasets and privacy budgets, we find that imbalanced datasets increase vulnerability in minority classes, but DP mitigates this issue. Datasets with fewer classes improve both model utility and privacy, while high entropy or low Fisher Discriminant Ratio (FDR) datasets deteriorate the utility-privacy trade-off. These insights offer valuable guidance for practitioners and researchers in estimating and optimizing the utility-privacy trade-off in image datasets, helping to inform data and privacy modifications for better outcomes based on dataset characteristics.

[1101] Reliability-Aware Determinantal Point Processes for Robust Informative Data Selection in Large Language Models

Ahmad Sarlak, Abolfazl Razi

Main category: cs.LG

TL;DR: ProbDPP: A reliability-aware data selection method that extends k-DPP to handle probabilistic data access failures, enabling robust diversity maximization under uncertainty with online learning of reliability parameters.

Details

Motivation: Traditional data selection methods like DPP assume perfect data availability, but real-world scenarios involve storage outages, communication failures, and stochastic access issues. Existing methods collapse under these conditions, creating a need for reliability-aware selection approaches.

Method: Introduces ProbDPP, a novel reliability-aware implementation of k-DPP that reformulates the objective with a regularization term accounting for probabilistic access. Frames the problem as a combinatorial semi-bandit problem and proposes a UCB-style algorithm to learn unknown reliability parameters online.

Result: The method provides theoretical regret bounds ensuring performance guarantees. ProbDPP enables robust selection of diverse data batches under uncertainty while maintaining computational efficiency.

Conclusion: ProbDPP addresses the critical gap in data selection methods by incorporating reliability considerations, making it suitable for practical deployment under computational and communication constraints with uncertain data access.

Abstract: Informative data selection is a key requirement for large language models (LLMs) to minimize the amount of data required for fine-tuning, network distillation, and token pruning, enabling fast and efficient deployment, especially under computational and communication constraints. Traditional subset selection methods, including those based on Determinantal Point Processes (DPP), focus on maximizing diversity but assume that selected data batches are always available error-free. This presumption prohibits their use under partial storage outage, imperfect communication, and stochastic access failures. Furthermore, we show that the original formulation collapses under such conditions. To address this gap, we introduce ProbDPP, a novel reliability-aware implementation of k-DPP that accounts for probabilistic data access by recasting the objective function with a regularization term that remains well-posed and decomposes into a geometric diversity term and unreliability cost. The resulting objective facilitates robust selection of diverse data batches under uncertainty. Furthermore, we frame this reliability-aware diversity maximization as a combinatorial semi-bandit problem and propose a UCB-style algorithm to efficiently learn the unknown reliability online. Theoretical analysis provides regret bounds for the proposed approach, ensuring performance guarantees.

[1102] Federated Learning With Individualized Privacy Through Client Sampling

Lucas Lange, Ole Borchardt, Erhard Rahm

Main category: cs.LG

TL;DR: Individualized Differential Privacy for Federated Learning adapts privacy protection to user preferences, improving utility-privacy trade-off compared to uniform DP baselines.

Details

Motivation: Address growing concerns about user data collection by moving beyond uniform privacy protection to individualized approaches that respect diverse user privacy preferences, balancing protection and utility.

Method: Extends SAMPLE algorithm from centralized settings to Federated Learning, calculates client-specific sampling rates based on heterogeneous privacy budgets, and integrates them into modified IDP-FedAvg algorithm.

Result: Achieves clear improvements over uniform DP baselines, reduces privacy-utility trade-off, and outperforms alternative SCALE method which assigns differing noise scales to clients.

Conclusion: Individualized DP in FL effectively balances privacy and utility by respecting user preferences, though challenges remain for complex tasks with non-i.i.d. data due to decentralized constraints.

Abstract: With growing concerns about user data collection, individualized privacy has emerged as a promising solution to balance protection and utility by accounting for diverse user privacy preferences. Instead of enforcing a uniform level of anonymization for all users, this approach allows individuals to choose privacy settings that align with their comfort levels. Building on this idea, we propose an adapted method for enabling Individualized Differential Privacy (IDP) in Federated Learning (FL) by handling clients according to their personal privacy preferences. By extending the SAMPLE algorithm from centralized settings to FL, we calculate client-specific sampling rates based on their heterogeneous privacy budgets and integrate them into a modified IDP-FedAvg algorithm. We test this method under realistic privacy distributions and multiple datasets. The experimental results demonstrate that our approach achieves clear improvements over uniform DP baselines, reducing the trade-off between privacy and utility. Compared to the alternative SCALE method in related work, which assigns differing noise scales to clients, our method performs notably better. However, challenges remain for complex tasks with non-i.i.d. data, primarily stemming from the constraints of the decentralized setting.

[1103] GAPNet: Plug-in Jointly Learning Task-Specific Graph for Dynamic Stock Relation

Yingjie Niu, Lanxin Lu, Changhong Jin, Ruihai Dong

Main category: cs.LG

TL;DR: GAPNet: Graph Adaptation Plug-in Network for financial forecasting that learns task-specific graph topologies and representations end-to-end, enhancing existing GNN models with dynamic edge rewiring capabilities.

Details

Motivation: Existing financial forecasting methods rely on predefined graphs to capture inter-stock relationships, but web-based financial signals are noisy, asynchronous, and hard to obtain, leading to poor generalizability and misalignment between predefined graphs and downstream tasks.

Method: GAPNet is a plug-in network that attaches to existing pairwise graph or hypergraph backbone models. It learns task-specific topology and representations jointly via two components: Spatial Perception Layer for short-term co-movements across assets, and Temporal Perception Layer for long-term dependencies under distribution shift.

Result: GAPNet consistently enhances profitability and stability across two real-world stock datasets, yielding annualized cumulative returns up to 0.47 for RT-GCN and 0.63 for CI-STHPAN, with peak Sharpe Ratios of 2.20 and 2.12 respectively.

Conclusion: Jointly learning graph structures and representations is essential for task-specific relational modeling in financial forecasting, and GAPNet’s plug-and-play design ensures broad applicability to diverse GNN-based architectures.

Abstract: The advent of the web has led to a paradigm shift in the financial relations, with the real-time dissemination of news, social discourse, and financial filings contributing significantly to the reshaping of financial forecasting. The existing methods rely on establishing relations a priori, i.e. predefining graphs to capture inter-stock relationships. However, the stock-related web signals are characterised by high levels of noise, asynchrony, and challenging to obtain, resulting in poor generalisability and non-alignment between the predefined graphs and the downstream tasks. To address this, we propose GAPNet, a Graph Adaptation Plug-in Network that jointly learns task-specific topology and representations in an end-to-end manner. GAPNet attaches to existing pairwise graph or hypergraph backbone models, enabling the dynamic adaptation and rewiring of edge topologies via two complementary components: a Spatial Perception Layer that captures short-term co-movements across assets, and a Temporal Perception Layer that maintains long-term dependency under distribution shift. Across two real-world stock datasets, GAPNet has been shown to consistently enhance the profitability and stability in comparision to the state-of-the-art models, yielding annualised cumulative returns of up to 0.47 for RT-GCN and 0.63 for CI-STHPAN, with peak Sharpe Ratio of 2.20 and 2.12 respectively. The plug-and-play design of GAPNet ensures its broad applicability to diverse GNN-based architectures. Our results underscore that jointly learning graph structures and representations is essential for task-specific relational modeling.

[1104] Reinforcement Learning via Conservative Agent for Environments with Random Delays

Jongsoo Lee, Jangwon Kim, Jiseok Jeong, Soohee Han

Main category: cs.LG

TL;DR: A conservative agent approach that transforms random-delay RL environments into constant-delay equivalents, enabling existing constant-delay methods to work effectively in random-delay settings without algorithm modifications.

Details

Motivation: Real-world RL applications often face delayed feedback that violates Markov assumptions. While constant delays have been studied, random delays remain largely unexplored due to their variability and unpredictability, creating significant challenges for RL algorithms.

Method: Proposes a conservative agent that reformulates random-delay environments into their constant-delay equivalents. This transformation allows any state-of-the-art constant-delay method to be directly applied to random-delay environments without modifying algorithmic structure.

Result: Empirical evaluation on continuous control tasks shows the conservative agent-based algorithm significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.

Conclusion: The proposed conservative agent provides a simple yet robust solution for decision-making under random delays, enabling effective extension of constant-delay methods to random-delay environments without performance sacrifice.

Abstract: Real-world reinforcement learning applications are often hindered by delayed feedback from environments, which violates the Markov assumption and introduces significant challenges. Although numerous delay-compensating methods have been proposed for environments with constant delays, environments with random delays remain largely unexplored due to their inherent variability and unpredictability. In this study, we propose a simple yet robust agent for decision-making under random delays, termed the conservative agent, which reformulates the random-delay environment into its constant-delay equivalent. This transformation enables any state-of-the-art constant-delay method to be directly extended to the random-delay environments without modifying the algorithmic structure or sacrificing performance. We evaluate the conservative agent-based algorithm on continuous control tasks, and empirical results demonstrate that it significantly outperforms existing baseline algorithms in terms of asymptotic performance and sample efficiency.

[1105] Domain-Adaptive and Scalable Dense Retrieval for Content-Based Recommendation

Mritunjay Pandey

Main category: cs.LG

TL;DR: Dense retrieval system for e-commerce recommendation using bi-encoder with contrastive learning, improving recall over BM25 while maintaining practical serving efficiency.

Details

Motivation: Traditional keyword matching (BM25) fails under vocabulary mismatch in e-commerce recommendation when user intent has limited lexical overlap with product metadata. Need semantic understanding to bridge this gap.

Method: Two-tower bi-encoder fine-tuned on Amazon Reviews dataset using supervised contrastive learning with Multiple Negatives Ranking Loss. Training pairs from review text (query proxy) and item metadata (positive document). Efficient serving via FAISS HNSW indexing and ONNX Runtime with INT8 quantization.

Result: Recall@10 improved from 0.26 (BM25) to 0.66 on review-to-title benchmark over 826,402 items. Achieved 6.1 ms median CPU inference latency with 4x model size reduction.

Conclusion: Provides end-to-end blueprint for domain-adapted dense retrieval from training to CPU-efficient serving at catalog scale, demonstrating practical semantic retrieval for e-commerce.

Abstract: E-commerce recommendation and search commonly rely on sparse keyword matching (e.g., BM25), which breaks down under vocabulary mismatch when user intent has limited lexical overlap with product metadata. We cast content-based recommendation as recommendation-as-retrieval: given a natural-language intent signal (a query or review), retrieve the top-K most relevant items from a large catalog via semantic similarity. We present a scalable dense retrieval system based on a two-tower bi-encoder, fine-tuned on the Amazon Reviews 2023 (Fashion) subset using supervised contrastive learning with Multiple Negatives Ranking Loss. We construct training pairs from review text (as a query proxy) and item metadata (as the positive document) and fine-tune on 50,000 sampled interactions with a maximum sequence length of 500 tokens. For efficient serving, we combine FAISS HNSW indexing with an ONNX Runtime inference pipeline using INT8 dynamic quantization. On a review-to-title benchmark over 826,402 catalog items, our approach improves Recall@10 from 0.26 (BM25) to 0.66, while meeting practical latency and model-size constraints: 6.1 ms median CPU inference latency (batch size 1) and a 4x reduction in model size. Overall, we provide an end-to-end, reproducible blueprint for taking domain-adapted dense retrieval from offline training to CPU-efficient serving at catalog scale.

[1106] Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Anxin Guo, Jingwei Li

Main category: cs.LG

TL;DR: The paper presents an information-theoretic framework showing that hallucination in LLMs is an optimal strategy under limited capacity, not just a training flaw.

Details

Motivation: To understand why LLMs hallucinate random facts with high confidence, even when trained on perfect data, by formalizing it as a membership testing problem.

Method: Unifies Bloom filter discrete error metrics with LLM continuous log-loss, analyzes memorization as membership testing in sparse fact regimes, establishes rate-distortion theorem showing optimal memory efficiency requires KL divergence minimization between fact/non-fact score distributions.

Result: Theoretical framework shows hallucination is information-theoretically optimal under limited capacity - optimal strategy is to assign high confidence to some non-facts rather than abstain or forget. Validated empirically on synthetic data showing hallucinations persist as natural consequence of lossy compression.

Conclusion: Hallucination in LLMs is not just a training flaw but an inherent consequence of limited capacity and lossy compression, providing a distinctive information-theoretic explanation for this phenomenon.

Abstract: Large language models often hallucinate with high confidence on “random facts” that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified “closed world” setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.

[1107] PyGALAX: An Open-Source Python Toolkit for Advanced Explainable Geospatial Machine Learning

Pingping Wang, Yihong Yuan, Lingcheng Li, Yongmei Lu

Main category: cs.LG

TL;DR: PyGALAX is a Python package for geospatial analysis combining AutoML and XAI to handle spatial heterogeneity in regression/classification tasks with automatic model selection and SHAP-based interpretability.

Details

Motivation: To address spatial non-stationarity and complex spatial relationships by making advanced geospatial machine learning methods accessible and interpretable for researchers and practitioners across geography, urban planning, and environmental science.

Method: Integrates automated machine learning (AutoML) for model selection/optimization with explainable AI (XAI) techniques like SHAP analysis. Includes automatic bandwidth selection and flexible kernel function selection for spatial modeling, building upon the GALAX framework.

Result: PyGALAX outperforms traditional geographically weighted regression (GWR) methods and provides greater flexibility and robustness for spatial modeling across diverse datasets. It packages functionalities into an accessible, reproducible Python toolkit.

Conclusion: PyGALAX effectively addresses spatial non-stationarity while maintaining interpretability, making advanced geospatial machine learning accessible to researchers and practitioners with transparent insights at global and local scales.

Abstract: PyGALAX is a Python package for geospatial analysis that integrates automated machine learning (AutoML) and explainable artificial intelligence (XAI) techniques to analyze spatial heterogeneity in both regression and classification tasks. It automatically selects and optimizes machine learning models for different geographic locations and contexts while maintaining interpretability through SHAP (SHapley Additive exPlanations) analysis. PyGALAX builds upon and improves the GALAX framework (Geospatial Analysis Leveraging AutoML and eXplainable AI), which has proven to outperform traditional geographically weighted regression (GWR) methods. Critical enhancements in PyGALAX from the original GALAX framework include automatic bandwidth selection and flexible kernel function selection, providing greater flexibility and robustness for spatial modeling across diverse datasets and research questions. PyGALAX not only inherits all the functionalities of the original GALAX framework but also packages them into an accessible, reproducible, and easily deployable Python toolkit while providing additional options for spatial modeling. It effectively addresses spatial non-stationarity and generates transparent insights into complex spatial relationships at both global and local scales, making advanced geospatial machine learning methods accessible to researchers and practitioners in geography, urban planning, environmental science, and related fields.

[1108] Efficient Deep Learning for Medical Imaging: Bridging the Gap Between High-Performance AI and Clinical Deployment

Cuong Manh Nguyen, Truong-Son Hy

Main category: cs.LG

TL;DR: Review paper on efficient deep learning architectures for medical image analysis, focusing on lightweight models (CNNs, transformers, linear complexity models) and compression techniques for clinical deployment.

Details

Motivation: Large-scale deep learning models face deployment challenges in clinical settings due to computational costs, latency constraints, and patient data privacy concerns with cloud-based processing.

Method: Comprehensive review categorizing efficient models into three streams: Convolutional Neural Networks (CNNs), Lightweight Transformers, and emerging Linear Complexity Models. Also examines model compression strategies including pruning, quantization, knowledge distillation, and low-rank factorization.

Result: Provides synthesis of efficient architectures and compression techniques that maintain diagnostic performance while reducing hardware requirements for clinical deployment.

Conclusion: Serves as roadmap for bridging gap between high-performance AI and resource-constrained clinical environments through on-device intelligence approaches.

Abstract: Deep learning has revolutionized medical image analysis, playing a vital role in modern clinical applications. However, the deployment of large-scale models in real-world clinical settings remains challenging due to high computational costs, latency constraints, and patient data privacy concerns associated with cloud-based processing. To address these bottlenecks, this review provides a comprehensive synthesis of efficient and lightweight deep learning architectures specifically tailored for the medical domain. We categorize the landscape of modern efficient models into three primary streams: Convolutional Neural Networks (CNNs), Lightweight Transformers, and emerging Linear Complexity Models. Furthermore, we examine key model compression strategies (including pruning, quantization, knowledge distillation, and low-rank factorization) and evaluate their efficacy in maintaining diagnostic performance while reducing hardware requirements. By identifying current limitations and discussing the transition toward on-device intelligence, this review serves as a roadmap for researchers and practitioners aiming to bridge the gap between high-performance AI and resource-constrained clinical environments.

[1109] Early Classification of Time Series in Non-Stationary Cost Regimes

Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire

Main category: cs.LG

TL;DR: Online learning adaptations for Early Classification of Time Series to handle non-stationary decision costs, with RL-based strategies showing strong robustness to cost drift.

Details

Motivation: Existing ECTS methods assume fixed, known decision costs, but in practice costs are often uncertain and change over time, causing mismatches between training and deployment objectives.

Method: Adapt ECTS approaches to online learning setting, focusing on separable methods where only triggering model is updated during deployment while classifier remains fixed. Propose bandit-based and RL-based online adaptations and baselines.

Result: Online learning effectively improves robustness of ECTS methods to cost drift, with RL-based strategies exhibiting strong and stable performance across varying cost regimes.

Conclusion: Online learning adaptations can address cost non-stationarity in ECTS, with RL-based approaches showing particular promise for handling changing decision cost dynamics.

Abstract: Early Classification of Time Series (ECTS) addresses decision-making problems in which predictions must be made as early as possible while maintaining high accuracy. Most existing ECTS methods assume that the time-dependent decision costs governing the learning objective are known, fixed, and correctly specified. In practice, however, these costs are often uncertain and may change over time, leading to mismatches between training-time and deployment-time objectives. In this paper, we study ECTS under two practically relevant forms of cost non-stationarity: drift in the balance between misclassification and decision delay costs, and stochastic realizations of decision costs that deviate from the nominal training-time model. To address these challenges, we revisit representative ECTS approaches and adapt them to an online learning setting. Focusing on separable methods, we update only the triggering model during deployment, while keeping the classifier fixed. We propose several online adaptations and baselines, including bandit-based and RL-based approaches, and conduct controlled experiments on synthetic data to systematically evaluate robustness under cost non-stationarity. Our results demonstrate that online learning can effectively improve the robustness of ECTS methods to cost drift, with RL-based strategies exhibiting strong and stable performance across varying cost regimes.

[1110] Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision

Yihao Xue, Allan Zhang, Jianhao Huang, Amit Sahai, Baharan Mirzasoleiman

Main category: cs.LG

TL;DR: Training LLMs with longer reasoning paths improves out-of-distribution generalization even after in-distribution performance saturates, suggesting reasoning length as a key scaling factor for robustness.

Details

Motivation: The paper investigates how increasing training-time reasoning length (through methods like RL fine-tuning or architectural recurrence) affects model generalization, particularly the relationship between in-distribution performance saturation and continued out-of-distribution improvement.

Method: Theoretical analysis combined with empirical experiments using two approaches: (1) increasing loop counts in looped Transformers on synthetic tasks, and (2) increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning tasks.

Result: Shows that out-of-distribution performance continues to improve as training-time reasoning length increases, even after in-distribution performance has saturated. Provides theoretical explanations via self-iteration inducing stronger inductive bias and regularization reducing reliance on shortcut solutions.

Conclusion: Reasoning length serves as an important scaling knob for improving model robustness and generalization, with out-of-distribution benefits requiring larger reasoning budgets than indicated by in-distribution validation alone.

Abstract: Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models that can solve complex problems previously out of reach. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. This makes reasoning length an important scaling knob. In this work, we identify a novel phenomenon (both theoretically and experimentally): under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length (e.g., the token budget in RL, or the loop count in looped Transformers) increases, even after in-distribution (ID) performance has saturated. This suggests that robustness may require a larger budget than ID validation alone would indicate. We provide theoretical explanations via two mechanisms: (i) self-iteration can induce a stronger inductive bias in the hypothesis class, reshaping ID-optimal solutions in ways that improve OOD generalization; and (ii) when shortcut solutions that work for ID samples but not for OOD samples persist in the hypothesis class, regularization can reduce the learned solution’s reliance on these shortcuts as the number of self-iterations increases. We complement the theory with empirical evidence from two realizations of scaling training-time reasoning length: increasing the number of loops in looped Transformers on a synthetic task, and increasing token budgets during RL fine-tuning of LLMs on mathematical reasoning.

[1111] Continuous-Utility Direct Preference Optimization

Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Zihao He, Muhammad Usman Rafique, Asad Aali, Muhammad Ali Jamshed, John M. Cioffi, Emily Fox

Main category: cs.LG

TL;DR: CU-DPO introduces continuous utility scoring for LLM reasoning alignment, replacing binary preferences with fine-grained scores to capture partial progress and improve strategy selection.

Details

Motivation: Current LLM reasoning alignment uses binary preference supervision that fails to capture partial progress or fine-grained reasoning quality, limiting effective strategy optimization.

Method: Two-stage training: (1) strategy selection via best-vs-all comparisons to choose optimal cognitive strategy, (2) execution refinement using margin-stratified pairs to correctly execute selected strategy.

Result: Improves strategy selection accuracy from 35-46% to 68-78% across seven base models, with consistent reasoning gains up to 6.6 points on in-distribution datasets and effective transfer to OOD tasks.

Conclusion: CU-DPO provides theoretical and empirical advantages over binary preference alignment, enabling better reasoning strategy optimization through continuous utility signals.

Abstract: Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.

[1112] SALAAD: Sparse And Low-Rank Adaptation via ADMM

Hao Ma, Melis Ilayda Bal, Liang Zhang, Bingcong Li, Niao He, Melanie Zeilinger, Michael Muehlebach

Main category: cs.LG

TL;DR: SALAAD is a plug-and-play framework that induces sparse and low-rank structures in LLMs during training to enable flexible control of model capacity for memory-constrained deployment.

Details

Motivation: Modern LLMs face compute and memory constraints in deployment, requiring flexible capacity control. Existing sparse/low-rank approaches use heuristic designs that ignore layer heterogeneity or need architectural modifications.

Method: Formulates structured weight learning under augmented Lagrangian framework with adaptive controller that dynamically balances training loss and structural constraints, preserving training stability while controlling capacity evolution.

Result: Substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Single training run yields continuous spectrum of model capacities for smooth deployment across diverse memory budgets.

Conclusion: SALAAD provides effective plug-and-play framework for flexible model capacity control without retraining, enabling elastic deployment of LLMs under memory constraints.

Abstract: Modern large language models are increasingly deployed under compute and memory constraints, making flexible control of model capacity a central challenge. While sparse and low-rank structures naturally trade off capacity and performance, existing approaches often rely on heuristic designs that ignore layer and matrix heterogeneity or require model-specific architectural modifications. We propose SALAAD, a plug-and-play framework applicable to different model architectures that induces sparse and low-rank structures during training. By formulating structured weight learning under an augmented Lagrangian framework and introducing an adaptive controller that dynamically balances the training loss and structural constraints, SALAAD preserves the stability of standard training dynamics while enabling explicit control over the evolution of effective model capacity during training. Experiments across model scales show that SALAAD substantially reduces memory consumption during deployment while achieving performance comparable to ad-hoc methods. Moreover, a single training run yields a continuous spectrum of model capacities, enabling smooth and elastic deployment across diverse memory budgets without the need for retraining.

[1113] Dynamic Prior Thompson Sampling for Cold-Start Exploration in Recommender Systems

Zhenyu Zhao, David Zhang, Ellie Zhao, Ehsan Saberian

Main category: cs.LG

TL;DR: Dynamic Prior Thompson Sampling addresses cold-start exploration in recommender systems by replacing uniform priors with tunable priors that control exploration intensity for new items.

Details

Motivation: Standard Thompson Sampling with uniform Beta(1,1) priors assumes 50% success rate for new items, which is often overly optimistic when true base rates are much lower. This causes systematic over-allocation to weak items, especially with batched updates and pipeline latency where new items remain "no data" for hours.

Method: Proposes Dynamic Prior Thompson Sampling with a closed-form quadratic solution for prior mean that enforces P(X_j > Y_k) = epsilon at introduction time, making exploration intensity predictable and tunable while preserving Bayesian updates.

Result: Across Monte Carlo validation, offline batched simulations, and large-scale online experiments on thumbnail personalization serving millions of users, dynamic priors deliver precise exploration control and improved efficiency versus uniform-prior baseline.

Conclusion: Dynamic priors provide a practical solution for controlling exploration intensity in cold-start scenarios, improving efficiency in large-scale recommender systems with delayed feedback.

Abstract: Cold-start exploration is a core challenge in large-scale recommender systems: new or data-sparse items must receive traffic to estimate value, but over-exploration harms users and wastes impressions. In practice, Thompson Sampling (TS) is often initialized with a uniform Beta(1,1) prior, implicitly assuming a 50% success rate for unseen items. When true base rates are far lower, this optimistic prior systematically over-allocates to weak items. The impact is amplified by batched policy updates and pipeline latency: for hours, newly launched items can remain effectively “no data,” so the prior dominates allocation before feedback is incorporated. We propose Dynamic Prior Thompson Sampling, a prior design that directly controls the probability that a new arm outcompetes the incumbent winner. Our key contribution is a closed-form quadratic solution for the prior mean that enforces P(X_j > Y_k) = epsilon at introduction time, making exploration intensity predictable and tunable while preserving TS Bayesian updates. Across Monte Carlo validation, offline batched simulations, and a large-scale online experiment on a thumbnail personalization system serving millions of users, dynamic priors deliver precise exploration control and improved efficiency versus a uniform-prior baseline.

[1114] Optimal Budgeted Adaptation of Large Language Models

Jing Wang, Jie Shen, Dean Foster, Zohar Karnin, Jeremy C Weiss

Main category: cs.LG

TL;DR: A framework for budget-aware supervised fine-tuning of LLMs using contextual Stackelberg game formulation with label-querying strategies to optimize label efficiency.

Details

Motivation: Addresses the fundamental trade-off between labeled data availability and downstream accuracy in fine-tuning large language models, where labeled data is often scarce and expensive to obtain.

Method: Formulates LLM adaptation as a contextual Stackelberg game where the learner commits to scoring policy and label-querying strategy, while environment selects challenging supervised alternatives. Incorporates finite supervision budget directly into learning objective and uses Largest-Latency-First confidence gate for selective label querying.

Result: Achieves $\tilde{O}(d\sqrt{T})$ regret under standard linear contextual assumptions in full-feedback regime, and budget-aware regret bound of $\tilde{O}(\sqrt{dB} + c\sqrt{B})$ with $B=βT$ using LLF confidence gate.

Conclusion: Proposes a principled game-theoretic approach to budget-aware fine-tuning that explicitly addresses label efficiency while maintaining theoretical guarantees on regret bounds.

Abstract: The trade-off between labeled data availability and downstream accuracy remains a central challenge in fine-tuning large language models (LLMs). We propose a principled framework for \emph{budget-aware supervised fine-tuning} by casting LLM adaptation as a contextual Stackelberg game. In our formulation, the learner (leader) commits to a scoring policy and a label-querying strategy, while an adaptive environment (follower) selects challenging supervised alternatives in response. To explicitly address label efficiency, we incorporate a finite supervision budget directly into the learning objective. Our algorithm operates in the full-feedback regime and achieves $\tilde{O}(d\sqrt{T})$ regret under standard linear contextual assumptions. We extend the framework with a Largest-Latency-First (LLF) confidence gate that selectively queries labels, achieving a budget-aware regret bound of $\tilde{O}(\sqrt{dB} + c\sqrt{B})$ with $B=βT$.

[1115] SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery

Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Tanvir Hasan, Zenghan Wang, Suman Ghosh, Sandeep Manandhar, Shikhar Shiromani, Twisha Shah, Naoto Tokuyama, Anant Madabhushi

Main category: cs.LG

TL;DR: SAGE is an agentic AI system that generates interpretable pathology biomarkers by grounding them in biological evidence through literature-anchored reasoning and multimodal data analysis.

Details

Motivation: Current AI models in computational pathology are black-box and lack interpretability, hindering clinical adoption. Engineered biomarkers offer interpretability but often lack systematic biological validation.

Method: SAGE integrates literature-anchored reasoning with multimodal data analysis to correlate image-derived features with molecular biomarkers (gene expression) and clinical outcomes. It coordinates specialized agents for biological contextualization and empirical hypothesis validation.

Result: SAGE prioritizes transparent, biologically supported biomarkers and advances clinical translation of computational pathology by generating interpretable engineered biomarkers.

Conclusion: SAGE addresses the interpretability gap in computational pathology by systematically generating biologically-grounded, interpretable biomarkers through agentic AI and multimodal analysis.

Abstract: Despite significant progress in computational pathology, many AI models remain black-box and difficult to interpret, posing a major barrier to clinical adoption due to limited transparency and explainability. This has motivated continued interest in engineered image-based biomarkers, which offer greater interpretability but are often proposed based on anecdotal evidence or fragmented prior literature rather than systematic biological validation. We introduce SAGE (Structured Agentic system for hypothesis Generation and Evaluation), an agentic AI system designed to identify interpretable, engineered pathology biomarkers by grounding them in biological evidence. SAGE integrates literature-anchored reasoning with multimodal data analysis to correlate image-derived features with molecular biomarkers, such as gene expression, and clinically relevant outcomes. By coordinating specialized agents for biological contextualization and empirical hypothesis validation, SAGE prioritizes transparent, biologically supported biomarkers and advances the clinical translation of computational pathology.

[1116] From drift to adaptation to the failed ml model: Transfer Learning in Industrial MLOps

Waqar Muhammad Ashraf, Talha Ansar, Fahad Ahmed, Jawad Hussain, Muhammad Mujtaba Abbas, Vivek Dua

Main category: cs.LG

TL;DR: This paper compares transfer learning techniques (ETL, ALTL, LLTL) for updating failed ANN models under data drift in industrial MLOps, using flue gas differential pressure monitoring in a thermal power plant as a case study.

Details

Motivation: The paper addresses the need for systematic frameworks to update ML models when they fail under data drift in production environments, particularly for reliable Machine Learning Operations (MLOps) in industrial settings.

Method: The study compares three transfer learning techniques for updating failed feedforward ANN models: ensemble transfer learning (ETL), all-layers transfer learning (ALTL), and last-layer transfer learning (LLTL). The methods are evaluated using flue gas differential pressure data from a 660 MW thermal power plant’s air preheater unit, which mimics batch processes due to load cycling.

Result: ETL provides higher predictive accuracy for smaller batch sizes (5 days), while ALTL is more suitable for larger batch sizes (8 days). Computational requirements for model updates show mixed trends across different batch sizes. The study provides empirical insights for adapting failed models to data drifts in industrial process monitoring.

Conclusion: The paper offers fundamental and empirical insights for MLOps practitioners to adapt failed models to data drifts in industrial processes, demonstrating that different transfer learning techniques are optimal depending on batch size characteristics.

Abstract: Model adaptation to production environment is critical for reliable Machine Learning Operations (MLOps), less attention is paid to developing systematic framework for updating the ML models when they fail under data drift. This paper compares the transfer learning enabled model update strategies including ensemble transfer learning (ETL), all-layers transfer learning (ALTL), and last-layer transfer learning (LLTL) for updating the failed feedforward artificial neural network (ANN) model. The flue gas differential pressure across the air preheater unit installed in a 660 MW thermal power plant is analyzed as a case study since it mimics the batch processes due to load cycling in the power plant. Updating the failed ANN model by three transfer learning techniques reveals that ETL provides relatively higher predictive accuracy for the batch size of 5 days than those of LLTL and ALTL. However, ALTL is found to be suitable for effective update of the model trained on large batch size (8 days). A mixed trend is observed for computational requirement (hyperparameter tuning and model training) of model update techniques for different batch sizes. These fundamental and empiric insights obtained from the batch process-based industrial case study can assist the MLOps practitioners in adapting the failed models to data drifts for the accurate monitoring of industrial processes.

[1117] Probing the Knowledge Boundary: An Interactive Agentic Framework for Deep Knowledge Extraction

Yuheng Yang, Siqi Zhu, Tao Feng, Ge Liu, Jiaxuan You

Main category: cs.LG

TL;DR: Interactive agentic framework for systematically probing and quantifying knowledge boundaries in Large Language Models using adaptive exploration policies and knowledge processing pipeline.

Details

Motivation: LLMs act as compressed knowledge bases, but their actual knowledge content and boundaries remain unclear. Existing benchmarks are mostly static and lack systematic probing capabilities, creating a need for more dynamic, interactive approaches to understand what knowledge LLMs truly contain.

Method: Proposes an interactive agentic framework with four adaptive exploration policies to probe knowledge at different granularities. Includes a three-stage knowledge processing pipeline: vector-based filtering to remove duplicates, LLM-based adjudication for semantic overlaps, and domain-relevance auditing to retain valid knowledge units.

Result: Recursive taxonomy is the most effective exploration strategy. Clear knowledge scaling law observed where larger models extract more knowledge. Identified Pass@1-versus-Pass@k trade-off: domain-specialized models have higher initial accuracy but degrade rapidly, while general-purpose models maintain stable performance. Training data composition leads to distinct, measurable knowledge profiles across model families.

Conclusion: The proposed interactive framework enables systematic extraction and quantification of LLM knowledge, revealing important patterns about knowledge distribution, scaling laws, and performance trade-offs that vary by model specialization and training data composition.

Abstract: Large Language Models (LLMs) can be seen as compressed knowledge bases, but it remains unclear what knowledge they truly contain and how far their knowledge boundaries extend. Existing benchmarks are mostly static and provide limited support for systematic knowledge probing. In this paper, we propose an interactive agentic framework to systematically extract and quantify the knowledge of LLMs. Our method includes four adaptive exploration policies to probe knowledge at different granularities. To ensure the quality of extracted knowledge, we introduce a three-stage knowledge processing pipeline that combines vector-based filtering to remove exact duplicates, LLM-based adjudication to resolve ambiguous semantic overlaps, and domain-relevance auditing to retain valid knowledge units. Through extensive experiments, we find that recursive taxonomy is the most effective exploration strategy. We also observe a clear knowledge scaling law, where larger models consistently extract more knowledge. In addition, we identify a Pass@1-versus-Pass@k trade-off: domain-specialized models achieve higher initial accuracy but degrade rapidly, while general-purpose models maintain stable performance during extended extraction. Finally, our results show that differences in training data composition lead to distinct and measurable knowledge profiles across model families.

[1118] Multimodal Scientific Learning Beyond Diffusions and Flows

Leonardo Ferreira Guilhoto, Akshat Kaushal, Paris Perdikaris

Main category: cs.LG

TL;DR: Mixture Density Networks (MDNs) offer a principled alternative to implicit generative models for multimodal uncertainty quantification in scientific machine learning, providing better data efficiency and interpretability for physics problems with multiple solution branches.

Details

Motivation: Scientific machine learning needs models that can capture multimodal conditional uncertainty from ill-posed inverse problems, multistability, and chaotic dynamics. Current approaches like diffusion and flow-based models are data-hungry, computationally expensive, and misaligned with structured scientific solution spaces.

Method: The paper proposes using Mixture Density Networks (MDNs) as explicit parametric density estimators that impose an inductive bias tailored to low-dimensional, multimodal physics. MDNs enable direct global allocation of probability mass across distinct solution branches through a unified probabilistic framework contrasting explicit and implicit distribution networks.

Result: MDNs achieve superior generalization, interpretability, and sample efficiency compared to implicit generative models across a range of inverse, multistable, and chaotic scientific regression tasks. They reliably recover separated modes even when scientific data is scarce.

Conclusion: MDNs provide a principled, data-efficient alternative for multimodal uncertainty quantification in scientific machine learning, offering better alignment with structured scientific solution spaces than current implicit generative models.

Abstract: Scientific machine learning (SciML) increasingly requires models that capture multimodal conditional uncertainty arising from ill-posed inverse problems, multistability, and chaotic dynamics. While recent work has favored highly expressive implicit generative models such as diffusion and flow-based methods, these approaches are often data-hungry, computationally costly, and misaligned with the structured solution spaces frequently found in scientific problems. We demonstrate that Mixture Density Networks (MDNs) provide a principled yet largely overlooked alternative for multimodal uncertainty quantification in SciML. As explicit parametric density estimators, MDNs impose an inductive bias tailored to low-dimensional, multimodal physics, enabling direct global allocation of probability mass across distinct solution branches. This structure delivers strong data efficiency, allowing reliable recovery of separated modes in regimes where scientific data is scarce. We formalize these insights through a unified probabilistic framework contrasting explicit and implicit distribution networks, and demonstrate empirically that MDNs achieve superior generalization, interpretability, and sample efficiency across a range of inverse, multistable, and chaotic scientific regression tasks.

[1119] On the Spectral Flattening of Quantized Embeddings

Junlin Huang, Wenyi Fang, Zhenheng Tang, Yuxin Wang, Xueze Kang, Yang Zheng, Bo Li, Xiaowen Chu

Main category: cs.LG

TL;DR: The paper analyzes why training LLMs at ultra-low precision fails, linking Zipfian statistics to spectral properties and showing uniform quantization destroys critical spectral tails needed for semantic encoding.

Details

Motivation: Training LLMs at ultra-low precision faces instability issues, and the paper aims to understand the fundamental conflict between discrete quantization constraints and the heavy-tailed spectral nature of linguistic data.

Method: Formalizes connection between Zipfian statistics and random matrix theory, proves power-law decay in singular value spectra is essential for semantic encoding, derives theoretical bounds showing uniform quantization introduces noise that truncates spectral tails, and validates empirically across GPT-2 and TinyLlama architectures.

Result: Theoretical analysis shows uniform quantization introduces noise floor that disproportionately truncates spectral tails, causing spectral flattening and increased stable rank. Empirical validation confirms this geometric degradation leads to representational collapse.

Conclusion: The work quantifies spectral sensitivity of LLMs and establishes spectral fidelity as a necessary condition for stable low-bit optimization, explaining why ultra-low precision training fails and providing theoretical foundation for future quantization methods.

Abstract: Training Large Language Models (LLMs) at ultra-low precision is critically impeded by instability rooted in the conflict between discrete quantization constraints and the intrinsic heavy-tailed spectral nature of linguistic data. By formalizing the connection between Zipfian statistics and random matrix theory, we prove that the power-law decay in the singular value spectra of embeddings is a fundamental requisite for semantic encoding. We derive theoretical bounds showing that uniform quantization introduces a noise floor that disproportionately truncates this spectral tail, which induces spectral flattening and a strictly provable increase in the stable rank of representations. Empirical validation across diverse architectures including GPT-2 and TinyLlama corroborates that this geometric degradation precipitates representational collapse. This work not only quantifies the spectral sensitivity of LLMs but also establishes spectral fidelity as a necessary condition for stable low-bit optimization.

[1120] Forest-Guided Semantic Transport for Label-Supervised Manifold Alignment

Adrien Aumon, Myriam Lizotte, Guy Wolf, Kevin R. Moon, Jake S. Rhodes

Main category: cs.LG

TL;DR: FoSTA is a label-supervised manifold alignment method that uses forest-induced geometry to denoise intra-domain structure and align multimodal data via hierarchical semantic transport, improving correspondence recovery and label transfer.

Details

Motivation: Existing label-supervised manifold alignment methods rely on Euclidean geometry to model intra-domain relationships, which can fail when features are weakly related to the task of interest, leading to noisy structure and degraded alignment quality.

Method: FoSTA leverages forest-induced geometry to denoise intra-domain structure and recover task-relevant manifolds. It builds semantic representations from label-informed forest affinities and aligns them via fast, hierarchical semantic transport to capture meaningful cross-domain relationships.

Result: Extensive comparisons show FoSTA improves correspondence recovery and label transfer on synthetic benchmarks and delivers strong performance in practical single-cell applications including batch correction and biological conservation.

Conclusion: FoSTA provides a scalable alignment framework that effectively addresses limitations of Euclidean-based approaches by using forest-guided geometry for better manifold alignment in multimodal data integration.

Abstract: Label-supervised manifold alignment bridges the gap between unsupervised and correspondence-based paradigms by leveraging shared label information to align multimodal datasets. Still, most existing methods rely on Euclidean geometry to model intra-domain relationships. This approach can fail when features are only weakly related to the task of interest, leading to noisy, semantically misleading structure and degraded alignment quality. To address this limitation, we introduce FoSTA (Forest-guided Semantic Transport Alignment), a scalable alignment framework that leverages forest-induced geometry to denoise intra-domain structure and recover task-relevant manifolds prior to alignment. FoSTA builds semantic representations directly from label-informed forest affinities and aligns them via fast, hierarchical semantic transport, capturing meaningful cross-domain relationships. Extensive comparisons with established baselines demonstrate that FoSTA improves correspondence recovery and label transfer on synthetic benchmarks and delivers strong performance in practical single-cell applications, including batch correction and biological conservation.

[1121] Scalable Random Wavelet Features: Efficient Non-Stationary Kernel Approximation with Convergence Guarantees

Sawan Kumar, Souvik Chakraborty

Main category: cs.LG

TL;DR: Random Wavelet Features (RWF) is a scalable framework for approximating non-stationary kernels using wavelet-based random features, bridging the gap between expressive but computationally demanding models and scalable but limited stationary approximations.

Details

Motivation: Most scalable kernel methods rely on the simplifying assumption of stationarity, forcing a trade-off between using expressive but computationally demanding models like Deep Gaussian Processes or scalable but limited methods like Random Fourier Features. There's a need for scalable methods that can handle non-stationary processes where statistical properties vary across the input domain.

Method: Introduces Random Wavelet Features (RWF), which constructs scalable, non-stationary kernel approximations by sampling from wavelet families. The framework leverages the inherent localization and multi-resolution structure of wavelets to generate explicit feature maps that capture complex, input-dependent patterns, generalizing RFF to the non-stationary setting.

Result: RWF provides comprehensive theoretical analysis including positive definiteness, unbiasedness, and uniform convergence guarantees. Empirical results on challenging synthetic and real-world datasets show that RWF outperforms stationary random features and offers a compelling accuracy-efficiency trade-off against more complex models.

Conclusion: RWF unlocks scalable and expressive kernel methods for a broad class of real-world non-stationary problems by closing the gap between computational efficiency and modeling expressiveness for non-stationary processes.

Abstract: Modeling non-stationary processes, where statistical properties vary across the input domain, is a critical challenge in machine learning; yet most scalable methods rely on a simplifying assumption of stationarity. This forces a difficult trade-off: use expressive but computationally demanding models like Deep Gaussian Processes, or scalable but limited methods like Random Fourier Features (RFF). We close this gap by introducing Random Wavelet Features (RWF), a framework that constructs scalable, non-stationary kernel approximations by sampling from wavelet families. By harnessing the inherent localization and multi-resolution structure of wavelets, RWF generates an explicit feature map that captures complex, input-dependent patterns. Our framework provides a principled way to generalize RFF to the non-stationary setting and comes with a comprehensive theoretical analysis, including positive definiteness, unbiasedness, and uniform convergence guarantees. We demonstrate empirically on a range of challenging synthetic and real-world datasets that RWF outperforms stationary random features and offers a compelling accuracy-efficiency trade-off against more complex models, unlocking scalable and expressive kernel methods for a broad class of real-world non-stationary problems.

[1122] ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Zhishen Sun, Sizhe Dang, Guang Dai, Haishan Ye

Main category: cs.LG

TL;DR: ESSAM combines Evolution Strategies with Sharpness-Aware Maximization for efficient full-parameter fine-tuning of LLMs on mathematical reasoning tasks, achieving comparable performance to RL methods with significantly reduced GPU memory usage.

Details

Motivation: RL methods for improving mathematical reasoning in LLMs have high GPU memory requirements, making them impractical for resource-constrained settings. There's a need for efficient fine-tuning methods that maintain performance while reducing memory usage.

Method: Proposes ESSAM (Evolution Strategies with Sharpness-Aware Maximization), which combines zero-order parameter search from Evolution Strategies with Sharpness-Aware Maximization to improve generalization. This approach enables full parameter fine-tuning with reduced memory requirements compared to RL methods.

Result: On GSM8K mathematical reasoning task: achieves 78.27% average accuracy across models, comparable to RL methods (PPO: 77.72%, GRPO: 78.34%). Reduces GPU memory usage by 18× compared to PPO and 10× compared to GRPO.

Conclusion: ESSAM provides an efficient alternative to RL fine-tuning for mathematical reasoning in LLMs, achieving competitive performance with dramatically reduced GPU memory requirements, making it suitable for resource-constrained environments.

Abstract: Reinforcement learning (RL) has become a key training step for improving mathematical reasoning in large language models (LLMs), but it often has high GPU memory usage, which makes it hard to use in settings with limited resources. To reduce these issues, we propose Evolution Strategies with Sharpness-Aware Maximization (ESSAM), a full parameter fine-tuning framework that tightly combines the zero-order search in parameter space from Evolution Strategies (ES) with the Sharpness-Aware Maximization (SAM) to improve generalization. We conduct fine-tuning experiments on the mainstream mathematica reasoning task GSM8K. The results show that ESSAM achieves an average accuracy of 78.27% across all models and its overall performance is comparable to RL methods. It surpasses classic RL algorithm PPO with an accuracy of 77.72% and is comparable to GRPO with an accuracy of 78.34%, and even surpassing them on some models. In terms of GPU memory usage, ESSAM reduces the average GPU memory usage by $18\times$ compared to PPO and by $10\times$ compared to GRPO, achieving an extremely low GPU memory usage.

[1123] Predicting Anemia Among Under-Five Children in Nepal Using Machine Learning and Deep Learning

Deepak Bastola, Pitambar Acharya, Dipak Dulal, Rabina Dhakal, Yang Li

Main category: cs.LG

TL;DR: Machine learning and deep learning models applied to predict childhood anemia in Nepal using demographic and health survey data, with logistic regression achieving best F1-score and recall.

Details

Motivation: Childhood anemia is a major public health challenge in Nepal associated with impaired growth, cognition, and increased morbidity. The study aims to develop predictive models using machine learning to identify risk factors and improve screening.

Method: Used Nepal Demographic and Health Survey (NDHS 2022) data with 1,855 children and 48 candidate features. Applied four feature selection techniques (Chi-square, mutual information, point-biserial correlation, Boruta) to identify stable features. Compared eight traditional ML classifiers (LR, KNN, DT, RF, XGBoost, SVM, NB, LDA) with two deep learning models (DNN and TabNet).

Result: Five features consistently selected: child age, recent fever, household size, maternal anemia, and parasite deworming. Logistic regression achieved best recall (0.701) and highest F1-score (0.649), DNN achieved highest accuracy (0.709), and SVM yielded highest AUC (0.736).

Conclusion: Both machine learning and deep learning models provide competitive anemia prediction. Key interpretable features (child age, infection proxy, maternal anemia, deworming history) are central for risk stratification and public health screening in Nepal.

Abstract: Childhood anemia remains a major public health challenge in Nepal and is associated with impaired growth, cognition, and increased morbidity. Using World Health Organization hemoglobin thresholds, we defined anemia status for children aged 6-59 months and formulated a binary classification task by grouping all anemia severities as \emph{anemic} versus \emph{not anemic}. We analyzed Nepal Demographic and Health Survey (NDHS 2022) microdata comprising 1,855 children and initially considered 48 candidate features spanning demographic, socioeconomic, maternal, and child health characteristics. To obtain a stable and substantiated feature set, we applied four features selection techniques (Chi-square, mutual information, point-biserial correlation, and Boruta) and prioritized features supported by multi-method consensus. Five features: child age, recent fever, household size, maternal anemia, and parasite deworming were consistently selected by all methods, while amenorrhea, ethnicity indicators, and provinces were frequently retained. We then compared eight traditional machine learning classifiers (LR, KNN, DT, RF, XGBoost, SVM, NB, LDA) with two deep learning models (DNN and TabNet) using standard evaluation metrics, emphasizing F1-score and recall due to class imbalance. Among all models, logistic regression attained the best recall (0.701) and the highest F1-score (0.649), while DNN achieved the highest accuracy (0.709), and SVM yielded the strongest discrimination with the highest AUC (0.736). Overall, the results indicate that both machine learning and deep learning models can provide competitive anemia prediction and the interpretable features such as child age, infection proxy, maternal anemia, and deworming history are central for risk stratification and public health screening in Nepal.

[1124] LASS-ODE: Scaling ODE Computations to Connect Foundation Models with Dynamical Physical Systems

Haoran Li, Chenhan Xiao, Lihao Mai, Yang Weng, Erik Blasch

Main category: cs.LG

TL;DR: LASS-ODE is a foundation model for ODE systems that uses token-wise locally linear ODE representations to scale physics-informed learning and introduces inter-system attention with a common structure hub for knowledge sharing across systems.

Details

Motivation: Current foundation models have transformed language, vision, and time series analysis, but dynamic predictions for physical systems remain limited due to two key challenges: (1) physics-computation scalability issues where physics-informed learning doesn't scale to extensive systems, and (2) knowledge-sharing inefficiency where attention mechanisms are limited to individual systems rather than extracting shared ODE structures across systems.

Method: Proposes LASS-ODE with two key innovations: (1) token-wise locally linear ODE representations that preserve physical fidelity while avoiding expensive nonlinear integration, enabling scaling to foundation-model regimes; (2) inter-system attention augmented with a common structure hub (CSH) that stores shared tokens and aggregates knowledge across different ODE systems.

Result: The model was pretrained on 40GB ODE trajectory collections and demonstrates strong in-domain performance, zero-shot generalization across diverse ODE systems, and additional improvements through fine-tuning.

Conclusion: LASS-ODE successfully addresses scalability and knowledge-sharing challenges in physics-informed learning for ODE systems, enabling foundation-model scale applications with efficient computation and cross-system generalization.

Abstract: Foundation models have transformed language, vision, and time series data analysis, yet progress on dynamic predictions for physical systems remains limited. Given the complexity of physical constraints, two challenges stand out. $(i)$ Physics-computation scalability: physics-informed learning can enforce physical regularization, but its computation (e.g., ODE integration) does not scale to extensive systems. $(ii)$ Knowledge-sharing efficiency: the attention mechanism is primarily computed within each system, which limits the extraction of shared ODE structures across systems. We show that enforcing ODE consistency does not require expensive nonlinear integration: a token-wise locally linear ODE representation preserves physical fidelity while scaling to foundation-model regimes. Thus, we propose novel token representations that respect locally linear ODE evolution. Such linearity substantially accelerates integration while accurately approximating the local data manifold. Second, we introduce a simple yet effective inter-system attention that augments attention with a common structure hub (CSH) that stores shared tokens and aggregates knowledge across systems. The resulting model, termed LASS-ODE (\underline{LA}rge-\underline{S}cale \underline{S}mall \underline{ODE}), is pretrained on our $40$GB ODE trajectory collections to enable strong in-domain performance, zero-shot generalization across diverse ODE systems, and additional improvements through fine-tuning.

[1125] How Does Unfaithful Reasoning Emerge from Autoregressive Training? A Study of Synthetic Experiments

Fuxin Wang, Amr Alazali, Yiqiao Zhong

Main category: cs.LG

TL;DR: Small transformers trained on noisy arithmetic data show a threshold phenomenon: below critical noise, they learn faithful stepwise reasoning; above it, they switch to unfaithful skip-step reasoning via a mixed mode with increased prediction entropy, suggesting implicit self-verification emerges from autoregressive training.

Details

Motivation: To understand what constitutes faithful chain-of-thought (CoT) reasoning and how unfaithfulness emerges from autoregressive training, given that LLM-generated CoT often contains logically inconsistent intermediate steps that don't reflect causal relationships to final answers.

Method: Controlled synthetic experiments training small transformers on noisy data to solve modular arithmetic expressions step by step (Arithmetic Expression Reasoning task), analyzing training dynamics across different noise levels.

Result: Models learn faithful reasoning only when training noise is below a critical threshold (attributable to simplicity bias). At higher noise levels, training exhibits transition from faithful stepwise reasoning to unfaithful skip-step reasoning via intermediate mixed mode with transient prediction entropy increase. Mechanistic analysis shows models learn to encode internal uncertainty by resolving inconsistent reasoning steps.

Conclusion: The study provides fundamental insights into CoT reasoning faithfulness, revealing noise-dependent training dynamics and suggesting that implicit self-verification emerges from autoregressive training when models learn to handle internal uncertainty in reasoning steps.

Abstract: Chain-of-thought (CoT) reasoning generated by large language models (LLMs) is often unfaithful: intermediate steps can be logically inconsistent or fail to reflect the causal relationship leading to the final answer. Despite extensive empirical observations, a fundamental understanding of CoT is lacking–what constitutes faithful CoT reasoning, and how unfaithfulness emerges from autoregressive training. We study these questions using well-controlled synthetic experiments, training small transformers on noisy data to solve modular arithmetic expressions step by step, a task we term Arithmetic Expression Reasoning. We find that models can learn faithful reasoning that causally follows the underlying arithmetic rules, but only when the training noise is below a critical threshold, a phenomenon attributable to simplicity bias. At higher noise levels, training dynamics exhibit a transition from faithful stepwise reasoning to unfaithful skip-step reasoning via an intermediate mixed mode characterized by a transient increase in prediction entropy. Mechanistic analysis reveals that models learn to encode internal uncertainty by resolving inconsistent reasoning steps, which suggests the emergence of implicit self-verification from autoregressive training.

[1126] Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu, Xingjun Ma, Sarah Erfani, Christopher Leckie, Hanxun Huang

Main category: cs.LG

TL;DR: UltraBreak is a universal and transferable jailbreak framework for vision-language models that uses vision-level regularization and semantic textual objectives to create adversarial patterns that generalize across models and attack targets.

Details

Motivation: Vision-language models expand attack surfaces with image-based jailbreaks, but existing gradient-based methods overfit to single white-box surrogates and fail to transfer to black-box models, requiring a more universal and transferable approach.

Method: Combines vision-level regularization (transformations and constraints) with semantic-based textual objectives defined in the target LLM’s embedding space to discover universal adversarial patterns that mitigate surrogate overfitting.

Result: Extensive experiments show UltraBreak consistently outperforms prior jailbreak methods and achieves strong transferability across both models and attack targets.

Conclusion: Smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks in vision-language models.

Abstract: Vision-language models (VLMs) extend large language models (LLMs) with vision encoders, enabling text generation conditioned on both images and text. However, this multimodal integration expands the attack surface by exposing the model to image-based jailbreaks crafted to induce harmful responses. Existing gradient-based jailbreak methods transfer poorly, as adversarial patterns overfit to a single white-box surrogate and fail to generalise to black-box models. In this work, we propose Universal and transferable jailbreak (UltraBreak), a framework that constrains adversarial patterns through transformations and regularisation in the vision space, while relaxing textual targets through semantic-based objectives. By defining its loss in the textual embedding space of the target LLM, UltraBreak discovers universal adversarial patterns that generalise across diverse jailbreak objectives. This combination of vision-level regularisation and semantically guided textual supervision mitigates surrogate overfitting and enables strong transferability across both models and attack targets. Extensive experiments show that UltraBreak consistently outperforms prior jailbreak methods. Further analysis reveals why earlier approaches fail to transfer, highlighting that smoothing the loss landscape via semantic objectives is crucial for enabling universal and transferable jailbreaks. The code is publicly available in our \href{https://github.com/kaiyuanCui/UltraBreak}{GitHub repository}.

[1127] SFMP: Fine-Grained, Hardware-Friendly and Search-Free Mixed-Precision Quantization for Large Language Models

Xin Nie, Haicheng Zhang, Liang Dong, Beining Feng, Jinhong Weng, Guiling Sun

Main category: cs.LG

TL;DR: SFMP is a search-free, hardware-friendly mixed-precision quantization framework for LLMs that uses fractional bit-widths, block-wise mixed-precision, weight reordering, and unified GEMM kernels.

Details

Motivation: Existing mixed-precision quantization methods either rely on expensive discrete optimization or create hardware inefficiencies due to irregular memory layouts, making them impractical for real-world deployment.

Method: Four novel techniques: 1) Fractional bit-width transforms discrete precision allocation to continuous problem, 2) Block-wise mixed-precision enables fine-grained precision while remaining hardware-friendly, 3) Row-column weight reordering aggregates salient weights with minimal inference overhead, 4) Unified GEMM kernel supports mixed-precision at arbitrary average bit-width.

Result: SFMP outperforms state-of-the-art layer-wise mixed-precision methods under same memory constraints, while significantly reducing quantization cost and improving inference efficiency.

Conclusion: SFMP provides an effective search-free and hardware-friendly solution for mixed-precision quantization of large language models that balances compression efficiency with practical deployment considerations.

Abstract: Mixed-precision quantization is a promising approach for compressing large language models under tight memory budgets. However, existing mixed-precision methods typically suffer from one of two limitations: they either rely on expensive discrete optimization to determine precision allocation, or introduce hardware inefficiencies due to irregular memory layouts. We propose SFMP, a search-free and hardware-friendly mixed-precision quantization framework for large language models. The framework is built upon four novel ideas: Fractional bit-width, which extends integer bit-width for weight matrix to fractional value and transforms discrete precision allocation as a continuous problem; 2)Block-wise mixed-precision, enabling fine-grained precision within weight matrices while remaining hardware-friendly; 3)Row-column weight reordering, which aggregates salient weights via row and column reordering, incurring only a small activation reordering overhead during inference; 4)Unified GEMM kernel, which supports mixed-precision GEMM at arbitrary average bit-width. Extensive experiments demonstrate that SFMP outperforms state-of-the-art layer-wise mixed-precision methods under the same memory constraints, while significantly reducing quantization cost and improving inference efficiency. Code is available at https://github.com/Nkniexin/SFMP

[1128] Adaptive Dual-Weighting Framework for Federated Learning via Out-of-Distribution Detection

Zhiwei Ling, Hailiang Zhao, Chao Zhang, Xiang Ao, Ziqi Wang, Cheng Zhang, Zhen Qin, Xinkui Zhao, Kingsum Chow, Yuanqing Wu, MengChu Zhou

Main category: cs.LG

TL;DR: FLood: A federated learning framework using OOD detection to handle non-IID data heterogeneity through dual-weighting mechanisms at client and server levels.

Details

Motivation: Real-world federated learning deployments face severe data heterogeneity from diverse users, devices, and applications, which undermines model convergence, generalization, and service quality.

Method: FLood uses out-of-distribution detection with a dual-weighting mechanism: (1) client-level adaptive reweighting of supervised loss by upweighting pseudo-OOD samples, and (2) server-level aggregation weighting based on client OOD confidence scores.

Result: Extensive experiments across multiple benchmarks under diverse non-IID settings show FLood consistently outperforms state-of-the-art FL methods in both accuracy and generalization.

Conclusion: FLood is a practical, scalable plug-in module that enhances existing FL algorithms’ performance under heterogeneity without modifying their core optimization, making it suitable for real-world federated intelligent services.

Abstract: Federated Learning (FL) enables collaborative model training across large-scale distributed service nodes while preserving data privacy, making it a cornerstone of intelligent service systems in edge-cloud environments. However, in real-world service-oriented deployments, data generated by heterogeneous users, devices, and application scenarios are inherently non-IID. This severe data heterogeneity critically undermines the convergence stability, generalization ability, and ultimately the quality of service delivered by the global model. To address this challenge, we propose FLood, a novel FL framework inspired by out-of-distribution (OOD) detection. FLood dynamically counteracts the adverse effects of heterogeneity through a dual-weighting mechanism that jointly governs local training and global aggregation. At the client level, it adaptively reweights the supervised loss by upweighting pseudo-OOD samples, thereby encouraging more robust learning from distributionally misaligned or challenging data. At the server level, it refines model aggregation by weighting client contributions according to their OOD confidence scores, prioritizing updates from clients with higher in-distribution consistency and enhancing the global model’s robustness and convergence stability. Extensive experiments across multiple benchmarks under diverse non-IID settings demonstrate that FLood consistently outperforms state-of-the-art FL methods in both accuracy and generalization. Furthermore, FLood functions as an orthogonal plug-in module: it seamlessly integrates with existing FL algorithms to boost their performance under heterogeneity without modifying their core optimization logic. These properties make FLood a practical and scalable solution for deploying reliable intelligent services in real-world federated environments.

[1129] Superposition unifies power-law training dynamics

Zixin Jessie Chen, Hao Chen, Yizhou Liu, Jeff Gore

Main category: cs.LG

TL;DR: Feature superposition induces universal power-law training dynamics with ~1 exponent, accelerating learning up to 10x compared to sequential learning without superposition.

Details

Motivation: To understand how feature superposition affects training dynamics in neural networks, particularly the emergence of power-law scaling, and to investigate whether superposition leads to universal training behaviors independent of data statistics.

Method: Uses a teacher-student framework to analyze training dynamics, derives analytic theory for training without superposition, then investigates how superposition bottleneck induces transition to universal power-law exponent.

Result: Superposition causes transition to universal power-law exponent of ~1 (one over time), independent of input data statistics and channel importance, representing up to 10x acceleration compared to sequential learning without superposition.

Conclusion: Superposition leads to rapid training with data-independent power-law exponent, which has important implications for production-scale large language models and other neural networks employing superposition.

Abstract: We investigate the role of feature superposition in the emergence of power-law training dynamics using a teacher-student framework. We first derive an analytic theory for training without superposition, establishing that the power-law training exponent depends on both the input data statistics and channel importance. Remarkably, we discover that a superposition bottleneck induces a transition to a universal power-law exponent of $\sim 1$, independent of data and channel statistics. This one over time training with superposition represents an up to tenfold acceleration compared to the purely sequential learning that takes place in the absence of superposition. Our finding that superposition leads to rapid training with a data-independent power law exponent may have important implications for a wide range of neural networks that employ superposition, including production-scale large language models.

[1130] SwiftRepertoire: Few-Shot Immune-Signature Synthesis via Dynamic Kernel Codes

Rong Fu, Wenxin Zhang, Muge Qi, Yang Li, Yabin Jin, Jiekai Wu, Jiaxuan Lu, Chunlei Meng, Youjin Wang, Zeli Su, Juntao Gao, Li Bao, Qi Zhao, Wei Luo, Simon Fong

Main category: cs.LG

TL;DR: A framework for T cell receptor repertoire analysis that enables sample-efficient adaptation to new tasks using lightweight task descriptors and compact adapter modules, without full model fine-tuning.

Details

Motivation: T cell receptor repertoire analysis offers valuable signals for disease detection and immune monitoring, but faces challenges including label sparsity, cohort heterogeneity, and computational burden when adapting large encoders to new tasks.

Method: Uses a learned dictionary of prototypes conditioned on lightweight task descriptors derived from repertoire probes and pooled embedding statistics. Synthesizes small adapter modules applied to a frozen pretrained backbone, enabling adaptation with few support examples without full fine-tuning.

Result: Enables immediate adaptation to novel tasks with only a handful of support examples, preserves interpretability through motif-aware probes and calibrated motif discovery pipeline, and links predictive decisions to sequence-level signals.

Conclusion: Provides a practical, sample-efficient, and interpretable pathway for translating repertoire-informed models into diverse clinical and research settings where labeled data are scarce and computational resources are constrained.

Abstract: Repertoire-level analysis of T cell receptors offers a biologically grounded signal for disease detection and immune monitoring, yet practical deployment is impeded by label sparsity, cohort heterogeneity, and the computational burden of adapting large encoders to new tasks. We introduce a framework that synthesizes compact task-specific parameterizations from a learned dictionary of prototypes conditioned on lightweight task descriptors derived from repertoire probes and pooled embedding statistics. This synthesis produces small adapter modules applied to a frozen pretrained backbone, enabling immediate adaptation to novel tasks with only a handful of support examples and without full model fine-tuning. The architecture preserves interpretability through motif-aware probes and a calibrated motif discovery pipeline that links predictive decisions to sequence-level signals. Together, these components yield a practical, sample-efficient, and interpretable pathway for translating repertoire-informed models into diverse clinical and research settings where labeled data are scarce and computational resources are constrained.

Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Main category: cs.LG

TL;DR: LRAgent is a KV cache sharing framework for multi-LoRA agent systems that reduces memory and compute overhead by decomposing cache into shared base components and low-rank adapter components.

Details

Motivation: Multi-LoRA agent systems suffer from substantial memory and compute overhead because each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, even though they share base model weights. Existing KV cache sharing methods overlook this multi-LoRA setting.

Method: LRAgent decomposes KV cache into shared base component from pretrained weights and adapter-dependent component from LoRA weights. It shares the base component and stores adapter component in low-rank form. Uses Flash-LoRA-Attention kernel to reorder attention computation and avoid materializing low-rank cache to full dimension.

Result: LRAgent achieves throughput and time-to-first-token latency close to fully shared caching while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.

Conclusion: LRAgent effectively addresses KV cache redundancy in multi-LoRA agent systems by exploiting cache similarity across agents, significantly reducing memory and compute overhead while maintaining accuracy.

Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only through lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, across agents, cache differences are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents that decomposes the cache into a shared base component from the pretrained weights and an adapter-dependent component from LoRA weights. LRAgent reduces memory overhead by sharing the base component and storing the adapter component in its inherent low-rank form, and further reduces compute overhead, enabled by shared-$A$ multi-LoRA architectures, by also sharing the low-rank cache and avoiding redundant computations for contexts already processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.

[1132] Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

Main category: cs.LG

TL;DR: PEAR is an SFT-stage method that reweights SFT loss using importance sampling to better prepare models for downstream RL training, addressing distribution mismatch between offline SFT data and online RL policy.

Details

Motivation: Current SFT-RL pipelines suffer from distribution mismatch: offline SFT data differs from the policy distribution learned during online RL, causing stronger SFT checkpoints to underperform weaker ones after RL training.

Method: PEAR uses importance sampling to reweight SFT loss at token, block, or sequence levels, correcting distribution mismatch by considering how offline data relates to the target RL policy distribution.

Result: PEAR consistently improves post-RL performance over standard SFT, with pass@8 gains up to 14.6% on AIME2025 mathematical reasoning tasks across Qwen 2.5/3 and DeepSeek-distilled models.

Conclusion: PEAR enables more holistic LLM post-training by designing SFT with downstream RL in mind, addressing distribution mismatch to better prepare models for reinforcement learning stages.

Abstract: Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass at 8 gains up to a 14.6 percent on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.

[1133] MarkovScale: Towards Optimal Sequential Scaling at Inference Time

Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li

Main category: cs.LG

TL;DR: MarkovScale: A principled framework modeling sequential scaling as a two-state Markov process to derive optimality bounds and achieve better accuracy-efficiency tradeoffs in LLM inference.

Details

Motivation: Current sequential scaling methods show modest improvements with heuristic approaches lacking clear optimality bounds, making performance gains poorly understood and suboptimal.

Method: Models sequential scaling as a two-state Markov process to derive closed-form solutions for accuracy improvement conditions and theoretical performance bounds, then implements MarkovScale system applying these optimality criteria.

Result: MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods across 3 backbone LLMs, 5 benchmarks, and over 20 configurations.

Conclusion: The framework provides theoretical grounding for sequential scaling optimization and represents significant progress toward optimal, resource-efficient LLM inference.

Abstract: Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This approach reveals the underlying properties of sequential scaling and yields closed-form solutions for essential aspects, such as the specific conditions under which accuracy is improved and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be open upon acceptance at https://open-upon-acceptance.

[1134] On the Expressive Power of Permutation-Equivariant Weight-Space Networks

Adir Dayan, Yam Eitan, Haggai Maron

Main category: cs.LG

TL;DR: Weight-space networks operate on neural network parameters, with permutation-equivariant designs for generalization but potential expressive power trade-offs; this paper develops systematic expressivity theory showing equivalence of prominent designs and establishing universality conditions.

Details

Motivation: The growing availability of pretrained models motivates weight-space learning, but while permutation-equivariant designs improve generalization, they may limit expressive power. Existing partial expressivity results lack comprehensive characterization, especially since weight-space networks operate on both weight and function spaces.

Method: Develops systematic theory for expressivity of weight-space networks, proving equivalence of prominent permutation-equivariant networks in expressive power, and establishing universality conditions in both weight- and function-space settings under mild assumptions on input weights.

Result: Shows all prominent permutation-equivariant weight-space networks are equivalent in expressive power, establishes universality in weight- and function-space settings under natural assumptions, and characterizes edge-case regimes where universality fails.

Conclusion: Provides unified foundation for expressivity of weight-space networks, resolving theoretical gaps and offering comprehensive characterization of their capabilities across different operational domains.

Abstract: Weight-space learning studies neural architectures that operate directly on the parameters of other neural networks. Motivated by the growing availability of pretrained models, recent work has demonstrated the effectiveness of weight-space networks across a wide range of tasks. SOTA weight-space networks rely on permutation-equivariant designs to improve generalization. However, this may negatively affect expressive power, warranting theoretical investigation. Importantly, unlike other structured domains, weight-space learning targets maps operating on both weight and function spaces, making expressivity analysis particularly subtle. While a few prior works provide partial expressivity results, a comprehensive characterization is still missing. In this work, we address this gap by developing a systematic theory for expressivity of weight-space networks. We first prove that all prominent permutation-equivariant networks are equivalent in expressive power. We then establish universality in both weight- and function-space settings under mild, natural assumptions on the input weights, and characterize the edge-case regimes where universality no longer holds. Together, these results provide a strong and unified foundation for the expressivity of weight-space networks.

[1135] SimpleGPT: Improving GPT via A Simple Normalization Strategy

Marco Chen, Xianbiao Qi, Yelin He, Jiaquan Ye, Rong Xiao

Main category: cs.LG

TL;DR: SimpleNorm normalization strategy stabilizes activation scales in Transformers, enabling 3-10x larger learning rates and better performance through reduced Hessian spectral norm.

Details

Motivation: Transformer optimization faces challenges with learning rate stability and activation scale issues; the paper aims to connect architectural design, activation scale, Hessian properties, and maximum tolerable learning rates.

Method: Introduces SimpleNorm normalization strategy to stabilize intermediate activation scales, analyzes Hessian of loss with respect to network activations, and theoretically shows SimpleNorm reduces Hessian spectral norm.

Result: SimpleGPT (SimpleNorm-based network) tolerates 3-10x larger learning rates than standard, demonstrates strong optimization stability, and achieves substantially better performance (e.g., 7B model reduces loss from 2.290 to 2.208 vs LLaMA2 with QKNorm).

Conclusion: SimpleNorm provides a simple yet effective normalization strategy that improves Transformer optimization stability and performance by enabling larger learning rates through reduced Hessian spectral norm.

Abstract: In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales 1B, 1.4B, 7B and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3$\times$-10$\times$ larger than standard convention, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.

[1136] OLion: Approaching the Hadamard Ideal by Intersecting Spectral and $\ell_{\infty}$ Implicit Biases

Zixiao Wang, Yifei Shen, Huishuai Zhang

Main category: cs.LG

TL;DR: A new optimizer called Spectral-Sign Momentum (SSM) combines spectral control from orthogonalized updates with coordinate control from sign updates, matching or outperforming AdamW and Muon in large-scale language and vision training while using less memory.

Details

Motivation: Existing optimizers inherit implicit biases from their underlying geometries. The authors aim to create an optimizer that combines the benefits of spectral control (from orthogonalized updates) and coordinate-wise control (from sign updates) to improve training efficiency and performance.

Method: SSM forms a Lion-style momentum direction, approximately orthogonalizes it via Newton-Schulz iterations, then applies entrywise sign. This approximates taking a maximal step over the intersection of spectral and ℓ∞ constraint sets. The method maintains only momentum-level optimizer state.

Result: SSM matches or outperforms AdamW and Muon across large-scale language (GPT-2, Llama pretraining) and vision (SiT image pretraining) tasks under comparable tuning. It also mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.

Conclusion: SSM provides an efficient optimizer combining spectral and coordinate control, achieving strong performance with reduced memory footprint, making it suitable for large-scale multimodal training.

Abstract: Many optimizers can be interpreted as steepest-descent methods under norm-induced geometries, and thus inherit corresponding implicit biases. We introduce \nameA{} (\fullname{}), which combines spectral control from orthogonalized update directions with $\ell_\infty$-style coordinate control from sign updates. \nameA{} forms a Lion-style momentum direction, approximately orthogonalizes it via a few Newton–Schulz iterations, and then applies an entrywise sign, providing an efficient approximation to taking a maximal step over the intersection of the spectral and $\ell_\infty$ constraint sets (a scaled Hadamard-like set for matrix parameters). Despite the strong nonlinearity of orthogonalization and sign, we prove convergence under a mild, empirically verified diagonal-isotropy assumption. Across large-scale language and vision training, including GPT-2 and Llama pretraining, SiT image pretraining, and supervised fine-tuning, \nameA{} matches or outperforms AdamW and Muon under comparable tuning while using only momentum-level optimizer state, and it mitigates optimizer mismatch when fine-tuning AdamW-pretrained checkpoints.

[1137] PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou

Main category: cs.LG

TL;DR: PolySAE extends sparse autoencoders with polynomial terms to capture feature interactions and compositional structure while maintaining interpretability, achieving better probing performance and distinguishing composition from co-occurrence.

Details

Motivation: Standard sparse autoencoders assume linear feature combinations, which cannot distinguish between compositional structure (e.g., "Starbucks" from "star" and "coffee") versus mere co-occurrence, forcing them to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents.

Method: PolySAE extends the SAE decoder with higher-order polynomial terms to model feature interactions while preserving the linear encoder essential for interpretability. It uses low-rank tensor factorization on a shared projection subspace to capture pairwise and triple feature interactions with small parameter overhead (3% on GPT2).

Result: Across four language models and three SAE variants, PolySAE achieves ~8% average improvement in probing F1 while maintaining comparable reconstruction error, and produces 2-10× larger Wasserstein distances between class-conditional feature distributions. Learned interaction weights show negligible correlation with co-occurrence frequency (r=0.06 vs. r=0.82 for SAE feature covariance).

Conclusion: Polynomial terms in PolySAE capture compositional structure (morphological binding, phrasal composition) largely independent of surface statistics, enabling better decomposition of compound concepts into interpretable constituents while maintaining SAE’s interpretability advantages.

Abstract: Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether “Starbucks” arises from the composition of “star” and “coffee” features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10$\times$ larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency ($r = 0.06$ vs. $r = 0.82$ for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.

[1138] Single-Edge Node Injection Threats to GNN-Based Security Monitoring in Industrial Graph Systems

Wenjie Liang, Ranhui Yan, Jia Cai, You-Gan Wang

Main category: cs.LG

TL;DR: SEGIA: A single-edge graph injection attack method that compromises industrial GNN systems by injecting counterfeit nodes with minimal edge connections while evading detection.

Details

Motivation: Industrial GNN systems for monitoring critical infrastructure are vulnerable to node injection attacks where adversaries compromise edge devices to inject counterfeit nodes that bias downstream decisions while evading topology- and homophily-based sanitization.

Method: Proposes Single-Edge Graph Injection Attack (SEGIA) where each injected node attaches through only one edge. Integrates pruned SGC surrogate, multi-hop neighborhood sampling, reverse graph convolution-based feature synthesis, and similarity-regularized objective to preserve local homophily and survive edge pruning.

Result: Achieves at least 25% higher attack success than representative baselines under substantially smaller edge budgets across various datasets and defenses, demonstrating significant system-level risk.

Conclusion: Reveals critical vulnerabilities in industrial GNN deployments and motivates the need for lightweight admission validation and neighborhood-consistency monitoring to mitigate such attacks.

Abstract: Graph neural networks (GNNs) are increasingly adopted in industrial graph-based monitoring systems (e.g., Industrial internet of things (IIoT) device graphs, power-grid topology models, and manufacturing communication networks) to support anomaly detection, state estimation, and asset classification. In such settings, an adversary that compromises a small number of edge devices may inject counterfeit nodes (e.g., rogue sensors, virtualized endpoints, or spoofed substations) to bias downstream decisions while evading topology- and homophily-based sanitization. This paper formulates deployment-oriented node-injection attacks under constrained resources and proposes the \emph{Single-Edge Graph Injection Attack} (SEGIA), in which each injected node attaches to the operational graph through a single edge. SEGIA integrates a pruned SGC surrogate, multi-hop neighborhood sampling, and reverse graph convolution-based feature synthesis with a similarity-regularized objective to preserve local homophily and survive edge pruning. Theoretical analysis and extensive evaluations across datasets and defenses show at least $25%$ higher attack success than representative baselines under substantially smaller edge budgets. These results indicate a system-level risk in industrial GNN deployments and motivate lightweight admission validation and neighborhood-consistency monitoring.

[1139] When Domains Interact: Asymmetric and Order-Sensitive Cross-Domain Effects in Reinforcement Learning for Reasoning

Wang Yang, Shouren Wang, Chaoda Song, Chuang Ma, Xinpeng Li, Nengbo Wang, Kaixiong Zhou, Vipin Chaudhary, Xiaotian Han

Main category: cs.LG

TL;DR: Systematic analysis of training-order effects in GRPO across math, science, logic, and puzzle reasoning tasks reveals pronounced asymmetry, order sensitivity, and strategy dependence in multi-domain training.

Details

Motivation: GRPO is important for improving reasoning in LLMs, but its behavior under different domain sequencing strategies (sequential vs mixed-domain training) is poorly understood, especially for multi-domain reasoning tasks.

Method: Systematic analysis of training-order effects across four reasoning domains (math, science, logic, puzzles) using GRPO, comparing sequential (one domain at a time) versus mixed-domain (multiple domains at a time) training strategies.

Result: (1) Single-domain generalization is asymmetric: training on other domains improves math reasoning by ~25% but yields negligible transfer to logic/puzzles; (2) Cross-domain interactions are order-dependent: math→science achieves 83%/41% accuracy while science→math degrades to 77%/25%; (3) No single optimal strategy: sequential favors math (up to 84%), mixed favors science/logic, poor ordering can cause large performance gaps (70% to 56%).

Conclusion: GRPO in multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design for optimal reasoning performance.

Abstract: Group Relative Policy Optimization (GRPO) has become a key technique for improving reasoning abilities in large language models, yet its behavior under different domain sequencing strategies is poorly understood. In particular, the impact of sequential (one domain at a time) versus mixed-domain (multiple domain at a time) training in GRPO has not been systematically studied. We provide the first systematic analysis of training-order effects across math, science, logic, and puzzle reasoning tasks. We found (1) single-domain generalization is highly asymmetric: training on other domains improves math reasoning by approximately 25% accuracy, while yielding negligible transfer to logic and puzzle; (2) cross-domain interactions are highly order-dependent: training in the order math$\rightarrow$science achieves 83% / 41% accuracy on math / science, while reversing the order to science$\rightarrow$math degrades performance to 77% / 25%; (3) no single strategy is universally optimal in multi-domain training: sequential training favors math (up to 84%), mixed training favors science and logic, and poor ordering can incur large performance gaps (from 70% to 56%). Overall, our findings demonstrate that GRPO under multi-domain settings exhibits pronounced asymmetry, order sensitivity, and strategy dependence, highlighting the necessity of domain-aware and order-aware training design.

[1140] ChronoSpike: An Adaptive Spiking Graph Neural Network for Dynamic Graphs

Md Abrar Jahin, Taufikur Rahman Fuad, Jay Pujara, Craig Knoblock

Main category: cs.LG

TL;DR: ChronoSpike: Adaptive spiking graph neural network with learnable LIF neurons, multi-head spatial attention, and lightweight Transformer temporal encoder for efficient dynamic graph representation learning.

Details

Motivation: Existing dynamic graph learning methods face trade-offs: attention-based methods have O(T²) complexity, recurrent architectures suffer from gradient issues and dense storage, while spiking neural networks have limitations like sequential propagation, binary information loss, and lack of global context.

Method: Integrates learnable LIF neurons with per-channel membrane dynamics, multi-head attentive spatial aggregation on continuous features, and lightweight Transformer temporal encoder. Combines fine-grained local modeling with long-range dependency capture while maintaining linear memory complexity O(T·d).

Result: Outperforms 12 state-of-the-art baselines by 2.0% Macro-F1 and 2.4% Micro-F1 on three large-scale benchmarks. Achieves 3-10× faster training than recurrent methods with constant 105K parameters independent of graph size. Shows 83-88% sparsity and learned primacy effect.

Conclusion: ChronoSpike provides an efficient solution for dynamic graph representation learning with theoretical guarantees for stability and boundedness, while maintaining high performance and interpretability through heterogeneous temporal receptive fields.

Abstract: Dynamic graph representation learning requires capturing both structural relationships and temporal evolution, yet existing approaches face a fundamental trade-off: attention-based methods achieve expressiveness at $O(T^2)$ complexity, while recurrent architectures suffer from gradient pathologies and dense state storage. Spiking neural networks offer event-driven efficiency but remain limited by sequential propagation, binary information loss, and local aggregation that misses global context. We propose ChronoSpike, an adaptive spiking graph neural network that integrates learnable LIF neurons with per-channel membrane dynamics, multi-head attentive spatial aggregation on continuous features, and a lightweight Transformer temporal encoder, enabling both fine-grained local modeling and long-range dependency capture with linear memory complexity $O(T \cdot d)$. On three large-scale benchmarks, ChronoSpike outperforms twelve state-of-the-art baselines by $2.0%$ Macro-F1 and $2.4%$ Micro-F1 while achieving $3-10\times$ faster training than recurrent methods with a constant 105K-parameter budget independent of graph size. We provide theoretical guarantees for membrane potential boundedness, gradient flow stability under contraction factor $ρ< 1$, and BIBO stability; interpretability analyses reveal heterogeneous temporal receptive fields and a learned primacy effect with $83-88%$ sparsity.

[1141] The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks

Donald Ye

Main category: cs.LG

TL;DR: Gradient magnitude in neural networks doesn’t reliably indicate component importance - removing low-gradient components can destroy generalization while removing high-gradient components can sometimes improve it, creating an unpredictable relationship between gradient magnitude and causal importance.

Details

Motivation: The paper investigates the paradox that gradient magnitude doesn't reliably indicate component importance in neural networks, challenging the common assumption that high-gradient components are most important for model performance.

Method: The authors formalize the Gradient-Causal Gap in Transformers trained on algorithmic tasks, measure correlation between gradient magnitude and causal importance across tasks of varying complexity, and conduct pruning experiments to test the effects of removing high-gradient vs low-gradient components.

Result: Gradient magnitude and causal importance align on simple tasks (ρ=0.73 for reversal) but collapse with increasing complexity (ρ=0.32 for sorting), sometimes becoming inverted (ρ=-0.11). Removing low-gradient “Hidden Heroes” consistently devastates OOD accuracy (-32%), while removing high-gradient “Gradient Bloats” is unpredictable - harmless in most seeds but catastrophic in others.

Conclusion: Gradient-based pruning cannot reliably preserve model capabilities due to the unpredictable relationship between gradient magnitude and causal importance, challenging common pruning approaches that rely on gradient magnitude as an importance metric.

Abstract: Removing ‘‘important’’ high-gradient components from a neural network can improve generalization, while removing unimportant’’ low-gradient components can destroy it. We demonstrate this paradox by formalizing the \textit{Gradient-Causal Gap} in Transformers trained on algorithmic tasks. While gradient magnitude and causal importance align on simple tasks ($ρ=0.73$ for reversal), this relationship collapses as task complexity increases ($ρ=0.32$ for sorting), sometimes becoming inverted ($ρ=-0.11$). Pruning experiments reveal that gradient magnitude is not merely inaccurate but \textit{unpredictably} so. Removing low-gradient ‘‘Hidden Heroes’’ consistently devastates OOD accuracy ($-32%$). Removing high-gradient ‘‘Gradient Bloats’’ is a coin flip: harmless in most seeds (indicating optimization noise), catastrophic in others (indicating overfitting circuits). This unpredictability means gradient-based pruning cannot reliably preserve model capabilities.

[1142] WinFLoRA: Incentivizing Client-Adaptive Aggregation in Federated LoRA under Privacy Heterogeneity

Mengsha Kou, Xiaoyu Xia, Ziqi Wang, Ibrahim Khalil, Runkun Luo, Jingwen Zhou, Minhui Xue

Main category: cs.LG

TL;DR: WinFLoRA: A federated learning framework for LLMs that addresses privacy heterogeneity by using noise-aware aggregation weights to incentivize lower-noise contributions while accommodating varying client privacy requirements.

Details

Motivation: In federated learning for LLMs using LoRA, clients inject varying levels of differential privacy noise based on their privacy requirements, creating privacy heterogeneity that misaligns individual incentives with global model performance.

Method: Proposes WinFLoRA, a privacy-heterogeneous federated LoRA framework that estimates client noise levels from uploaded LoRA adapters and uses aggregation weights as incentives - giving larger weights to lower-noise contributions to improve global accuracy while accommodating heterogeneous privacy needs.

Result: Extensive evaluations show WinFLoRA achieves up to 52.58% higher global accuracy and up to 2.56x client utility compared to state-of-the-art benchmarks across multiple LLMs and datasets.

Conclusion: WinFLoRA successfully aligns heterogeneous client utility (privacy vs. performance) with global model objectives without third-party involvement, providing an effective solution for privacy-heterogeneous federated learning of LLMs.

Abstract: Large Language Models (LLMs) increasingly underpin intelligent web applications, from chatbots to search and recommendation, where efficient specialization is essential. Low-Rank Adaptation (LoRA) enables such adaptation with minimal overhead, while federated LoRA allows web service providers to fine-tune shared models without data sharing. However, in privacy-sensitive deployments, clients inject varying levels of differential privacy (DP) noise, creating privacy heterogeneity that misaligns individual incentives and global performance. In this paper, we propose WinFLoRA, a privacy-heterogeneous federated LoRA that utilizes aggregation weights as incentives with noise awareness. Specifically, the noises from clients are estimated based on the uploaded LoRA adapters. A larger weight indicates greater influence on the global model and better downstream task performance, rewarding lower-noise contributions. By up-weighting low-noise updates, WinFLoRA improves global accuracy while accommodating clients’ heterogeneous privacy requirements. Consequently, WinFLoRA aligns heterogeneous client utility in terms of privacy and downstream performance with global model objectives without third-party involvement. Extensive evaluations demonstrate that across multiple LLMs and datasets, WinFLoRA achieves up to 52.58% higher global accuracy and up to 2.56x client utility than state-of-the-art benchmarks. Source code is publicly available at https://github.com/koums24/WinFLoRA.git.

[1143] A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

Akifumi Wachi, Hirota Kinoshita, Shokichi Takakura, Rei Higuchi, Taiji Suzuki

Main category: cs.LG

TL;DR: The paper proposes a “relative-budget” theory explaining RL effectiveness for LLM reasoning through a single quantity ξ = H/𝔼[T], where H is token budget and T is tokens until first correct solution.

Details

Motivation: Reinforcement learning effectiveness for improving LLM reasoning varies across tasks and compute budgets, but there's no unified theory explaining this variation.

Method: Proposes relative-budget theory with quantity ξ = H/𝔼[T], analyzes three regimes (deficient, balanced, ample), provides finite-sample guarantees, and validates with empirical studies.

Result: Identifies three regimes: deficient (ξ→0) with rare informative trajectories, balanced (ξ=Θ(1)) with maximum sample efficiency, and ample (ξ→∞) with diminishing returns. Empirical results show ξ ∈ [1.5, 2.0] maximizes learning efficiency.

Conclusion: Relative budget ξ is a key determinant of RL sample efficiency for LLM reasoning, with balanced regime (ξ=Θ(1)) being optimal for learning, providing theoretical framework for RL application to language models.

Abstract: Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a \emph{relative-budget} theory explaining this variation through a single quantity called relative budget $ξ:= H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $ξ$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($ξ\to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($ξ=Θ(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($ξ\to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $ξ\in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.

[1144] Tangent Space Fine-Tuning for Directional Preference Alignment in Large Language Models

Mete Erdogan

Main category: cs.LG

TL;DR: TS-DPO enables LLMs to balance multiple human preferences (helpfulness, safety, verbosity) through tangent-space optimization, allowing linear combination of learned preference directions at inference without retraining.

Details

Motivation: Existing preference optimization methods like DPO collapse feedback into single scalar rewards, fixing one balance among objectives and preventing traversal of the Pareto front for multi-objective alignment.

Method: Extends tangent-space formulation to preference alignment by performing DPO within locally linear regime to learn per-objective update directions that can be linearly combined at inference without additional optimization.

Result: TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO on helpfulness-verbosity trade-off using HelpSteer and UltraFeedback datasets, with CCA showing improved disentanglement of preference directions.

Conclusion: Tangent-space training enables principled and controllable alignment of LLMs to multiple human preferences, allowing flexible traversal of Pareto front through linear combination of learned preference directions.

Abstract: Our goal is to enable large language models (LLMs) to balance multiple human preference dimensions; such as helpfulness, safety, and verbosity, through principled and controllable alignment. Existing preference optimization methods, including Direct Preference Optimization (DPO), collapse feedback into a single scalar reward, fixing one balance among objectives and preventing traversal of the Pareto front. Recent work by Ortiz-Jimenez et al. (2023) showed that fine-tuning can be viewed in a model’s tangent space, where linearized updates act as additive vectors that can be composed to jointly perform well on multiple tasks. Building on this formulation, we extend this idea to preference alignment and propose Tangent-Space Direct Preference Optimization (TS-DPO), which performs DPO within this locally linear regime to learn per-objective update directions. These directions can be linearly combined at inference to generate user-specified behaviors without additional optimization. Evaluated on the helpfulness-verbosity trade-off using the HelpSteer and UltraFeedback datasets, TS-DPO achieves broader Pareto-optimal coverage and smoother preference control than scalarized DPO. Canonical Correlation Analysis (CCA) further shows that tangent-space training amplifies canonical directions aligned with distinct preferences, improving disentanglement.

[1145] TRACE: Scalable Amortized Causal Discovery from Single Sequences via Autoregressive Density Estimation

Hugo Math, Rainer Lienhart

Main category: cs.LG

TL;DR: TRACE: A scalable framework for causal discovery from single event sequences using autoregressive models as pretrained density estimators for conditional mutual information estimation.

Details

Motivation: Causal discovery from single observed sequences of discrete events (e.g., vehicle logs, manufacturing systems, patient trajectories) is challenging due to absence of repeated samples, high dimensionality, and long-range temporal dependencies. Existing methods struggle with these challenges.

Method: TRACE repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation. It infers summary causal graphs between event types, scales linearly with event vocabulary, supports delayed causal effects, and is fully parallel on GPUs.

Result: Experiments show robust performance across different baselines and varying vocabulary sizes, including successful application to root-cause analysis in vehicle diagnostics with over 29,100 event types.

Conclusion: TRACE provides a scalable solution for causal discovery from single event sequences, with theoretical identifiability guarantees and practical effectiveness demonstrated on large-scale real-world applications.

Abstract: We study causal discovery from a single observed sequence of discrete events generated by a stochastic process, as encountered in vehicle logs, manufacturing systems, or patient trajectories. This regime is particularly challenging due to the absence of repeated samples, high dimensionality, and long-range temporal dependencies of the single observation during inference. We introduce TRACE, a scalable framework that repurposes autoregressive models as pretrained density estimators for conditional mutual information estimation. TRACE infers the summary causal graph between event types in a sequence, scaling linearly with the event vocabulary and supporting delayed causal effects, while being fully parallel on GPUs. We establish its theoretical identifiability under imperfect autoregressive models. Experiments demonstrate robust performance across different baselines and varying vocabulary sizes including an application to root-cause analysis in vehicle diagnostics with over 29,100 event types.

[1146] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen

Main category: cs.LG

TL;DR: VIP is a variance-informed predictive allocation strategy for reinforcement learning that optimizes rollout budget allocation across training prompts to minimize gradient variance and improve sampling efficiency.

Details

Motivation: Existing group-based policy optimization methods allocate fixed numbers of rollouts uniformly across all training prompts, treating them as equally informative. This leads to inefficient computational budget usage and impedes training progress.

Method: VIP uses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts, translates these predictions into variance estimates, and solves a convex optimization problem to determine optimal rollout allocations under a compute budget constraint.

Result: Empirical results show VIP consistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies across multiple benchmarks.

Conclusion: VIP provides an effective variance-informed allocation strategy that optimizes computational budget usage in reinforcement learning with verifiable rewards, leading to better sampling efficiency and performance.

Abstract: Sampling efficiency is a key bottleneck in reinforcement learning with verifiable rewards. Existing group-based policy optimization methods, such as GRPO, allocate a fixed number of rollouts for all training prompts. This uniform allocation implicitly treats all prompts as equally informative, and could lead to inefficient computational budget usage and impede training progress. We introduce \Ours, a Variance-Informed Predictive allocation strategy that allocates a given rollout budget to the prompts in the incumbent batch to minimize the expected gradient variance of the policy update. At each iteration, \Oursuses a lightweight Gaussian process model to predict per-prompt success probabilities based on recent rollouts. These probability predictions are translated into variance estimates, which are then fed into a convex optimization problem to determine the optimal rollout allocations under a hard compute budget constraint. Empirical results show that \Oursconsistently improves sampling efficiency and achieves higher performance than uniform or heuristic allocation strategies in multiple benchmarks. Our code will be available at https://github.com/HieuNT91/VIP.

[1147] A Unified Matrix-Spectral Framework for Stability and Interpretability in Deep Learning

Ronald Katende

Main category: cs.LG

TL;DR: A unified matrix-spectral framework for analyzing stability and interpretability in deep neural networks using spectral quantities from Jacobians, gradients, NTK operators, and Hessians to measure sensitivity to perturbations and improve attribution robustness.

Details

Motivation: To develop a comprehensive framework for understanding and improving stability in deep neural networks by connecting spectral properties of network operators to sensitivity to input perturbations, label noise, and training dynamics.

Method: Represent networks as data-dependent products of linear operators and analyze spectral quantities from Jacobians, parameter gradients, Neural Tangent Kernel operators, and loss Hessians. Introduce Global Matrix Stability Index and spectral entropy measures to capture typical sensitivity rather than worst-case bounds.

Result: Synthetic experiments and studies on MNIST, CIFAR-10, and CIFAR-100 show that modest spectral regularization substantially improves attribution stability even when global spectral summaries change little, establishing connection between spectral concentration and analytic stability.

Conclusion: The framework provides computable diagnostics and stability-oriented regularization principles for robustness-aware model design and training, offering practical guidance for improving neural network stability.

Abstract: We develop a unified matrix-spectral framework for analyzing stability and interpretability in deep neural networks. Representing networks as data-dependent products of linear operators reveals spectral quantities governing sensitivity to input perturbations, label noise, and training dynamics. We introduce a Global Matrix Stability Index that aggregates spectral information from Jacobians, parameter gradients, Neural Tangent Kernel operators, and loss Hessians into a single stability scale controlling forward sensitivity, attribution robustness, and optimization conditioning. We further show that spectral entropy refines classical operator-norm bounds by capturing typical, rather than purely worst-case, sensitivity. These quantities yield computable diagnostics and stability-oriented regularization principles. Synthetic experiments and controlled studies on MNIST, CIFAR-10, and CIFAR-100 confirm that modest spectral regularization substantially improves attribution stability even when global spectral summaries change little. The results establish a precise connection between spectral concentration and analytic stability, providing practical guidance for robustness-aware model design and training.

[1148] $\textbf{AGT$^{AO}$}$: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality

Pengyu Li, Lingling Zhang, Zhitao Gao, Yanrui Wu, Yuxuan Dong, Huan Liu, Bifan Wei, Jun Liu

Main category: cs.LG

TL;DR: A novel framework called AGT^AO that addresses the trade-off between effective unlearning of sensitive data and preserving model utility in LLMs through adaptive orthogonality and adversarial gating training.

Details

Motivation: LLMs unintentionally memorize sensitive data, creating privacy and security risks. Existing unlearning methods face a dilemma: aggressive unlearning causes catastrophic forgetting (degrading utility) while conservative approaches risk superficial forgetting (leaving models vulnerable to recovery).

Method: AGT^AO combines Adaptive Orthogonality (AO) to dynamically mitigate gradient conflicts between forgetting and retention objectives, and Adversarial Gating Training (AGT) that formulates unlearning as a latent-space min-max game with curriculum-based gating to counter internal recovery attempts.

Result: The framework achieves superior trade-off between unlearning efficacy (KUR ≈ 0.01) and model utility (MMLU 58.30), demonstrating effective sensitive data removal while preserving language understanding capabilities.

Conclusion: AGT^AO successfully reconciles robust erasure with utility preservation in LLMs, addressing the fundamental trade-off in machine unlearning through geometric gradient conflict mitigation and adversarial training strategies.

Abstract: While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose $\textbf{AGT$^{AO}$}$ (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces $\textbf{Adaptive Orthogonality (AO)}$ to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, $\textbf{Adversarial Gating Training (AGT)}$ formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that $\textbf{AGT$^{AO}$}$ achieves a superior trade-off between unlearning efficacy (KUR $\approx$ 0.01) and model utility (MMLU 58.30). Code is available at https://github.com/TiezMind/AGT-unlearning.

[1149] Self-Generative Adversarial Fine-Tuning for Large Language Models

Shiguang Wu, Yaqing Wang, Quanming Yao

Main category: cs.LG

TL;DR: SGALM is a self-generative adversarial framework for LLM alignment that evolves generation and discrimination capabilities within a single model without external reward models.

Details

Motivation: Current LLM alignment methods rely on costly human annotations or heuristic synthetic data approaches that can cause bias accumulation and performance drift. There's a need for more efficient, self-contained alignment frameworks.

Method: Proposes Self-Generative Adversarial LLM (SGALM) that formulates alignment as a generative adversarial game within a single LLM, jointly evolving generation and discrimination capabilities without external reward models.

Result: Achieves state-of-the-art performance, serves as an effective alignment algorithm and a robust synthetic data engine according to theoretical and empirical results.

Conclusion: SGALM provides a unified fine-tuning framework that reduces dependence on costly human annotations while avoiding bias accumulation issues of heuristic approaches.

Abstract: Fine-tuning large language models (LLMs) for alignment typically relies on supervised fine-tuning or reinforcement learning from human feedback, both limited by the cost and scarcity of high-quality annotations. Recent self-play and synthetic data approaches reduce this dependence but often rely on heuristic assumptions or ungrounded self-evaluation, which can cause bias accumulation and performance drift. In this paper, we propose Self-Generative Adversarial LLM (SGALM), a unified fine-tuning framework that formulates alignment as a generative adversarial game within a single LLM. SGALM jointly evolves generation and discrimination capabilities without external reward models. Theoretical and empirical results demonstrate that SGALM achieves state-of-the-art performance, serves as an effective alignment algorithm and a robust synthetic data engine.

[1150] Key Principles of Graph Machine Learning: Representation, Robustness, and Generalization

Yassine Abbahaddou

Main category: cs.LG

TL;DR: This dissertation addresses three key challenges in Graph Neural Networks: representation learning, generalization, and robustness, through novel techniques involving Graph Shift Operators, data augmentation, and orthonormalization defenses.

Details

Motivation: GNNs face limitations in generalization, robustness to adversarial attacks, and representation learning effectiveness, which this research aims to address through principled approaches.

Method: Three main contributions: (1) new representation learning techniques using Graph Shift Operators, (2) generalization enhancement through graph data augmentation, and (3) robust GNN development via orthonormalization techniques and noise-based defenses against adversarial attacks.

Result: The work provides a more principled understanding of GNN limitations and potential, though specific quantitative results are not detailed in the abstract.

Conclusion: The dissertation systematically addresses core challenges in GNNs through novel techniques, advancing the theoretical and practical understanding of graph neural networks.

Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning representations from structured data. Despite their growing popularity and success across various applications, GNNs encounter several challenges that limit their performance. in their generalization, robustness to adversarial perturbations, and the effectiveness of their representation learning capabilities. In this dissertation, I investigate these core aspects through three main contributions: (1) developing new representation learning techniques based on Graph Shift Operators (GSOs, aiming for enhanced performance across various contexts and applications, (2) introducing generalization-enhancing methods through graph data augmentation, and (3) developing more robust GNNs by leveraging orthonormalization techniques and noise-based defenses against adversarial attacks. By addressing these challenges, my work provides a more principled understanding of the limitations and potential of GNNs.

[1151] Dissecting Outlier Dynamics in LLM NVFP4 Pretraining

Peijie Dong, Ruibo Fan, Yuechen Tao, Di Mou, Wenhu Hu, Zhenheng Tang, Yinghao Yu, Jiamang Wang, Wenbo Su, Guodong Yang, Liping Zhang, Xiaowen Chu, Baochun Li, Bo Li

Main category: cs.LG

TL;DR: Analysis of outlier dynamics in 4-bit quantization training reveals persistent hot channels in later training stages; Hot-Channel Patch (HCP) compensation mechanism reduces loss gap to BF16 from 0.94% to 0.58%.

Details

Motivation: 4-bit arithmetic training improves throughput and memory efficiency but suffers from limited dynamic range and sensitivity to outliers. While NVFP4 reduces quantization error, a persistent loss gap remains compared to BF16, motivating analysis of outlier dynamics and compensation mechanisms.

Method: Conducted longitudinal analysis of outlier dynamics during NVFP4 pretraining, identifying where outliers localize and how they evolve. Found persistent hot channels in later training stages. Introduced Hot-Channel Patch (HCP) - an online compensation mechanism that identifies hot channels and reinjects residuals using hardware-efficient kernels. Developed CHON training recipe integrating HCP with post-QK operation protection.

Result: On GLA-1.3B model trained for 60B tokens, CHON reduced the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy. Identified outliers in specific architectural components: Softmax in SA, gating in LA, and SwiGLU in FFN, with “post-QK” operations showing higher quantization sensitivity.

Conclusion: Outlier analysis reveals persistent hot channels in later training stages that can be effectively compensated via HCP. The CHON recipe successfully reduces quantization loss gap while maintaining model accuracy, advancing 4-bit training efficiency.

Abstract: Training large language models using 4-bit arithmetic enhances throughput and memory efficiency. Yet, the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains compared to BF16. This study conducts a longitudinal analysis of outlier dynamics across architecture during NVFP4 pretraining, focusing on where they localize, why they occur, and how they evolve temporally. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in FFN, with “post-QK” operations exhibiting higher sensitivity to quantization. Notably, outliers evolve from transient spikes early in training to a small set of persistent hot channels (i.e., channels with persistently large magnitudes) in later stages. Based on these findings, we introduce Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects residuals using hardware-efficient kernels. We then develop CHON, an NVFP4 training recipe integrating HCP with post-QK operation protection. On GLA-1.3B model trained for 60B tokens, CHON reduces the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.

[1152] Generalized Radius and Integrated Codebook Transforms for Differentiable Vector Quantization

Haochen You, Heng Zhang, Hongyang He, Yuqi Li, Baojing Liu

Main category: cs.LG

TL;DR: GRIT-VQ introduces a differentiable vector quantization framework that replaces heuristic straight-through estimators with radius-based updates and integrated transforms for stable training and better codebook utilization.

Details

Motivation: Current vector quantization methods use non-differentiable nearest-neighbor assignments with heuristic straight-through estimators, causing unstable gradients, poor codebook utilization, and coupling of update step size to quantization gap.

Method: GRIT-VQ uses radius-based updates that move latents along quantization direction with controllable geometry-aware steps, and applies data-agnostic integrated transforms to codebooks so all codes update through shared parameters rather than independently.

Result: GRIT-VQ consistently improves reconstruction error, generative quality, and recommendation accuracy while substantially increasing codebook utilization across image reconstruction, image generation, and recommendation tokenization benchmarks.

Conclusion: GRIT-VQ provides a unified differentiable framework for vector quantization that enables stable training, coordinated codebook evolution, and avoids collapse while improving performance across multiple domains.

Abstract: Vector quantization (VQ) underpins modern generative and representation models by turning continuous latents into discrete tokens. Yet hard nearest-neighbor assignments are non-differentiable and are typically optimized with heuristic straight-through estimators, which couple the update step size to the quantization gap and train each code in isolation, leading to unstable gradients and severe codebook under-utilization at scale. In this paper, we introduce GRIT-VQ (Generalized Radius and Integrated Transform-Vector Quantization), a unified surrogate framework that keeps hard assignments in the forward pass while making VQ fully differentiable. GRIT-VQ replaces the straight-through estimator with a radius-based update that moves latents along the quantization direction with a controllable, geometry-aware step, and applies a data-agnostic integrated transform to the codebook so that all codes are updated through shared parameters instead of independently. Our theoretical analysis clarifies the fundamental optimization dynamics introduced by GRIT-VQ, establishing conditions for stable gradient flow, coordinated codebook evolution, and reliable avoidance of collapse across a broad family of quantizers. Across image reconstruction, image generation, and recommendation tokenization benchmarks, GRIT-VQ consistently improves reconstruction error, generative quality, and recommendation accuracy while substantially increasing codebook utilization compared to existing VQ variants.

[1153] No Global Plan in Chain-of-Thought: Uncover the Latent Planning Horizon of LLMs

Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

Main category: cs.LG

TL;DR: The paper investigates latent planning in LLMs during Chain-of-Thought reasoning, finding models have myopic horizons and proposing methods to enhance uncertainty estimation and recognize when CoT can be bypassed.

Details

Motivation: To understand the relationship between LLMs' internal states and their verbalized reasoning trajectories, particularly examining whether LLMs engage in latent planning before Chain-of-Thought reasoning emerges and how this affects reasoning performance.

Method: Developed Tele-Lens, a probing method applied to hidden states across diverse task domains to measure latent planning strength. Analyzed LLM behavior to identify myopic horizons and incremental transitions, then proposed methods for uncertainty estimation and CoT bypass recognition.

Result: Found that LLMs exhibit myopic horizons with primarily incremental transitions rather than precise global planning. Showed that a small subset of CoT positions can effectively represent uncertainty of the entire reasoning path, and demonstrated automatic CoT bypass recognition without performance degradation.

Conclusion: Understanding CoT dynamics reveals LLMs’ limited planning capabilities, enabling more efficient uncertainty estimation and selective CoT usage. Exploiting these dynamics can lead to performance improvements without requiring full reasoning chains.

Abstract: This work stems from prior complementary observations on the dynamics of Chain-of-Thought (CoT): Large Language Models (LLMs) is shown latent planning of subsequent reasoning prior to CoT emergence, thereby diminishing the significance of explicit CoT; whereas CoT remains critical for tasks requiring multi-step reasoning. To deepen the understanding between LLM’s internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs, through our probing method, Tele-Lens, applying to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis on enhancing uncertainty estimation of CoT, which we validate that a small subset of CoT positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.

[1154] Statistical MIA: Rethinking Membership Inference Attack for Reliable Unlearning Auditing

Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Guanheng Wang, Chengyang Dong, Jialong Li, Bo Liu

Main category: cs.LG

TL;DR: SMIA is a training-free statistical framework for reliable machine unlearning auditing that replaces traditional MIA approaches, providing confidence intervals for forgetting rates without shadow model training.

Details

Motivation: Current machine unlearning auditing relies on Membership Inference Attacks (MIAs), which are flawed because failed membership inference doesn't guarantee true forgetting, leads to statistical errors that can't be observed, and requires expensive shadow model training.

Method: Proposes Statistical Membership Inference Attack (SMIA) - a training-free framework that directly compares member and non-member data distributions using statistical tests, eliminating learned attack models and providing both forgetting rates and confidence intervals.

Result: Extensive experiments show SMIA provides more reliable auditing with significantly lower computational cost than existing MIA approaches, offering theoretical guarantees and empirical effectiveness.

Conclusion: SMIA represents a new paradigm for reliable machine unlearning auditing, addressing fundamental limitations of MIA-based approaches through statistical methods that provide quantified reliability.

Abstract: Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearning auditing, where samples that evade membership detection are often regarded as successfully forgotten. After carefully revisiting the reliability of MIA, we show that this assumption is flawed: failed membership inference does not imply true forgetting. We theoretically demonstrate that MIA-based auditing, when formulated as a binary classification problem, inevitably incurs statistical errors whose magnitude cannot be observed during the auditing process. This leads to overly optimistic evaluations of unlearning performance, while incurring substantial computational overhead due to shadow model training. To address these limitations, we propose Statistical Membership Inference Attack (SMIA), a novel training-free and highly effective auditing framework. SMIA directly compares the distributions of member and non-member data using statistical tests, eliminating the need for learned attack models. Moreover, SMIA outputs both a forgetting rate and a corresponding confidence interval, enabling quantified reliability of the auditing results. Extensive experiments show that SMIA provides more reliable auditing with significantly lower computational cost than existing MIA-based approaches. Notably, the theoretical guarantees and empirical effectiveness of SMIA suggest it as a new paradigm for reliable machine unlearning auditing.

[1155] Unifying Masked Diffusion Models with Various Generation Orders and Beyond

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

Main category: cs.LG

TL;DR: OeMDM and LoMDM are diffusion models for text generation that learn optimal generation orderings rather than using fixed patterns, outperforming existing discrete diffusion models.

Details

Motivation: Masked diffusion models for text generation depend critically on generation order, but prior work either uses hard-coded orderings or learns ordering policies separately from model training, leading to suboptimal solutions and extra computational cost.

Method: Proposes OeMDM (order-expressive masked diffusion model) as a unified framework for various generation orders, then introduces LoMDM (learnable-order masked diffusion model) that jointly learns generation ordering and diffusion backbone through a single objective from scratch.

Result: LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks, demonstrating the effectiveness of learning context-dependent generation orderings.

Conclusion: Jointly learning generation ordering with the diffusion model enables more flexible and effective text generation, providing a unified framework that encompasses various existing approaches.

Abstract: Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

[1156] PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning

Shunpeng Yang, Ben Liu, Hua Chen

Main category: cs.LG

TL;DR: PolicyFlow: A novel on-policy RL algorithm that integrates expressive continuous normalizing flow (CNF) policies with PPO-style objectives without requiring expensive likelihood evaluation, using velocity field variations and Brownian regularization.

Details

Motivation: Extending PPO to expressive flow-based policies is challenging due to computational expense and numerical instability of likelihood evaluation along full flow trajectories. Current methods struggle with high-capacity policy models like continuous normalizing flows.

Method: PolicyFlow approximates importance ratios using velocity field variations along simple interpolation paths instead of full likelihood evaluation. Introduces Brownian Regularizer for implicit policy entropy regularization to prevent mode collapse and encourage diverse behaviors.

Result: Achieves competitive or superior performance compared to PPO with Gaussian policies and flow-based baselines (FPO, DPPO) across MultiGoal, PointMaze, IsaacLab, and MuJoCo Playground environments. Particularly excels at capturing multimodal action distributions in MultiGoal tasks.

Conclusion: PolicyFlow successfully integrates expressive CNF policies with PPO-style objectives through efficient importance ratio approximation and Brownian regularization, enabling stable training with high-capacity policy models while maintaining or improving performance.

Abstract: Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) demonstrates is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihood that is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To further prevent mode collapse and further encourage diverse behaviors, we propose the Brownian Regularizer, an implicit policy entropy regularizer inspired by Brownian motion, which is conceptually elegant and computationally lightweight. Experiments on diverse tasks across various environments including MultiGoal, PointMaze, IsaacLab and MuJoCo Playground show that PolicyFlow achieves competitive or superior performance compared to PPO using Gaussian policies and flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow’s ability to capture richer multimodal action distributions.

[1157] EvoMU: Evolutionary Machine Unlearning

Pawel Batorski, Paul Swoboda

Main category: cs.LG

TL;DR: EvoMU uses evolutionary search to automatically discover optimal loss functions for machine unlearning tasks, outperforming existing methods while using a small 4B parameter model.

Details

Motivation: Machine unlearning needs effective loss functions, but the search space is vast and no universal optimal loss exists due to dataset differences. Current approaches require manual design and may not adapt well to different forget/retain data structures.

Method: Evolutionary search procedure automatically finds task-specific unlearning loss functions in the vast space of possible formulations, using a small 4B parameter model (Qwen3-4B-Thinking) to achieve state-of-the-art results on a computational budget.

Result: EvoMU surpasses previous loss-based unlearning formulations on TOFU-5%, TOFU-10%, MUSE and WMDP benchmarks by synthesizing novel unlearning losses, demonstrating the potential of AI co-scientists with limited computational resources.

Conclusion: Evolutionary search enables automatic discovery of effective unlearning loss functions tailored to specific datasets, advancing machine unlearning while showing that AI co-scientists can achieve state-of-the-art results with modest computational resources.

Abstract: Machine unlearning aims to unlearn specified training data (e.g. sensitive or copyrighted material). A prominent approach is to fine-tune an existing model with an unlearning loss that retains overall utility. The space of suitable unlearning loss functions is vast, making the search for an optimal loss function daunting. Additionally, there might not even exist a universally optimal loss function: differences in the structure and overlap of the forget and retain data can cause a loss to work well in one setting but over-unlearn or under-unlearn in another. Our approach EvoMU tackles these two challenges simultaneously. An evolutionary search procedure automatically finds task-specific losses in the vast space of possible unlearning loss functions. This allows us to find dataset-specific losses that match or outperform existing losses from the literature, without the need for a human-in-the-loop. This work is therefore an instance of automatic scientific discovery, a.k.a. an AI co-scientist. In contrast to previous AI co-scientist works, we do so on a budget: We achieve SotA results using a small 4B parameter model (Qwen3-4B-Thinking), showing the potential of AI co-scientists with limited computational resources. Our experimental evaluation shows that we surpass previous loss-based unlearning formulations on TOFU-5%, TOFU-10%, MUSE and WMDP by synthesizing novel unlearning losses. Our code is available at https://github.com/Batorskq/EvoMU.

[1158] Multi-Horizon Electricity Price Forecasting with Deep Learning in the Australian National Electricity Market

Mohammed Osman Gani, Zhipeng He, Chun Ouyang, Sara Khalifa

Main category: cs.LG

TL;DR: A novel electricity price forecasting framework using state-of-the-art deep learning models for multi-day-ahead predictions with comprehensive intraday interval-level evaluation across Australian electricity markets.

Details

Motivation: Electricity price forecasting is challenging due to volatility, heavy-tailed spikes, and regime shifts. Current deep learning approaches have gaps: limited multi-day horizon forecasting, insufficient exploration of SOTA time series models, and aggregated evaluation that obscures time-of-day variations.

Method: Proposed a novel EPF framework extending forecast horizons to multi-day-ahead using benchmarked SOTA time series DL models. Conducted comprehensive evaluation at intraday interval levels across all five regions in the Australian National Electricity Market.

Result: No single model consistently dominates across regions, metrics, and horizons. Standard DL models deliver superior performance in most regions, while SOTA time series DL models show greater robustness to forecast horizon extension. Intraday evaluation reveals diurnal error patterns with peaks during evening ramp, midday negative-price regimes, and trend change periods.

Conclusion: Future DL-based EPF research should focus on enriched feature representations and modeling strategies that enhance longer-term forecasting robustness while maintaining sensitivity to intraday volatility and structural price dynamics.

Abstract: Accurate electricity price forecasting (EPF) is essential for operational planning, trading, and flexible asset scheduling in liberalised power systems, yet remains challenging due to volatility, heavy-tailed spikes, and frequent regime shifts. While deep learning (DL) has been increasingly adopted in EPF to capture complex and nonlinear price dynamics, several important gaps persist: (i) limited attention to multi-day horizons beyond day-ahead forecasting, (ii) insufficient exploration of state-of-the-art (SOTA) time series DL models, and (iii) a predominant reliance on aggregated horizon-level evaluation that obscures time-of-day forecasting variation. To address these gaps, we propose a novel EPF framework that extends the forecast horizon to multi-day-ahead by systematically building forecasting models that leverage benchmarked SOTA time series DL models. We conduct a comprehensive evaluation to analyse time-of-day forecasting performance by integrating model assessment at intraday interval levels across all five regions in the Australian National Electricity Market (NEM). The results show that no single model consistently dominates across regions, metrics, and horizons. Overall, standard DL models deliver superior performance in most regions, while SOTA time series DL models demonstrate greater robustness to forecast horizon extension. Intraday interval-level evaluation reveals pronounced diurnal error patterns, indicating that absolute errors peak during the evening ramp, relative errors inflate during midday negative-price regimes, and directional accuracy degrades during periods of frequent trend changes. These findings suggest that future research on DL-based EPF can benefit from enriched feature representations and modelling strategies that enhance longer-term forecasting robustness while maintaining sensitivity to intraday volatility and structural price dynamics.

[1159] Learning Generative Selection for Best-of-N

Shubham Toshniwal, Aleksander Ficek, Siddhartha Jain, Wei Du, Vahid Noroozi, Sadegh Mahdavi, Somshubra Majumdar, Igor Gitman

Main category: cs.LG

TL;DR: Small reasoning models can achieve strong generative selection capabilities through targeted reinforcement learning on synthesized selection tasks from math and code datasets, enabling efficient test-time compute scaling.

Details

Motivation: Scaling test-time compute via parallel sampling is limited by Best-of-N selection quality. While generative selection methods like GenSelect address this bottleneck, strong selection performance remains largely limited to large models. The paper aims to enable small models to acquire strong GenSelect capabilities.

Method: Synthesize selection tasks from large-scale math and code instruction datasets by filtering instances with both correct and incorrect candidate solutions. Train 1.7B-parameter models with DAPO (Direct Alignment Preference Optimization) to reward correct selections. Evaluate across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks.

Result: The trained small models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models.

Conclusion: Reinforcement learning is established as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling through improved selection quality.

Abstract: Scaling test-time compute via parallel sampling can substantially improve LLM reasoning, but is often limited by Best-of-N selection quality. Generative selection methods, such as GenSelect, address this bottleneck, yet strong selection performance remains largely limited to large models. We show that small reasoning models can acquire strong GenSelect capabilities through targeted reinforcement learning. To this end, we synthesize selection tasks from large-scale math and code instruction datasets by filtering to instances with both correct and incorrect candidate solutions, and train 1.7B-parameter models with DAPO to reward correct selections. Across math (AIME24, AIME25, HMMT25) and code (LiveCodeBench) reasoning benchmarks, our models consistently outperform prompting and majority-voting baselines, often approaching or exceeding much larger models. Moreover, these gains generalize to selecting outputs from stronger models despite training only on outputs from weaker models. Overall, our results establish reinforcement learning as a scalable way to unlock strong generative selection in small models, enabling efficient test-time scaling.

[1160] Multi-Fidelity Physics-Informed Neural Networks with Bayesian Uncertainty Quantification and Adaptive Residual Learning for Efficient Solution of Parametric Partial Differential Equations

Olaf Yunus Laitinen Imanov

Main category: cs.LG

TL;DR: MF-BPINN: A multi-fidelity Bayesian physics-informed neural network framework that combines low-fidelity simulations with sparse high-fidelity data using hierarchical neural architecture and adaptive residual learning with Bayesian uncertainty quantification.

Details

Motivation: PINNs are powerful for solving PDEs but computationally prohibitive for high-fidelity parametric systems requiring multiple evaluations across varying parameters. Need efficient framework leveraging abundant low-fidelity data with sparse high-fidelity measurements.

Method: Multi-fidelity framework combining PINNs with Bayesian uncertainty quantification and adaptive residual learning. Uses hierarchical neural architecture to learn nonlinear correlations across fidelity levels. Introduces adaptive residual network with learnable gating mechanisms to dynamically balance linear/nonlinear fidelity discrepancies. Employs Hamiltonian Monte Carlo for rigorous Bayesian inference.

Result: Not specified in abstract - would need full paper for experimental results.

Conclusion: MF-BPINN provides a novel approach to efficiently solve high-fidelity parametric PDEs by leveraging multi-fidelity data with uncertainty quantification and adaptive learning mechanisms.

Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful paradigm for solving partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, solving high-fidelity PDEs remains computationally prohibitive, particularly for parametric systems requiring multiple evaluations across varying parameter configurations. This paper presents MF-BPINN, a novel multi-fidelity framework that synergistically combines physics-informed neural networks with Bayesian uncertainty quantification and adaptive residual learning. Our approach leverages abundant low-fidelity simulations alongside sparse high-fidelity data through a hierarchical neural architecture that learns nonlinear correlations across fidelity levels. We introduce an adaptive residual network with learnable gating mechanisms that dynamically balances linear and nonlinear fidelity discrepancies. Furthermore, we develop a rigorous Bayesian framework employing Hamiltonian Monte Carlo.

[1161] Revisiting Adaptive Rounding with Vectorized Reparameterization for LLM Quantization

Yuli Zhou, Qingxuan Chen, Luca Benini, Guolei Sun, Yawei Li

Main category: cs.LG

TL;DR: VQRound is a parameter-efficient quantization method that uses compact codebooks to optimize rounding matrices for LLM quantization, achieving better convergence with only 0.2% trainable parameters.

Details

Motivation: Adaptive rounding methods for post-training quantization are computationally expensive for billion-parameter LLMs due to dense rounding matrices. There's a need for efficient quantization methods that can handle heavy-tailed weight distributions in LLMs while maintaining accuracy.

Method: VQRound reparameterizes the rounding matrix into a compact codebook, minimizing element-wise worst-case error under L∞ norm. It uses a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers with only 128 samples, making it parameter-efficient.

Result: Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models show VQRound achieves better convergence than traditional adaptive rounding at the same number of steps while using as little as 0.2% of trainable parameters.

Conclusion: Adaptive rounding can be made both scalable and fast-fitting through parameter-efficient optimization frameworks like VQRound, which enables efficient quantization of large language models.

Abstract: Adaptive Rounding has emerged as an alternative to round-to-nearest (RTN) for post-training quantization by enabling cross-element error cancellation. Yet, dense and element-wise rounding matrices are prohibitively expensive for billion-parameter large language models (LLMs). We revisit adaptive rounding from an efficiency perspective and propose VQRound, a parameter-efficient optimization framework that reparameterizes the rounding matrix into a compact codebook. Unlike low-rank alternatives, VQRound minimizes the element-wise worst-case error under $L_\infty$ norm, which is critical for handling heavy-tailed weight distributions in LLMs. Beyond reparameterization, we identify rounding initialization as a decisive factor and develop a lightweight end-to-end finetuning pipeline that optimizes codebooks across all layers using only 128 samples. Extensive experiments on OPT, LLaMA, LLaMA2, and Qwen3 models demonstrate that VQRound achieves better convergence than traditional adaptive rounding at the same number of steps while using as little as 0.2% of the trainable parameters. Our results show that adaptive rounding can be made both scalable and fast-fitting. The code is available at https://github.com/zhoustan/VQRound.

[1162] Rethinking the Flow-Based Gradual Domain Adaption: A Semi-Dual Optimal Transport Perspective

Zhichao Chen, Zhan Zhuang, Yunfei Teng, Hao Wang, Fangyikang Wang, Zhengnan Li, Tianqiao Liu, Haoxuan Li, Zhouchen Lin

Main category: cs.LG

TL;DR: Proposes E-SUOT framework for gradual domain adaptation using entropy-regularized optimal transport to construct intermediate domains without needing real intermediate data or likelihood estimation.

Details

Motivation: Gradual domain adaptation requires intermediate domains to bridge source and target domains, but real intermediate domains are often unavailable or ineffective. Existing flow-based methods use likelihood estimation which discards useful information and degrades performance.

Method: Proposes Entropy-regularized Semi-dual Unbalanced Optimal Transport (E-SUOT) framework that reformulates flow-based GDA as a Lagrangian dual problem with an equivalent semi-dual objective avoiding likelihood estimation. Uses entropy regularization to convert unstable min-max training into stable alternative optimization.

Result: Extensive experiments demonstrate the efficacy of the E-SUOT framework, with theoretical analysis provided for stability and generalization.

Conclusion: E-SUOT provides an effective framework for gradual domain adaptation by constructing intermediate domains through optimal transport without needing real intermediate data or likelihood estimation, addressing limitations of existing flow-based methods.

Abstract: Gradual domain adaptation (GDA) aims to mitigate domain shift by progressively adapting models from the source domain to the target domain via intermediate domains. However, real intermediate domains are often unavailable or ineffective, necessitating the synthesis of intermediate samples. Flow-based models have recently been used for this purpose by interpolating between source and target distributions; however, their training typically relies on sample-based log-likelihood estimation, which can discard useful information and thus degrade GDA performance. The key to addressing this limitation is constructing the intermediate domains via samples directly. To this end, we propose an Entropy-regularized Semi-dual Unbalanced Optimal Transport (E-SUOT) framework to construct intermediate domains. Specifically, we reformulate flow-based GDA as a Lagrangian dual problem and derive an equivalent semi-dual objective that circumvents the need for likelihood estimation. However, the dual problem leads to an unstable min-max training procedure. To alleviate this issue, we further introduce entropy regularization to convert it into a more stable alternative optimization procedure. Based on this, we propose a novel GDA training framework and provide theoretical analysis in terms of stability and generalization. Finally, extensive experiments are conducted to demonstrate the efficacy of the E-SUOT framework.

[1163] Analyzing and Improving Diffusion Models for Time-Series Data Imputation: A Proximal Recursion Perspective

Zhichao Chen, Hao Wang, Fangyikang Wang, Licheng Pan, Zhengnan Li, Yunfei Teng, Haoxuan Li, Zhouchen Lin

Main category: cs.LG

TL;DR: SPIRIT is a novel diffusion-based framework for time-series data imputation that addresses non-stationary temporal dynamics and objective inconsistency through semi-proximal transport regularization.

Details

Motivation: Current diffusion models for time-series imputation suffer from inconsistent performance due to non-stationary temporal dynamics (biasing inference and causing outlier-sensitive imputations) and objective inconsistency (imputation requires accurate pointwise recovery while DMs generate diverse samples).

Method: Analyzed DM-based TSDI through proximal-operator perspective, identified implicit Wasserstein distance regularization as problematic, proposed SPIRIT framework using entropy-induced Bregman divergence to relax mass preserving constraint, formulated semi-proximal transport discrepancy, and derived complete workflow with SPT as proximal operator.

Result: Extensive experiments demonstrate the effectiveness of SPIRIT approach, showing improved performance in time-series imputation tasks compared to existing diffusion-based methods.

Conclusion: SPIRIT successfully addresses limitations of diffusion models for time-series imputation by introducing semi-proximal transport regularization that balances diversity and fidelity while handling non-stationary temporal dynamics.

Abstract: Diffusion models (DMs) have shown promise for Time-Series Data Imputation (TSDI); however, their performance remains inconsistent in complex scenarios. We attribute this to two primary obstacles: (1) non-stationary temporal dynamics, which can bias the inference trajectory and lead to outlier-sensitive imputations; and (2) objective inconsistency, since imputation favors accurate pointwise recovery whereas DMs are inherently trained to generate diverse samples. To better understand these issues, we analyze DM-based TSDI process through a proximal-operator perspective and uncover that an implicit Wasserstein distance regularization inherent in the process hinders the model’s ability to counteract non-stationarity and dissipative regularizer, thereby amplifying diversity at the expense of fidelity. Building on this insight, we propose a novel framework called SPIRIT (Semi-Proximal Transport Regularized time-series Imputation). Specifically, we introduce entropy-induced Bregman divergence to relax the mass preserving constraint in the Wasserstein distance, formulate the semi-proximal transport (SPT) discrepancy, and theoretically prove the robustness of SPT against non-stationarity. Subsequently, we remove the dissipative structure and derive the complete SPIRIT workflow, with SPT serving as the proximal operator. Extensive experiments demonstrate the effectiveness of the proposed SPIRIT approach.

[1164] Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, Dapeng Wu

Main category: cs.LG

TL;DR: CurioSFT is an entropy-preserving supervised fine-tuning method that enhances exploration capabilities through intrinsic curiosity, improving both SFT and subsequent RL performance on reasoning tasks.

Details

Motivation: Standard SFT-then-RL pipeline limits RL benefits because SFT causes overconfidence and reduces generation diversity, narrowing RL's exploration space. Existing entropy regularization methods flatten distributions without improving meaningful exploration.

Method: CurioSFT uses Self-Exploratory Distillation (distilling toward a self-generated, temperature-scaled teacher) and Entropy-Guided Temperature Selection (adaptively adjusting distillation strength to amplify exploration at reasoning tokens while stabilizing factual tokens).

Result: On mathematical reasoning tasks, CurioSFT outperforms vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. Exploration capabilities preserved during SFT translate to 5.0 point average improvement in RL stage.

Conclusion: CurioSFT effectively preserves exploration capabilities during SFT, enabling better subsequent RL performance by maintaining diversity and preventing overconfidence in reasoning models.

Abstract: The standard post-training recipe for large reasoning models, supervised fine-tuning followed by reinforcement learning (SFT-then-RL), may limit the benefits of the RL stage: while SFT imitates expert demonstrations, it often causes overconfidence and reduces generation diversity, leaving RL with a narrowed solution space to explore. Adding entropy regularization during SFT is not a cure-all; it tends to flatten token distributions toward uniformity, increasing entropy without improving meaningful exploration capability. In this paper, we propose CurioSFT, an entropy-preserving SFT method designed to enhance exploration capabilities through intrinsic curiosity. It consists of (a) Self-Exploratory Distillation, which distills the model toward a self-generated, temperature-scaled teacher to encourage exploration within its capability; and (b) Entropy-Guided Temperature Selection, which adaptively adjusts distillation strength to mitigate knowledge forgetting by amplifying exploration at reasoning tokens while stabilizing factual tokens. Extensive experiments on mathematical reasoning tasks demonstrate that, in SFT stage, CurioSFT outperforms the vanilla SFT by 2.5 points on in-distribution tasks and 2.9 points on out-of-distribution tasks. We also verify that exploration capabilities preserved during SFT successfully translate into concrete gains in RL stage, yielding an average improvement of 5.0 points.

[1165] The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics

Fabio Turazza, Marco Picone, Marco Mamei

Main category: cs.LG

TL;DR: GH-OFL: One-shot federated learning using Gaussian distribution assumptions on pretrained embeddings, transmitting only statistical moments to build classification heads without data sharing.

Details

Motivation: Reduce communication costs and privacy risks in federated learning by moving from multi-round iterative processes to one-shot approaches, overcoming limitations of existing methods that require public datasets, homogeneous models, or additional data uploads.

Method: Clients transmit only sufficient statistics (per-class counts and first/second-order moments) of pretrained embeddings. Server builds heads via: 1) Closed-form Gaussian heads (NB/LDA/QDA), 2) FisherMix (linear head with cosine margin trained on synthetic samples in Fisher subspace), 3) Proto-Hyper (lightweight low-rank residual head refining Gaussian logits via knowledge distillation on synthetic samples).

Result: GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.

Conclusion: The Gaussian-Head OFL family provides practical one-shot federated learning solutions that overcome limitations of existing approaches while maintaining privacy and reducing communication overhead.

Abstract: Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with high communication costs and privacy risks from repeated model transmissions. In contrast, one-shot federated learning (OFL) alleviates these limitations by reducing communication to a single round, thereby lowering overhead and enhancing practical deployability. Nevertheless, most existing one-shot approaches remain either impractical or constrained, for example, they often depend on the availability of a public dataset, assume homogeneous client models, or require uploading additional data or model information. To overcome these issues, we introduce the Gaussian-Head OFL (GH-OFL) family, a suite of one-shot federated methods that assume class-conditional Gaussianity of pretrained embeddings. Clients transmit only sufficient statistics (per-class counts and first/second-order moments) and the server builds heads via three components: (i) Closed-form Gaussian heads (NB/LDA/QDA) computed directly from the received statistics; (ii) FisherMix, a linear head with cosine margin trained on synthetic samples drawn in an estimated Fisher subspace; and (iii) Proto-Hyper, a lightweight low-rank residual head that refines Gaussian logits via knowledge distillation on those synthetic samples. In our experiments, GH-OFL methods deliver state-of-the-art robustness and accuracy under strong non-IID skew while remaining strictly data-free.

[1166] Unraveling the Hidden Dynamical Structure in Recurrent Neural Policies

Jin Li, Yue Wu, Mengsha Huang, Yuhao Sun, Hao He, Xianyuan Zhan

Main category: cs.LG

TL;DR: Recurrent neural policies develop stable cyclic structures resembling limit cycles in dynamical systems, which stabilize internal memory and environmental states while suppressing environmental uncertainty, explaining their superior generalization and robustness.

Details

Motivation: To understand why recurrent neural policies outperform non-recurrent counterparts in partially observable control and meta-RL tasks, particularly their superior generalization and robustness mechanisms which remain poorly understood.

Method: Analyzed hidden state domains of recurrent policies across diverse training methods, model architectures, and tasks, examining the emergence of cyclic structures during environment interaction and comparing them to limit cycles in dynamical system analysis.

Result: Found that stable cyclic structures consistently emerge in recurrent policies, resembling limit cycles in dynamical systems. These cycles stabilize both internal memory and task-relevant environmental states while suppressing environmental uncertainty. The geometry of limit cycles encodes relational structures of behaviors.

Conclusion: Limit cycles explain recurrent policies’ nice properties: they stabilize memory and environmental states, suppress nuisance variability, and encode behavioral structures that facilitate skill adaptation in non-stationary environments.

Abstract: Recurrent neural policies are widely used in partially observable control and meta-RL tasks. Their abilities to maintain internal memory and adapt quickly to unseen scenarios have offered them unparalleled performance when compared to non-recurrent counterparts. However, until today, the underlying mechanisms for their superior generalization and robustness performance remain poorly understood. In this study, by analyzing the hidden state domain of recurrent policies learned over a diverse set of training methods, model architectures, and tasks, we find that stable cyclic structures consistently emerge during interaction with the environment. Such cyclic structures share a remarkable similarity with \textit{limit cycles} in dynamical system analysis, if we consider the policy and the environment as a joint hybrid dynamical system. Moreover, we uncover that the geometry of such limit cycles also has a structured correspondence with the policies’ behaviors. These findings offer new perspectives to explain many nice properties of recurrent policies: the emergence of limit cycles stabilizes both the policies’ internal memory and the task-relevant environmental states, while suppressing nuisance variability arising from environmental uncertainty; the geometry of limit cycles also encodes relational structures of behaviors, facilitating easier skill adaptation when facing non-stationary environments.

[1167] Learning from Anonymized and Incomplete Tabular Data

Lucas Lange, Adrian Böttinger, Victor Christen, Anushka Vidanage, Peter Christen, Erhard Rahm

Main category: cs.LG

TL;DR: Novel data transformation strategies for machine learning on user-driven privacy-protected tabular data that mixes original, generalized, and missing values, showing that proper handling of anonymized values is crucial for maintaining utility.

Details

Motivation: User-driven privacy creates datasets with mixed original, generalized, and missing values, but standard ML approaches treat non-original values as new categories or missing, discarding generalization semantics and reducing utility.

Method: Proposed novel data transformation strategies that account for heterogeneous anonymization, evaluated alongside standard imputation and LLM-based approaches across multiple datasets, privacy configurations, and deployment scenarios.

Result: The method reliably regains utility, showing that generalized values are preferable to pure suppression, the best data preparation strategy depends on the scenario, and consistent data representations are crucial for downstream utility.

Conclusion: Effective learning from privacy-protected data is tied to appropriate handling of anonymized values, with generalized values being better than suppression and scenario-dependent strategies being optimal.

Abstract: User-driven privacy allows individuals to control whether and at what granularity their data is shared, leading to datasets that mix original, generalized, and missing values within the same records and attributes. While such representations are intuitive for privacy, they pose challenges for machine learning, which typically treats non-original values as new categories or as missing, thereby discarding generalization semantics. For learning from such tabular data, we propose novel data transformation strategies that account for heterogeneous anonymization and evaluate them alongside standard imputation and LLM-based approaches. We employ multiple datasets, privacy configurations, and deployment scenarios, demonstrating that our method reliably regains utility. Our results show that generalized values are preferable to pure suppression, that the best data preparation strategy depends on the scenario, and that consistent data representations are crucial for maintaining downstream utility. Overall, our findings highlight that effective learning is tied to the appropriate handling of anonymized values.

[1168] Statistical Learning Theory in Lean 4: Empirical Processes from Scratch

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

Main category: cs.LG

TL;DR: First comprehensive Lean 4 formalization of statistical learning theory with empirical process theory foundations, including Gaussian Lipschitz concentration, Dudley’s entropy integral theorem, and applications to least-squares regression.

Details

Motivation: To create a formal, verified foundation for statistical learning theory in Lean 4, addressing missing content in the Mathlib library and establishing rigorous mathematical foundations for machine learning theory.

Method: Human-AI collaborative workflow where humans design proof strategies and AI agents execute tactical proof construction, implementing Gaussian Lipschitz concentration, Dudley’s entropy integral theorem, and applying to least-squares regression.

Result: Successfully formalized key statistical learning theory concepts in Lean 4, exposed and resolved implicit assumptions in standard textbooks, and created a human-verified toolbox for SLT with sharp rates for regression problems.

Conclusion: Establishes a reusable formal foundation for statistical learning theory, opens doors for future machine learning theory developments, and demonstrates the value of human-AI collaboration in formal mathematics.

Abstract: We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley’s entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at https://github.com/YuanheZ/lean-stat-learning-theory

[1169] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-$k$ Activations

Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li

Main category: cs.LG

TL;DR: MiTA attention: A compress-and-route efficient attention mechanism that compresses N-width MLP using landmark queries and constructs deformable experts via top-k activated key-value pairs.

Details

Motivation: Standard attention scales poorly for long sequences due to O(N²) complexity. Recent work views attention as fast-weight MLP where width equals sequence length, motivating Mixture-of-Experts approaches. Need more efficient methods that maintain expressive capacity while reducing computational cost.

Method: Proposes Mixture of Top-k Activations (MiTA): 1) Compresses N-width MLP using landmark queries, 2) Constructs deformable experts by gathering top-k activated key-value pairs for each landmark query, 3) Implements compress-and-route strategy combining compression and routing.

Result: Preliminary experiments on vision tasks show promise of MiTA attention. Results motivate further investigation on optimization and broader applications in more challenging settings.

Conclusion: MiTA attention provides a unifying framework for efficient attention methods through fast-weight scaling perspective. The compress-and-route strategy offers promising direction for efficient attention in long-context scenarios.

Abstract: The attention operator in Transformers can be viewed as a two-layer fast-weight MLP, whose weights are dynamically instantiated from input tokens and whose width equals sequence length $N$. As the context extends, the expressive capacity of such an $N$-width MLP increases, but scaling its fast weights becomes prohibitively expensive for extremely long sequences. Recently, this fast-weight scaling perspective has motivated the Mixture-of-Experts (MoE) attention, which partitions the sequence into fast-weight experts and sparsely routes the tokens to them. In this paper, we elevate this perspective to a unifying framework for a wide range of efficient attention methods by interpreting them as scaling fast weights through routing and/or compression. Then we propose a compress-and-route strategy, which compresses the $N$-width MLP into a narrower one using a small set of landmark queries and constructs deformable experts by gathering top-$k$ activated key-value pairs for each landmark query. We call this strategy a Mixture of Top-$k$ Activations (MiTA), and refer to the resulting efficient mechanism as MiTA attention. Preliminary experiments on vision tasks demonstrate the promise of our MiTA attention and motivate further investigation on its optimization and broader applications in more challenging settings.

[1170] SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

Qifan Yu, Xinyu Ma, Zhijian Zhuo, Minrui Wang, Deyi Liu, Shiyi Zhan, Yiyuan Ma, Liang Xiang, Xingyan Bin, Di He

Main category: cs.LG

TL;DR: SPARKLING is a novel framework for mid-stage width expansion in progressive learning that addresses training instabilities through signal preservation and symmetry breaking techniques.

Details

Motivation: Progressive learning reduces pre-training costs by gradually increasing model scale, but width expansion during mid-stage training remains challenging due to severe training instabilities that disrupt activation statistics and cause loss spikes.

Method: SPARKLING uses RMS-scale consistency for signal preservation to stabilize activation statistics during width expansion, and asymmetric optimizer state resetting with learning rate re-warmup for symmetry breaking to maintain feature diversity.

Result: Extensive experiments on Mixture-of-Experts models show SPARKLING consistently outperforms training from scratch, reducing training costs by up to 35% under 2× width expansion across multiple width axes and optimizer families.

Conclusion: SPARKLING effectively addresses mid-stage width expansion challenges in progressive learning, enabling significant computational savings while maintaining model performance.

Abstract: Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing {S}ignal {P}reservation {A}nd symmet{R}y brea{K}ing for width-progressive {L}earn{ING}), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under $2\times$ width expansion.

[1171] Lotus: Efficient LLM Training by Randomized Low-Rank Gradient Projection with Adaptive Subspace Switching

Tianhao Miao, Zhongyuan Bao, Lejun Zhang

Main category: cs.LG

TL;DR: Lotus is a memory-efficient training method that reduces both training time and memory consumption by efficiently transitioning between low-rank gradient subspaces without expensive SVD computations.

Details

Motivation: Current memory-efficient training methods like GaLore suffer from trade-offs between memory consumption, training time, and model performance. GaLore reduces memory but adds significant training time overhead due to SVD computations on gradients.

Method: Lotus modifies the projection process by introducing a criterion to quantify displacement of unit gradients, enabling efficient transitions between low-rank gradient subspaces without expensive SVD operations.

Result: Lotus achieves 30% reduction in training time and 40% decrease in memory consumption for gradient and optimizer states, while outperforming baseline methods in both pre-training and fine-tuning tasks.

Conclusion: Lotus resolves the trade-off between memory efficiency and training time in large-scale model training, providing a more efficient alternative to existing methods like GaLore.

Abstract: Training efficiency in large-scale models is typically assessed through memory consumption, training time, and model performance. Current methods often exhibit trade-offs among these metrics, as optimizing one generally degrades at least one of the others. Addressing this trade-off remains a central challenge in algorithm design. While GaLore enables memory-efficient training by updating gradients in a low-rank subspace, it incurs a comparable extra training time cost due to the Singular Value Decomposition(SVD) process on gradients. In this paper, we propose Lotus, a method that resolves this trade-off by simply modifying the projection process. We propose a criterion that quantifies the displacement of the unit gradient to enable efficient transitions between low-rank gradient subspaces. Experimental results indicate that Lotus is the most efficient method, achieving a 30% reduction in training time and a 40% decrease in memory consumption for gradient and optimizer states. Additionally, it outperforms the baseline method in both pre-training and fine-tuning tasks.

[1172] RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Yinjie Wang, Tianbao Xie, Ke Shen, Mengdi Wang, Ling Yang

Main category: cs.LG

TL;DR: RLAnything is a reinforcement learning framework that dynamically optimizes environment, policy, and reward models through closed-loop optimization for LLM and agentic scenarios.

Details

Motivation: To create a comprehensive RL framework that can strengthen any LLM or agentic system by dynamically optimizing all components (environment, policy, reward) through integrated feedback loops, rather than relying on static or human-labeled signals.

Method: Closed-loop optimization framework with three key components: 1) Policy trained with integrated step-wise and outcome feedback, 2) Reward model jointly optimized via consistency feedback, 3) Theory-motivated automatic environment adaptation using critic feedback from both policy and reward models.

Result: Each component consistently improves overall system performance. RLAnything boosts Qwen3-VL-8B-Thinking by 9.1% on OSWorld, Qwen2.5-7B-Instruct by 18.7% on AlfWorld and 11.9% on LiveBench. Optimized reward-model signals outperform human-labeled outcomes.

Conclusion: RLAnything provides an effective framework for strengthening LLM and agentic systems through dynamic closed-loop optimization of all RL components, demonstrating substantial performance gains across diverse tasks.

Abstract: We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenarios. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL

[1173] Mechanistic Interpretability of Brain-to-Speech Models Across Speech Modes

Maryam Maghsoudi, Ayushi Mishra

Main category: cs.LG

TL;DR: Mechanistic interpretability study of brain-to-speech decoding models reveals cross-modal speech representations lie on continuous manifolds mediated by compact subspaces rather than diffuse activity.

Details

Motivation: While brain-to-speech decoding models work well across vocalized, mimed, and imagined speech, the fundamental mechanisms of how these models capture and transmit information across different speech modalities remain poorly understood.

Method: Used mechanistic interpretability techniques including cross-mode activation patching, tri-modal interpolation, coarse-to-fine causal tracing, causal scrubbing, and neuron-level activation patching to investigate internal representations of neural speech decoders.

Result: Found that speech modes lie on a shared continuous causal manifold, and cross-mode transfer is mediated by compact, layer-specific subspaces rather than diffuse activity. Small but not distributed subsets of neurons affect cross-mode transfer.

Conclusion: Provides causal explanation for how speech modality information is organized in brain-to-speech models, revealing hierarchical and direction-dependent representational structure across speech modes.

Abstract: Brain-to-speech decoding models demonstrate robust performance in vocalized, mimed, and imagined speech; yet, the fundamental mechanisms via which these models capture and transmit information across different speech modalities are less explored. In this work, we use mechanistic interpretability to causally investigate the internal representations of a neural speech decoder. We perform cross-mode activation patching of internal activations across speech modes, and use tri-modal interpolation to examine whether speech representations vary discretely or continuously. We use coarse-to-fine causal tracing and causal scrubbing to find localized causal structure, allowing us to find internal subspaces that are sufficient for cross-mode transfer. In order to determine how finely distributed these effects are within layers, we perform neuron-level activation patching. We discover that small but not distributed subsets of neurons, rather than isolated units, affect the cross-mode transfer. Our results show that speech modes lie on a shared continuous causal manifold, and cross-mode transfer is mediated by compact, layer-specific subspaces rather than diffuse activity. Together, our findings give a causal explanation for how speech modality information is organized and used in brain-to-speech decoding models, revealing hierarchical and direction-dependent representational structure across speech modes.

[1174] Sample Efficient Active Algorithms for Offline Reinforcement Learning

Soumyadeep Roy, Shashwat Kushwaha, Ambedkar Dukkipati

Main category: cs.LG

TL;DR: ActiveRL combines offline RL with limited online interactions using Gaussian Process uncertainty modeling to selectively refine uncertain regions, achieving better sample efficiency than purely offline methods.

Details

Motivation: Offline RL suffers from poor state-action space coverage and distributional shift problems. The paper aims to address this by allowing limited online interactions to selectively refine uncertain regions of the learned value function, which is called Active Reinforcement Learning (ActiveRL).

Method: Proposes an ActiveRL algorithm using Gaussian Process (GP) uncertainty modeling. Uses GP concentration inequalities and information-gain bounds to guide uncertainty reduction. The method selectively collects online data in uncertain regions to accelerate value-function convergence.

Result: Theoretical analysis shows ActiveRL can learn an ε-optimal policy with O(1/ε²) active transitions, improving upon the Ω(1/ε²(1-γ)⁴) rate of purely offline methods. Experimental results validate the algorithm and theoretical findings.

Conclusion: ActiveRL achieves near-optimal information efficiency by guided uncertainty reduction, bridging Bayesian nonparametric regression and reinforcement learning theories. Limited online interactions significantly improve sample efficiency over purely offline methods.

Abstract: Offline reinforcement learning (RL) enables policy learning from static data but often suffers from poor coverage of the state-action space and distributional shift problems. This problem can be addressed by allowing limited online interactions to selectively refine uncertain regions of the learned value function, which is referred to as Active Reinforcement Learning (ActiveRL). While there has been good empirical success, no theoretical analysis is available in the literature. We fill this gap by developing a rigorous sample-complexity analysis of ActiveRL through the lens of Gaussian Process (GP) uncertainty modeling. In this respect, we propose an algorithm and using GP concentration inequalities and information-gain bounds, we derive high-probability guarantees showing that an $ε$-optimal policy can be learned with ${\mathcal{O}}(1/ε^2)$ active transitions, improving upon the $Ω(1/ε^2(1-γ)^4)$ rate of purely offline methods. Our results reveal that ActiveRL achieves near-optimal information efficiency, that is, guided uncertainty reduction leads to accelerated value-function convergence with minimal online data. Our analysis builds on GP concentration inequalities and information-gain bounds, bridging Bayesian nonparametric regression and reinforcement learning theories. We conduct several experiments to validate the algorithm and theoretical findings.

[1175] BicKD: Bilateral Contrastive Knowledge Distillation

Jiangnan Zhu, Yukai Xu, Li Xiong, Yixuan Liu, Junxu Liu, Hong kyu Lee, Yujie Gu

Main category: cs.LG

TL;DR: BicKD introduces bilateral contrastive loss for knowledge distillation that enhances class-wise orthogonality while maintaining sample-wise consistency, improving knowledge transfer over vanilla KD.

Details

Motivation: Vanilla knowledge distillation only performs sample-wise probability alignment between teacher and student models, lacking class-wise comparison mechanisms and structural constraints on the probability space.

Method: Proposes bilateral contrastive knowledge distillation (BicKD) with a novel bilateral contrastive loss that intensifies orthogonality among different class generalization spaces while preserving consistency within the same class.

Result: Extensive experiments show BicKD enhances knowledge transfer and consistently outperforms state-of-the-art knowledge distillation techniques across various model architectures and benchmarks.

Conclusion: BicKD provides an effective methodology for knowledge distillation that addresses limitations of vanilla KD through bilateral contrastive learning, improving both sample-wise and class-wise alignment.

Abstract: Knowledge distillation (KD) is a machine learning framework that transfers knowledge from a teacher model to a student model. The vanilla KD proposed by Hinton et al. has been the dominant approach in logit-based distillation and demonstrates compelling performance. However, it only performs sample-wise probability alignment between teacher and student’s predictions, lacking an mechanism for class-wise comparison. Besides, vanilla KD imposes no structural constraint on the probability space. In this work, we propose a simple yet effective methodology, bilateral contrastive knowledge distillation (BicKD). This approach introduces a novel bilateral contrastive loss, which intensifies the orthogonality among different class generalization spaces while preserving consistency within the same class. The bilateral formulation enables explicit comparison of both sample-wise and class-wise prediction patterns between teacher and student. By emphasizing probabilistic orthogonality, BicKD further regularizes the geometric structure of the predictive distribution. Extensive experiments show that our BicKD method enhances knowledge transfer, and consistently outperforms state-of-the-art knowledge distillation techniques across various model architectures and benchmarks.

[1176] Diving into Kronecker Adapters: Component Design Matters

Jiayu Bai, Danchen Yu, Zhenyu Liao, TianQi Hou, Feng Zhou, Robert C. Qiu, Zenan Ling

Main category: cs.LG

TL;DR: CDKA introduces component-designed Kronecker adapters with optimized component dimensions and numbers for efficient fine-tuning of large models, outperforming existing adapter methods.

Details

Motivation: Existing Kronecker adapters treat component structure as fixed or heuristic, leaving the dimensions and number of Kronecker components underexplored. The authors identify component structure as a key factor governing adapter capacity and aim to optimize it.

Method: Proposes Component Designed Kronecker Adapters (CDKA) with fine-grained analysis of both dimensions and number of Kronecker components. Provides parameter-budget-aware configuration guidelines and a tailored training stabilization strategy for practical deployment.

Result: Experiments across various natural language processing tasks demonstrate the effectiveness of CDKA, showing improved performance over existing adapter methods.

Conclusion: Component structure is crucial for Kronecker adapter performance, and CDKA provides an effective approach to optimize this structure for efficient fine-tuning of large models.

Abstract: Kronecker adapters have emerged as a promising approach for fine-tuning large-scale models, enabling high-rank updates through tunable component structures. However, existing work largely treats the component structure as a fixed or heuristic design choice, leaving the dimensions and number of Kronecker components underexplored. In this paper, we identify component structure as a key factor governing the capacity of Kronecker adapters. We perform a fine-grained analysis of both the dimensions and number of Kronecker components. In particular, we show that the alignment between Kronecker adapters and full fine-tuning depends on component configurations. Guided by these insights, we propose Component Designed Kronecker Adapters (CDKA). We further provide parameter-budget-aware configuration guidelines and a tailored training stabilization strategy for practical deployment. Experiments across various natural language processing tasks demonstrate the effectiveness of CDKA. Code is available at https://github.com/rainstonee/CDKA.

[1177] Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics

Boxuan Zhang, Weipu Zhang, Zhaohan Feng, Wei Xiao, Jian Sun, Jie Chen, Gang Wang

Main category: cs.LG

TL;DR: MoW: Mixture-of-World Models for multi-task RL using modular VAEs for visual compression, hybrid Transformer dynamics with task-conditioned experts, and gradient-based task clustering for parameter efficiency.

Details

Motivation: Address sample efficiency in multi-task RL with visual heterogeneity; standard monolithic world models struggle with diverse task dynamics, leading to poor reconstruction and prediction.

Method: Combines modular variational autoencoders for task-adaptive visual compression, hybrid Transformer-based dynamics model with task-conditioned experts and shared backbone, and gradient-based task clustering for parameter allocation.

Result: On Atari 100k: 110.4% mean human-normalized score (vs 114.2% for 26 task-specific ensemble) with 50% fewer parameters. On Meta-World: 74.5% average success rate within 300k steps, new SOTA.

Conclusion: MoW provides scalable and parameter-efficient foundation for generalist world models in multi-task RL with visual domains.

Abstract: A fundamental challenge in multi-task reinforcement learning (MTRL) is achieving sample efficiency in visual domains where tasks exhibit substantial heterogeneity in both observations and dynamics. Model-based reinforcement learning offers a promising path to improved sample efficiency through world models, but standard monolithic architectures struggle to capture diverse task dynamics, resulting in poor reconstruction and prediction accuracy. We introduce Mixture-of-World Models (MoW), a scalable architecture that combines modular variational autoencoders for task-adaptive visual compression, a hybrid Transformer-based dynamics model with task-conditioned experts and a shared backbone, and a gradient-based task clustering strategy for efficient parameter allocation. On the Atari 100k benchmark, a single MoW agent trained once on 26 Atari games achieves a mean human-normalized score of 110.4%, competitive with the score of 114.2% achieved by STORM, an ensemble of 26 task-specific models, while using 50% fewer parameters. On Meta-World, MoW achieves a 74.5% average success rate within 300 thousand environment steps, establishing a new state of the art. These results demonstrate that MoW provides a scalable and parameter-efficient foundation for generalist world models.

[1178] From Intents to Actions: Agentic AI in Autonomous Networks

Burak Demirel, Pablo Soldati, Yu Wang

Main category: cs.LG

TL;DR: An Agentic AI system for intent-driven autonomous networks using three specialized agents: interpreter agent (language models), optimizer agent (optimization problems), and controller agent (multi-objective reinforcement learning) to transform high-level service intents into network control actions.

Details

Motivation: Telecommunication networks need to operate autonomously while supporting heterogeneous services with diverse and conflicting intents, but existing heuristic approaches cannot effectively transform high-level intents into concrete control actions.

Method: Three-agent system: 1) Supervisory interpreter agent using language models for lexical parsing and cognitive refinement of intents; 2) Optimizer agent converting templates into optimization problems and analyzing trade-offs; 3) Preference-driven controller agent using multi-objective reinforcement learning to operate near Pareto frontier.

Result: The system enables networks to autonomously interpret, reason over, adapt to, and act upon diverse intents and network conditions in a scalable manner, transforming high-level intents into low-level control actions.

Conclusion: The proposed Agentic AI system provides a scalable solution for intent-driven autonomous networks by combining language models, optimization techniques, and reinforcement learning to handle diverse service requirements.

Abstract: Telecommunication networks are increasingly expected to operate autonomously while supporting heterogeneous services with diverse and often conflicting intents – that is, performance objectives, constraints, and requirements specific to each service. However, transforming high-level intents – such as ultra-low latency, high throughput, or energy efficiency – into concrete control actions (i.e., low-level actuator commands) remains beyond the capability of existing heuristic approaches. This work introduces an Agentic AI system for intent-driven autonomous networks, structured around three specialized agents. A supervisory interpreter agent, powered by language models, performs both lexical parsing of intents into executable optimization templates and cognitive refinement based on feedback, constraint feasibility, and evolving network conditions. An optimizer agent converts these templates into tractable optimization problems, analyzes trade-offs, and derives preferences across objectives. Lastly, a preference-driven controller agent, based on multi-objective reinforcement learning, leverages these preferences to operate near the Pareto frontier of network performance that best satisfies the original intent. Collectively, these agents enable networks to autonomously interpret, reason over, adapt to, and act upon diverse intents and network conditions in a scalable manner.

[1179] Richer Bayesian Last Layers with Subsampled NTK Features

Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Álvaro Cartea, Yarin Gal, Jose Miguel Hernández-Lobato, Kamil Ciosek

Main category: cs.LG

TL;DR: Bayesian Last Layers (BLLs) underestimate epistemic uncertainty; proposed method improves BLLs using Neural Tangent Kernel feature projection to account for full network variability while maintaining computational efficiency.

Details

Motivation: BLLs are computationally efficient for uncertainty estimation but underestimate epistemic uncertainty because they only apply Bayesian treatment to the final layer, ignoring uncertainty from earlier layers.

Method: Leverages projection of Neural Tangent Kernel (NTK) features onto the space spanned by last-layer features, enabling posterior inference that accounts for full network variability. Introduces uniform subsampling scheme for estimating projection matrix to reduce computational cost.

Result: Method yields posterior variances provably greater or equal to standard BLLs, correcting underestimation tendency. Empirical evaluations on UCI regression, contextual bandits, image classification, and OOD detection show improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines.

Conclusion: Proposed method improves BLLs by better accounting for epistemic uncertainty from all network layers while maintaining computational efficiency through NTK feature projection and subsampling techniques.

Abstract: Bayesian Last Layers (BLLs) provide a convenient and computationally efficient way to estimate uncertainty in neural networks. However, they underestimate epistemic uncertainty because they apply a Bayesian treatment only to the final layer, ignoring uncertainty induced by earlier layers. We propose a method that improves BLLs by leveraging a projection of Neural Tangent Kernel (NTK) features onto the space spanned by the last-layer features. This enables posterior inference that accounts for variability of the full network while retaining the low computational cost of inference of a standard BLL. We show that our method yields posterior variances that are provably greater or equal to those of a standard BLL, correcting its tendency to underestimate epistemic uncertainty. To further reduce computational cost, we introduce a uniform subsampling scheme for estimating the projection matrix and for posterior inference. We derive approximation bounds for both types of sub-sampling. Empirical evaluations on UCI regression, contextual bandits, image classification, and out-of-distribution detection tasks in image and tabular datasets, demonstrate improved calibration and uncertainty estimates compared to standard BLLs and competitive baselines, while reducing computational cost.

[1180] Multi-LLM Adaptive Conformal Inference for Reliable LLM Responses

Kangjun Noh, Seongchan Lee, Ilmun Kim, Kyungwoo Song

Main category: cs.LG

TL;DR: MACI is a conformal inference method that models factuality as a product of claim-level scores using multiplicative filtering, achieving better retention while maintaining validity through group-conditional calibration.

Details

Motivation: Existing conformal inference approaches for LLM factuality are either too conservative (discarding many true claims) or rely on simple linear models that fail to capture complex group structures, limiting their effectiveness in high-stakes domains.

Method: Reformulates conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Uses ensembles to produce more accurate factuality scores and preserves validity through group-conditional calibration.

Result: MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines while maintaining validity.

Conclusion: MACI provides an effective solution for ensuring LLM factuality with distribution-free guarantees, offering better performance than existing conformal inference methods.

Abstract: Ensuring factuality is essential for the safe use of Large Language Models (LLMs) in high-stakes domains such as medicine and law. Conformal inference provides distribution-free guarantees, but existing approaches are either overly conservative, discarding many true-claims, or rely on adaptive error rates and simple linear models that fail to capture complex group structures. To address these challenges, we reformulate conformal inference in a multiplicative filtering setting, modeling factuality as a product of claim-level scores. Our method, Multi-LLM Adaptive Conformal Inference (MACI), leverages ensembles to produce more accurate factuality-scores, which in our experiments led to higher retention, while validity is preserved through group-conditional calibration. Experiments show that MACI consistently achieves user-specified coverage with substantially higher retention and lower time cost than baselines. Our repository is available at https://github.com/MLAI-Yonsei/MACI

[1181] EDIS: Diagnosing LLM Reasoning via Entropy Dynamics

Chenghua Zhu, Siyan Wu, Xiangkang Zeng, Zishan Xu, Zhaolu Kang, Yifu Guo, Yuquan Lu, Junduan Huang, Guojing Zhou

Main category: cs.LG

TL;DR: Analyzing temporal entropy dynamics during LLM generation reveals characteristic instability patterns in erroneous reasoning, leading to a new metric (EDIS) that improves inference-time selection and training-time curation.

Details

Motivation: Current approaches treat confidence as static aggregated statistics, but the temporal evolution of confidence during generation may carry richer information about reasoning quality. The authors aim to explore whether entropy dynamics can reveal intrinsic patterns distinguishing correct from incorrect reasoning.

Method: Analyze token-level entropy trajectories during LLM generation, identify characteristic patterns (burst spikes and peak-valley spikes) in erroneous solutions, and introduce the Entropy Dynamics Instability Score (EDIS) as a trajectory-level metric quantifying instability in entropy evolution.

Result: Erroneous solutions exhibit unstable entropy dynamics with characteristic patterns that persist across models and training stages. EDIS serves as an effective diagnostic signal that substantially improves reasoning accuracy through inference-time selection and offers potential for training-time sample curation.

Conclusion: Entropy dynamics provide an underexplored yet informative lens for understanding and improving LLM reasoning, with EDIS offering practical benefits for both inference-time selection and training-time sample curation.

Abstract: Entropy-based confidence signals are increasingly leveraged to improve reasoning in large language models (LLMs), yet existing approaches treat confidence as a static quantity – typically aggregated over tokens. We show that the \emph{temporal evolution} of confidence during generation carries richer information than aggregate statistics alone. Analyzing token-level entropy trajectories, we identify characteristic patterns distinguishing correct from incorrect reasoning: erroneous solutions exhibit unstable dynamics, including burst spikes (sustained uncertainty growth) and peak-valley spikes (sharp rebounds following transient confidence). These patterns persist across models and training stages, suggesting they reflect intrinsic properties of reasoning failure rather than superficial noise. To formalize this observation, we introduce the Entropy Dynamics Instability Score (\textbf{EDIS}), a trajectory-level metric quantifying instability in entropy evolution. EDIS serves as an effective diagnostic signal for inference-time selection, substantially improving reasoning accuracy, and offers a promising direction for training-time sample curation. Our findings establish entropy dynamics as an underexplored yet informative lens for understanding and improving LLM reasoning.

[1182] Gradient-Aligned Calibration for Post-Training Quantization of Diffusion Models

Dung Anh Hoang, Cuong Pham anh Trung Le, Jianfei Cai, Toan Do

Main category: cs.LG

TL;DR: A novel post-training quantization method for diffusion models that learns optimal weights for calibration samples across different timesteps to improve quantization performance.

Details

Motivation: Diffusion models have slow inference speed, high memory usage, and computational demands. Existing PTQ methods use uniform weights for calibration samples across timesteps, which is sub-optimal since different timesteps contribute differently to the diffusion process and have varying activation distributions.

Method: Proposes a PTQ method that learns to assign optimal weights to calibration samples to align the quantized model’s gradients across timesteps, addressing the issue of conflicting gradients that degrade performance when using uniform quantization.

Result: Extensive experiments on CIFAR-10, LSUN-Bedrooms, and ImageNet demonstrate superiority over other PTQ methods for diffusion models.

Conclusion: The proposed method effectively addresses the limitations of uniform PTQ for diffusion models by learning optimal calibration sample weights across timesteps, improving quantization performance.

Abstract: Diffusion models have shown remarkable performance in image synthesis by progressively estimating a smooth transition from a Gaussian distribution of noise to a real image. Unfortunately, their practical deployment is limited by slow inference speed, high memory usage, and the computational demands of the noise estimation process. Post-training quantization (PTQ) emerges as a promising solution to accelerate sampling and reduce memory overhead for diffusion models. Existing PTQ methods for diffusion models typically apply uniform weights to calibration samples across timesteps, which is sub-optimal since data at different timesteps may contribute differently to the diffusion process. Additionally, due to varying activation distributions and gradients across timesteps, a uniform quantization approach is sub-optimal. Each timestep requires a different gradient direction for optimal quantization, and treating them equally can lead to conflicting gradients that degrade performance. In this paper, we propose a novel PTQ method that addresses these challenges by assigning appropriate weights to calibration samples. Specifically, our approach learns to assign optimal weights to calibration samples to align the quantized model’s gradients across timesteps, facilitating the quantization process. Extensive experiments on CIFAR-10, LSUN-Bedrooms, and ImageNet demonstrate the superiority of our method compared to other PTQ methods for diffusion models.

[1183] The BoBW Algorithms for Heavy-Tailed MDPs

Yu Chen, Yuhao Liu, Jiatai Huang, Yihan Du, Longbo Huang

Main category: cs.LG

TL;DR: HT-FTRL algorithms for episodic Markov Decision Processes with heavy-tailed feedback achieve Best-of-Both-Worlds guarantees: instance-independent regret in adversarial environments and logarithmic regret in stochastic environments.

Details

Motivation: Existing approaches for heavy-tailed MDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes, creating a need for algorithms that perform well in both settings.

Method: HT-FTRL-OM applies FTRL over occupancy measures with novel skipping loss estimators for known transitions. HT-FTRL-UOB uses pessimistic skipping loss estimators for unknown transitions with local control mechanisms and suboptimal-mass propagation principles.

Result: HT-FTRL-OM achieves Õ(T^{1/α}) regret in adversarial regimes and O(log T) in stochastic regimes. HT-FTRL-UOB achieves Õ(T^{1/α} + √T) in adversarial and O(log²T) in stochastic regimes.

Conclusion: The proposed algorithms provide Best-of-Both-Worlds guarantees for heavy-tailed MDPs through novel technical insights including local control mechanisms and regret decomposition isolating transition uncertainty.

Abstract: We investigate episodic Markov Decision Processes with heavy-tailed feedback (HTMDPs). Existing approaches for HTMDPs are conservative in stochastic environments and lack adaptivity in adversarial regimes. In this work, we propose algorithms HT-FTRL-OM and HT-FTRL-UOB for HTMDPs that achieve Best-of-Both-Worlds (BoBW) guarantees: instance-independent regret in adversarial environments and logarithmic instance-dependent regret in self-bounding (including the stochastic case) environments. For the known transition setting, HT-FTRL-OM applies the Follow-The-Regularized-Leader (FTRL) framework over occupancy measures with novel skipping loss estimators, achieving a $\widetilde{\mathcal{O}}(T^{1/α})$ regret bound in adversarial regimes and a $\mathcal{O}(\log T)$ regret in stochastic regimes. Building upon this framework, we develop a novel algorithm HT-FTRL-UOB to tackle the more challenging unknown-transition setting. This algorithm employs a pessimistic skipping loss estimator and achieves a $\widetilde{\mathcal{O}}(T^{1/α} + \sqrt{T})$ regret in adversarial regimes and a $\mathcal{O}(\log^2(T))$ regret in stochastic regimes. Our analysis overcomes key barriers through several technical insights, including a local control mechanism for heavy-tailed shifted losses, a new suboptimal-mass propagation principle, and a novel regret decomposition that isolates transition uncertainty from heavy-tailed estimation errors and skipping bias.

[1184] Dispelling the Curse of Singularities in Neural Network Optimizations

Hengjie Cao, Mengyi Chen, Yifeng Yang, Fang Dong, Ruijun Huang, Anrui Chen, Jixian Zhou, Mingzhi Dong, Yujiang Wang, Dongsheng Li, Wenyi Fang, Yuanyi Lin, Fan Wu, Li Shang

Main category: cs.LG

TL;DR: The paper investigates optimization instability in deep neural networks through the lens of parametric singularities, showing how they amplify during training and lead to sharp loss explosions, and proposes Parametric Singularity Smoothing (PSS) to mitigate this issue.

Details

Motivation: The paper aims to understand optimization instability in deep neural networks from a novel perspective - the emergence and amplification of singularities in parametric space, which has been less explored despite being insightful for addressing training instability issues.

Method: The authors analyze how parametric singularities grow with gradient updates and intensify alignment with representations, leading to increased singularities in representation space. They show gradient Frobenius norms are bounded by top singular values of weight matrices. To counter the “curse of singularities,” they propose Parametric Singularity Smoothing (PSS), a lightweight method for smoothing singular spectra of weight matrices.

Result: Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization performance.

Conclusion: The work provides a novel perspective on optimization instability through parametric singularities and offers an effective solution (PSS) that addresses training instability while improving efficiency and generalization across various settings.

Abstract: This work investigates the optimization instability of deep neural networks from a less-explored yet insightful perspective: the emergence and amplification of singularities in the parametric space. Our analysis reveals that parametric singularities inevitably grow with gradient updates and further intensify alignment with representations, leading to increased singularities in the representation space. We show that the gradient Frobenius norms are bounded by the top singular values of the weight matrices, and as training progresses, the mutually reinforcing growth of weight and representation singularities, termed the curse of singularities, relaxes these bounds, escalating the risk of sharp loss explosions. To counter this, we propose Parametric Singularity Smoothing (PSS), a lightweight, flexible, and effective method for smoothing the singular spectra of weight matrices. Extensive experiments across diverse datasets, architectures, and optimizers demonstrate that PSS mitigates instability, restores trainability even after failure, and improves both training efficiency and generalization.

[1185] Imperfect Influence, Preserved Rankings: A Theory of TRAK for Data Attribution

Han Tong, Shubhangi Ghosh, Haolin Zou, Arian Maleki

Main category: cs.LG

TL;DR: Theoretical analysis of TRAK data attribution algorithm showing its approximations introduce errors but preserve relative ranking of training data influence.

Details

Motivation: Data attribution is crucial for interpreting AI models, but the widely used TRAK algorithm lacks theoretical understanding of when its approximations work or fail.

Method: Provides theoretical analysis of TRAK algorithm, characterizing performance and quantifying approximation errors, then validates through simulations and empirical studies.

Result: TRAK’s approximations introduce significant errors, but the estimated influence remains highly correlated with original influence, preserving relative ranking of data points.

Conclusion: TRAK is theoretically sound for relative ranking despite approximation errors, providing justification for its empirical success in data attribution tasks.

Abstract: Data attribution, tracing a model’s prediction back to specific training data, is an important tool for interpreting sophisticated AI models. The widely used TRAK algorithm addresses this challenge by first approximating the underlying model with a kernel machine and then leveraging techniques developed for approximating the leave-one-out (ALO) risk. Despite its strong empirical performance, the theoretical conditions under which the TRAK approximations are accurate as well as the regimes in which they break down remain largely unexplored. In this paper, we provide a theoretical analysis of the TRAK algorithm, characterizing its performance and quantifying the errors introduced by the approximations on which the method relies. We show that although the approximations incur significant errors, TRAK’s estimated influence remains highly correlated with the original influence and therefore largely preserves the relative ranking of data points. We corroborate our theoretical results through extensive simulations and empirical studies.

[1186] High-accuracy sampling for diffusion models and log-concave distributions

Fan Chen, Sinho Chewi, Constantinos Daskalakis, Alexander Rakhlin

Main category: cs.LG

TL;DR: Novel diffusion model sampling algorithms achieve exponential speedup with polylog(1/δ) steps using δ-accurate score estimates, with complexity scaling with data dimension and intrinsic structure.

Details

Motivation: Current diffusion model sampling methods require many steps to achieve high accuracy, limiting practical applications. The paper aims to develop exponentially faster sampling algorithms that maintain accuracy while dramatically reducing computational complexity.

Method: Develops new algorithms for diffusion model sampling that leverage δ-accurate score estimates in L² norm. The approach analyzes complexity under different data assumptions: minimal data assumptions, non-uniform L-Lipschitz conditions, and intrinsic dimensionality considerations.

Result: Achieves exponential improvement over previous results with polylog(1/δ) steps complexity. Under minimal assumptions: Õ(d polylog(1/δ)); under L-Lipschitz: Õ(√dL polylog(1/δ)); with intrinsic dimension d⋆: Õ(d⋆ polylog(1/δ)). Also yields first polylog(1/δ) sampler for general log-concave distributions using only gradients.

Conclusion: The paper presents groundbreaking sampling algorithms that achieve exponential speedup for diffusion models, with complexity scaling favorably with data dimension and structure, enabling more efficient generation and inference in high-dimensional settings.

Abstract: We present algorithms for diffusion model sampling which obtain $δ$-error in $\mathrm{polylog}(1/δ)$ steps, given access to $\widetilde O(δ)$-accurate score estimates in $L^2$. This is an exponential improvement over all previous results. Specifically, under minimal data assumptions, the complexity is $\widetilde O(d,\mathrm{polylog}(1/δ))$ where $d$ is the dimension of the data; under a non-uniform $L$-Lipschitz condition, the complexity is $\widetilde O(\sqrt{dL},\mathrm{polylog}(1/δ))$; and if the data distribution has intrinsic dimension $d_\star$, then the complexity reduces to $\widetilde O(d_\star,\mathrm{polylog}(1/δ))$. Our approach also yields the first $\mathrm{polylog}(1/δ)$ complexity sampler for general log-concave distributions using only gradient evaluations.

[1187] Finding Differentially Private Second Order Stationary Points in Stochastic Minimax Optimization

Difei Xu, Youming Tao, Meng Ding, Chenglin Fan, Di Wang

Main category: cs.LG

TL;DR: First study of differentially private second-order stationary points for stochastic minimax optimization, proposing a first-order method with privacy guarantees for both empirical and population risks.

Details

Motivation: Existing literature lacks methods for finding differentially private second-order stationary points in stochastic minimax optimization, focusing only on first-order points or classical minimization problems.

Method: Proposes a first-order method combining nested gradient descent-ascent with SPIDER-style variance reduction and Gaussian perturbations for privacy, using block-wise analysis to control stochastic variance and privacy noise.

Result: Establishes high-probability guarantees for reaching approximate second-order stationary points with rates matching best known private first-order stationarity results.

Conclusion: Provides first unified treatment of differentially private second-order stationary points for stochastic minimax optimization with optimal rates.

Abstract: We provide the first study of the problem of finding differentially private (DP) second-order stationary points (SOSP) in stochastic (non-convex) minimax optimization. Existing literature either focuses only on first-order stationary points for minimax problems or on SOSP for classical stochastic minimization problems. This work provides, for the first time, a unified and detailed treatment of both empirical and population risks. Specifically, we propose a purely first-order method that combines a nested gradient descent–ascent scheme with SPIDER-style variance reduction and Gaussian perturbations to ensure privacy. A key technical device is a block-wise ($q$-period) analysis that controls the accumulation of stochastic variance and privacy noise without summing over the full iteration horizon, yielding a unified treatment of both empirical-risk and population formulations. Under standard smoothness, Hessian-Lipschitzness, and strong concavity assumptions, we establish high-probability guarantees for reaching an $(α,\sqrt{ρ_Φα})$-approximate second-order stationary point with $α= \mathcal{O}( (\frac{\sqrt{d}}{n\varepsilon})^{2/3})$ for empirical risk objectives and $\mathcal{O}(\frac{1}{n^{1/3}} + (\frac{\sqrt{d}}{n\varepsilon})^{1/2})$ for population objectives, matching the best known rates for private first-order stationarity.

[1188] Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning

Shangzhe Li, Xuchao Zhang, Chetan Bansal, Weitong Zhang

Main category: cs.LG

TL;DR: Theoretical analysis of self-play finetuning for LLMs, connecting it to adversarial imitation learning via min-max game formulation, leading to a new χ²-divergence-based algorithm with improved stability.

Details

Motivation: Self-play post-training methods effectively improve LLMs without preference data, but lack theoretical foundations. The paper aims to establish theoretical connections between self-play finetuning and adversarial imitation learning.

Method: Formulates finetuning as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. Proposes a new self-play imitation algorithm based on χ²-divergence variational objective with bounded rewards for improved stability.

Result: Game-theoretic analysis shows self-play finetuning converges to equilibrium. Experiments on various language model finetuning tasks demonstrate consistent improvements over existing self-play methods, validating theoretical insights.

Conclusion: The paper provides theoretical foundations for self-play finetuning, unifies self-play imitation and preference alignment, and introduces a more stable algorithm with proven convergence properties.

Abstract: Self-play post-training methods has emerged as an effective approach for finetuning large language models and turn the weak language model into strong language model without preference data. However, the theoretical foundations for self-play finetuning remain underexplored. In this work, we tackle this by connecting self-play finetuning with adversarial imitation learning by formulating finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that the self-play finetuning will converge to it’s equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on the $χ^2$-divergence variational objective with bounded rewards and improved stability. Experiments on various of language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.

[1189] PaAno: Patch-Based Representation Learning for Time-Series Anomaly Detection

Jinju Park, Seokho Kang

Main category: cs.LG

TL;DR: PaAno is a lightweight patch-based representation learning method for time-series anomaly detection that uses 1D CNNs and contrastive learning, achieving SOTA performance with lower computational cost than heavy transformer models.

Details

Motivation: Current time-series anomaly detection methods increasingly use large transformer/foundation models which are computationally expensive and memory-intensive, making them impractical for real-time and resource-constrained scenarios. These heavy models often don't show significant performance gains over simpler methods under rigorous evaluation.

Method: Extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into vector representations. The model is trained using a combination of triplet loss and pretext loss to ensure embeddings capture informative temporal patterns. During inference, anomaly scores are computed by comparing embeddings of test patches to those of normal patches from training data.

Result: Evaluated on TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods (including heavy architectures) on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.

Conclusion: PaAno provides a lightweight yet effective alternative to computationally expensive transformer-based methods for time-series anomaly detection, offering practical advantages for real-time and resource-constrained applications while maintaining superior performance.

Abstract: Although recent studies on time-series anomaly detection have increasingly adopted ever-larger neural network architectures such as transformers and foundation models, they incur high computational costs and memory usage, making them impractical for real-time and resource-constrained scenarios. Moreover, they often fail to demonstrate significant performance gains over simpler methods under rigorous evaluation protocols. In this study, we propose Patch-based representation learning for time-series Anomaly detection (PaAno), a lightweight yet effective method for fast and efficient time-series anomaly detection. PaAno extracts short temporal patches from time-series training data and uses a 1D convolutional neural network to embed each patch into a vector representation. The model is trained using a combination of triplet loss and pretext loss to ensure the embeddings capture informative temporal patterns from input patches. During inference, the anomaly score at each time step is computed by comparing the embeddings of its surrounding patches to those of normal patches extracted from the training time-series. Evaluated on the TSB-AD benchmark, PaAno achieved state-of-the-art performance, significantly outperforming existing methods, including those based on heavy architectures, on both univariate and multivariate time-series anomaly detection across various range-wise and point-wise performance measures.

[1190] Deep Variational Contrastive Learning for Joint Risk Stratification and Time-to-Event Estimation

Pinar Erbil, Alberto Archetti, Eugenio Lomurno, Matteo Matteucci

Main category: cs.LG

TL;DR: CONVERSE is a deep survival analysis model that combines variational autoencoders with contrastive learning to achieve both high predictive performance and interpretable risk stratification for clinical decision-making.

Details

Motivation: There's a fundamental trade-off in deep survival analysis between predictive performance (high accuracy from neural networks) and interpretability (clear risk stratification from clustering methods). Current methods either achieve high accuracy but are black-box, or provide interpretable risk groups but sacrifice predictive power.

Method: CONVERSE unifies variational autoencoders with contrastive learning, using variational embeddings with multiple intra- and inter-cluster contrastive losses. It employs self-paced learning to progressively incorporate samples from easy to hard, improving training stability, and supports cluster-specific survival heads for ensemble predictions.

Result: Comprehensive evaluation on four benchmark datasets shows CONVERSE achieves competitive or superior performance compared to existing deep survival methods while maintaining meaningful patient stratification.

Conclusion: CONVERSE successfully bridges the gap between performance and interpretability in deep survival analysis, offering both accurate predictions and interpretable risk stratification for clinical applications.

Abstract: Survival analysis is essential for clinical decision-making, as it allows practitioners to estimate time-to-event outcomes, stratify patient risk profiles, and guide treatment planning. Deep learning has revolutionized this field with unprecedented predictive capabilities but faces a fundamental trade-off between performance and interpretability. While neural networks achieve high accuracy, their black-box nature limits clinical adoption. Conversely, deep clustering-based methods that stratify patients into interpretable risk groups typically sacrifice predictive power. We propose CONVERSE (CONtrastive Variational Ensemble for Risk Stratification and Estimation), a deep survival model that bridges this gap by unifying variational autoencoders with contrastive learning for interpretable risk stratification. CONVERSE combines variational embeddings with multiple intra- and inter-cluster contrastive losses. Self-paced learning progressively incorporates samples from easy to hard, improving training stability. The model supports cluster-specific survival heads, enabling accurate ensemble predictions. Comprehensive evaluation on four benchmark datasets demonstrates that CONVERSE achieves competitive or superior performance compared to existing deep survival methods, while maintaining meaningful patient stratification.

[1191] An Odd Estimator for Shapley Values

Fabian Fumagalli, Landon Butler, Justin Singh Kang, Kannan Ramchandran, R. Teal Witter

Main category: cs.LG

TL;DR: OddSHAP is a novel Shapley value estimator that leverages the insight that Shapley values depend only on the odd component of set functions, using polynomial regression on the odd subspace to achieve state-of-the-art accuracy.

Details

Motivation: The Shapley value is widely used for attribution in ML but computationally intractable. While paired sampling heuristics improve estimation, their theoretical basis was unclear. The authors aim to provide a fundamental justification for paired sampling and develop a more efficient estimator.

Method: Prove that Shapley values depend exclusively on the odd component of set functions. Develop OddSHAP: use Fourier basis to isolate odd subspace, employ polynomial regression on this subspace, and use proxy models to identify high-impact interactions to avoid combinatorial explosion.

Result: OddSHAP achieves state-of-the-art estimation accuracy in extensive benchmark evaluations, providing both theoretical justification for paired sampling and practical improvements in Shapley value computation.

Conclusion: The paper provides fundamental theoretical insights about Shapley values and odd/even function components, leading to OddSHAP - an efficient, accurate estimator that advances Shapley value approximation methods.

Abstract: The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causal inference. However, its exact computation is generally intractable, necessitating efficient approximation methods. While the most effective and popular estimators leverage the paired sampling heuristic to reduce estimation error, the theoretical mechanism driving this improvement has remained opaque. In this work, we provide an elegant and fundamental justification for paired sampling: we prove that the Shapley value depends exclusively on the odd component of the set function, and that paired sampling orthogonalizes the regression objective to filter out the irrelevant even component. Leveraging this insight, we propose OddSHAP, a novel consistent estimator that performs polynomial regression solely on the odd subspace. By utilizing the Fourier basis to isolate this subspace and employing a proxy model to identify high-impact interactions, OddSHAP overcomes the combinatorial explosion of higher-order approximations. Through an extensive benchmark evaluation, we find that OddSHAP achieves state-of-the-art estimation accuracy.

[1192] SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training

Yunjie Pan, Yongyi Yang, Hanmei Yang, Scott Mahlke

Main category: cs.LG

TL;DR: SNIP: Fine-grained adaptive mixed-precision training framework for LLMs that uses periodic statistics collection and ILP optimization to determine optimal layerwise subbyte precision, achieving up to 80% FLOP reduction while preserving model quality.

Details

Motivation: Current mixed-precision training approaches for LLMs either use uniform precision across all operations or rely on heuristic methods that don't generalize well during training, leading to suboptimal convergence, instability, and inefficient use of modern GPU subbyte precision capabilities.

Method: SNIP periodically collects statistics on activations, gradients, and optimizer states to assess precision loss impact. It defines two key metrics: loss divergence (forward pass) and weight divergence (backward pass). These metrics guide an Integer Linear Programming (ILP) problem that systematically optimizes layerwise precision to minimize quality loss while meeting efficiency targets.

Result: Experiments on 1B, 3B, 7B, and 70B Llama-like models show SNIP consistently outperforms existing baselines, reducing FLOPs by up to 80% while preserving model quality across different model sizes and training phases with minimal computational overhead.

Conclusion: SNIP provides an effective framework for fine-grained adaptive mixed-precision training of LLMs that balances efficiency and model quality, enabling more efficient use of modern GPU capabilities for large-scale model training.

Abstract: Training large language models (LLMs) efficiently while preserving model quality poses significant challenges, particularly with subbyte precision supported by state-of-the-art GPUs. Current mixed-precision training approaches either apply uniform precision to all GEMM operations or rely on heuristic-based methods that fail to generalize during training, leading to suboptimal convergence and instability. To address these challenges, this paper introduces SNIP, a fine-grained adaptive mixed-precision training framework for LLM pretraining that supports subbyte precision. SNIP periodically collects statistics on activations, gradients, and optimizer states to assess the precision loss impact on model quality. We define two key metrics: loss divergence in the forward pass, caused by quantization-induced increases in training loss, and weight divergence in the backward pass, which measures error propagation through gradients affecting model updates. These metrics guide an Integer Linear Programming (ILP) problem that systematically optimizes layerwise precision to minimize overall quality loss while meeting efficiency targets. Experiments on 1B, 3B, 7B and 70B Llama-like models demonstrate that SNIP consistently outperforms existing baselines, reducing FLOPs by up to 80% while preserving model quality across different model sizes and training phases with minimal computational overhead.

[1193] Semi-supervised CAPP Transformer Learning via Pseudo-labeling

Dennis Gross, Helge Spieker, Arnaud Gotlieb, Emmanuel Stathatos, Panorios Benardos, George-Christopher Vosniakos

Main category: cs.LG

TL;DR: Semi-supervised learning approach for Computer-Aided Process Planning (CAPP) using transformer models with oracle filtering for data-scarce manufacturing environments

Details

Motivation: High-level CAPP suffers from limited dataset availability in industry, which reduces model generalization. There's a need to improve transformer-based CAPP models without requiring extensive manual labeling.

Method: Proposes a semi-supervised learning approach where an oracle trained on available transformer behavior data filters correct predictions from unseen parts. These filtered predictions are then used for one-shot retraining of the transformer model.

Result: Experiments on small-scale datasets with simulated ground truth across the full data distribution show consistent accuracy gains over baseline methods.

Conclusion: The method demonstrates effectiveness in data-scarce manufacturing environments by improving transformer-based CAPP models without manual labeling requirements.

Abstract: High-level Computer-Aided Process Planning (CAPP) generates manufacturing process plans from part specifications. It suffers from limited dataset availability in industry, reducing model generalization. We propose a semi-supervised learning approach to improve transformer-based CAPP transformer models without manual labeling. An oracle, trained on available transformer behaviour data, filters correct predictions from unseen parts, which are then used for one-shot retraining. Experiments on small-scale datasets with simulated ground truth across the full data distribution show consistent accuracy gains over baselines, demonstrating the method’s effectiveness in data-scarce manufacturing environments.

[1194] Improve the Trade-off Between Watermark Strength and Speculative Sampling Efficiency for Language Models

Weiqing He, Xiang Li, Li Shen, Weijie Su, Qi Long

Main category: cs.LG

TL;DR: A principled approach that unites speculative sampling and watermarking by injecting pseudorandomness into draft-token acceptance, enabling maximal watermark strength while maintaining inference efficiency.

Details

Motivation: Current watermarking methods for LLMs suffer from inference inefficiency, while speculative sampling accelerates inference but faces a trade-off with watermark strength. The paper aims to resolve this fundamental conflict between watermarking and inference efficiency.

Method: Introduces a quantitative measure of watermark strength based on statistical detectability, formulates the trade-off as a constrained optimization problem, derives Pareto curves for existing schemes, and proposes a mechanism that injects pseudorandomness into draft-token acceptance.

Result: The approach achieves maximal watermark strength while maintaining speculative sampling efficiency, improving detectability without sacrificing inference speed.

Conclusion: The paper uncovers a principle that unites speculative sampling and watermarking, enabling their efficient and practical deployment by resolving the fundamental trade-off between watermark strength and inference efficiency.

Abstract: Watermarking is a principled approach for tracing the provenance of large language model (LLM) outputs, but its deployment in practice is hindered by inference inefficiency. Speculative sampling accelerates inference, with efficiency improving as the acceptance rate between draft and target models increases. Yet recent work reveals a fundamental trade-off: higher watermark strength reduces acceptance, preventing their simultaneous achievement. We revisit this trade-off and show it is not absolute. We introduce a quantitative measure of watermark strength that governs statistical detectability and is maximized when tokens are deterministic functions of pseudorandom numbers. Using this measure, we fully characterize the trade-off as a constrained optimization problem and derive explicit Pareto curves for two existing watermarking schemes. Finally, we introduce a principled mechanism that injects pseudorandomness into draft-token acceptance, ensuring maximal watermark strength while maintaining speculative sampling efficiency. Experiments further show that this approach improves detectability without sacrificing efficiency. Our findings uncover a principle that unites speculative sampling and watermarking, paving the way for their efficient and practical deployment.

[1195] DCD: Decomposition-based Causal Discovery from Autocorrelated and Non-Stationary Temporal Data

Muhammad Hasan Ferdous, Md Osman Gani

Main category: cs.LG

TL;DR: A decomposition-based causal discovery framework for multivariate time series that separates data into trend, seasonal, and residual components for component-specific causal analysis, improving accuracy under non-stationarity and autocorrelation.

Details

Motivation: Multivariate time series in domains like finance, climate science, and healthcare exhibit complex patterns including long-term trends, seasonal patterns, and short-term fluctuations. Existing causal discovery methods operating on raw observations are vulnerable to spurious edges and misattributed temporal dependencies due to non-stationarity and autocorrelation.

Method: The framework decomposes each time series into trend, seasonal, and residual components. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The component-level graphs are then integrated into a unified multi-scale causal structure.

Result: The approach more accurately recovers ground-truth causal structure than state-of-the-art baselines across extensive synthetic benchmarks and real-world climate data, particularly under strong non-stationarity and temporal autocorrelation.

Conclusion: The decomposition-based framework isolates long- and short-range causal effects, reduces spurious associations, improves interpretability, and provides more accurate causal discovery for complex time series data with non-stationarity and autocorrelation.

Abstract: Multivariate time series in domains such as finance, climate science, and healthcare often exhibit long-term trends, seasonal patterns, and short-term fluctuations, complicating causal inference under non-stationarity and autocorrelation. Existing causal discovery methods typically operate on raw observations, making them vulnerable to spurious edges and misattributed temporal dependencies. We introduce a decomposition-based causal discovery framework that separates each time series into trend, seasonal, and residual components and performs component-specific causal analysis. Trend components are assessed using stationarity tests, seasonal components using kernel-based dependence measures, and residual components using constraint-based causal discovery. The resulting component-level graphs are integrated into a unified multi-scale causal structure. This approach isolates long- and short-range causal effects, reduces spurious associations, and improves interpretability. Across extensive synthetic benchmarks and real-world climate data, our framework more accurately recovers ground-truth causal structure than state-of-the-art baselines, particularly under strong non-stationarity and temporal autocorrelation.

[1196] Phase Transitions for Feature Learning in Neural Networks

Andrea Montanari, Zihao Wang

Main category: cs.LG

TL;DR: Theoretical analysis of gradient descent dynamics in two-layer neural networks learning multi-index models, identifying a threshold δ_NN for successful feature learning in proportional asymptotics.

Details

Motivation: To formalize the phenomenon where neural networks first identify low-dimensional representations then fit models, specifically studying feature learning in multi-index models where responses depend only on k-dimensional projections of covariates.

Method: Analyzes gradient descent dynamics of two-layer neural networks under proportional asymptotics (n,d→∞, n/d→δ) with fixed latent dimension k and hidden neurons m. Studies phase transitions in Hessian spectrum during training.

Result: Derives threshold δ_NN for two-layer networks analogous to earlier algorithmic thresholds δ_alg. Shows feature learning is possible above δ_NN and characterizes training dynamics with two phases: initial gradient-dominated learning followed by Hessian-dominated dynamics.

Conclusion: The threshold δ_NN provides a theoretical characterization of when two-layer networks can successfully learn features in multi-index models, enabling future study of architecture and algorithm dependencies.

Abstract: According to a popular viewpoint, neural networks learn from data by first identifying low-dimensional representations, and subsequently fitting the best model in this space. Recent works provide a formalization of this phenomenon when learning multi-index models. In this setting, we are given $n$ i.i.d. pairs $({\boldsymbol x}i,y_i)$, where the covariate vectors ${\boldsymbol x}i\in\mathbb{R}^d$ are isotropic, and responses $y_i$ only depend on ${\boldsymbol x}i$ through a $k$-dimensional projection ${\boldsymbol Θ}^{\sf T}{\boldsymbol x}i$. Feature learning amounts to learning the latent space spanned by ${\boldsymbol Θ}$. In this context, we study the gradient descent dynamics of two-layer neural networks under the proportional asymptotics $n,d\to\infty$, $n/d\toδ$, while the dimension of the latent space $k$ and the number of hidden neurons $m$ are kept fixed. Earlier work establishes that feature learning via polynomial-time algorithms is possible if $δ> δ{\text{alg}}$, for $δ{\text{alg}}$ a threshold depending on the data distribution, and is impossible (within a certain class of algorithms) below $δ_{\text{alg}}$. Here we derive an analogous threshold $δ_{\text{NN}}$ for two-layer networks. Our characterization of $δ_{\text{NN}}$ opens the way to study the dependence of learning dynamics on the network architecture and training algorithm. The threshold $δ_{\text{NN}}$ is determined by the following scenario. Training first visits points for which the gradient of the empirical risk is large and learns the directions spanned by these gradients. Then the gradient becomes smaller and the dynamics becomes dominated by negative directions of the Hessian. The threshold $δ_{\text{NN}}$ corresponds to a phase transition in the spectrum of the Hessian in this second phase.

[1197] Theoretical Analysis of Measure Consistency Regularization for Partially Observed Data

Yinsong Wang, Shahin Shahrampour

Main category: cs.LG

TL;DR: Theoretical analysis of Measure Consistency Regularization (MCR) for handling corrupted/missing data, showing when and why it improves imputation quality, with a novel early stopping protocol based on duality gap monitoring.

Details

Motivation: Corrupted data, missing features, and missing modalities are persistent problems in machine learning. While MCR methods have shown empirical success for tasks like image inpainting and data imputation, there's limited theoretical understanding of why they work and when they provide generalization benefits.

Method: Theoretical analysis of MCR through the lens of neural network distance, identifying the term responsible for generalization advantage. Proposes a novel training protocol that monitors duality gap to determine optimal early stopping point that preserves generalization benefits.

Result: Theoretical insights show MCR’s generalization advantage is not always guaranteed, especially in imperfect training regimes. Empirical evidence supports theoretical claims, and the proposed early stopping protocol effectively preserves generalization benefits across different model architectures and data sources.

Conclusion: The paper provides fundamental theoretical understanding of MCR, explains when and why it enhances imputation quality, and offers practical guidance through duality gap monitoring for optimal training, making MCR more reliable for handling partial observability in real-world applications.

Abstract: The problem of corrupted data, missing features, or missing modalities continues to plague the modern machine learning landscape. To address this issue, a class of regularization methods that enforce consistency between imputed and fully observed data has emerged as a promising approach for improving model generalization, particularly in partially observed settings. We refer to this class of methods as Measure Consistency Regularization (MCR). Despite its empirical success in various applications, such as image inpainting, data imputation and semi-supervised learning, a fundamental understanding of the theoretical underpinnings of MCR remains limited. This paper bridges this gap by offering theoretical insights into why, when, and how MCR enhances imputation quality under partial observability, viewed through the lens of neural network distance. Our theoretical analysis identifies the term responsible for MCR’s generalization advantage and extends to the imperfect training regime, demonstrating that this advantage is not always guaranteed. Guided by these insights, we propose a novel training protocol that monitors the duality gap to determine an early stopping point that preserves the generalization benefit. We then provide detailed empirical evidence to support our theoretical claims and to show the effectiveness and accuracy of our proposed stopping condition. We further provide a set of real-world data simulations to show the versatility of MCR under different model architectures designed for different data sources.

[1198] TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse

Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn

Main category: cs.LG

TL;DR: Transformer-based value functions in RL suffer from attention collapse when scaled, but controlling attention entropy stabilizes training and enables effective scaling.

Details

Motivation: Despite the success of scaling in other ML domains, RL value functions remain small. Naive scaling of transformers for value functions causes instability and worse performance, prompting investigation into what prevents effective scaling.

Method: Proposes Transformer Q-Learning (TQL) which controls the entropy of attention scores to prevent attention collapse when scaling transformer-based value functions. This stabilization enables the use of larger models.

Result: TQL yields up to 43% performance improvement when scaling from smallest to largest network sizes, while prior methods suffer from performance degradation.

Conclusion: Attention collapse is the critical failure mode in scaling transformer value functions, and controlling attention entropy effectively stabilizes training, unlocking the scaling potential of transformers for RL value functions.

Abstract: Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions – including with a transformer architecture, which is known to be highly scalable – often results in learning instability and worse performance. In this work, we ask what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.

[1199] A Meta-Knowledge-Augmented LLM Framework for Hyperparameter Optimization in Time-Series Forecasting

Ons Saadallah, Mátyás andó, Tamás Gábor Orosz

Main category: cs.LG

TL;DR: LLM-AutoOpt: A hybrid hyperparameter optimization framework combining Bayesian Optimization with LLM-based contextual reasoning for time-series forecasting, using structured meta-knowledge to improve performance and interpretability.

Details

Motivation: Hyperparameter optimization is computationally expensive and difficult to interpret for time-series forecasting. Bayesian Optimization treats tasks independently with limited insight, while LLMs offer opportunities to incorporate structured prior knowledge and reasoning into optimization pipelines.

Method: LLM-AutoOpt combines BO with LLM-based contextual reasoning, encoding dataset meta-features, model descriptions, historical optimization outcomes, and target objectives as structured meta-knowledge within LLM prompts. BO initializes search to mitigate cold-start effects, enabling context-aware hyperparameter refinement with exposed reasoning.

Result: Experiments on multivariate time series forecasting benchmarks show LLM-AutoOpt achieves improved predictive performance and more interpretable optimization behavior compared to BO and LLM baselines without meta-knowledge.

Conclusion: The hybrid framework successfully integrates LLM reasoning with BO for more effective and interpretable hyperparameter optimization in time-series forecasting tasks.

Abstract: Hyperparameter optimization (HPO) plays a central role in the performance of deep learning models, yet remains computationally expensive and difficult to interpret, particularly for time-series forecasting. While Bayesian Optimization (BO) is a standard approach, it typically treats tuning tasks independently and provides limited insight into its decisions. Recent advances in large language models (LLMs) offer new opportunities to incorporate structured prior knowledge and reasoning into optimization pipelines. We introduce LLM-AutoOpt, a hybrid HPO framework that combines BO with LLM-based contextual reasoning. The framework encodes dataset meta-features, model descriptions, historical optimization outcomes, and target objectives as structured meta-knowledge within LLM prompts, using BO to initialize the search and mitigate cold-start effects. This design enables context-aware and stable hyperparameter refinement while exposing the reasoning behind optimization decisions. Experiments on a multivariate time series forecasting benchmark demonstrate that LLM-AutoOpt achieves improved predictive performance and more interpretable optimization behavior compared to BO and LLM baselines without meta-knowledge.

[1200] Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs

Idan Barnea, Orin Levy, Yishay Mansour

Main category: cs.LG

TL;DR: Multi-agent RL for reward-free exploration in tabular MDPs, analyzing tradeoff between number of learning phases and agents needed to learn dynamics.

Details

Motivation: Study cooperative multi-agent reinforcement learning in reward-free exploration settings where multiple agents jointly explore unknown MDPs to learn dynamics without observing rewards, focusing on the tradeoff between number of learning phases and number of agents.

Method: Phased learning framework where multiple agents independently interact with environment in each phase, each executing assigned policies and observing trajectories. Analysis focuses on tabular finite-horizon MDPs with theoretical characterization of phase-agent tradeoff.

Result: Identified sharp transition governed by horizon H: when number of learning phases equals H, algorithm uses Õ(S⁶H⁶A/ε²) agents for ε-approximation of dynamics. Lower bound shows any algorithm with ρ<H phases requires at least A^(H/ρ) agents for constant accuracy.

Conclusion: Essential to have order H learning phases when limiting number of agents to be polynomial, establishing fundamental tradeoff between exploration phases and agent count in multi-agent reward-free RL.

Abstract: We study cooperative multi-agent reinforcement learning in the setting of reward-free exploration, where multiple agents jointly explore an unknown MDP in order to learn its dynamics (without observing rewards). We focus on a tabular finite-horizon MDP and adopt a phased learning framework. In each learning phase, multiple agents independently interact with the environment. More specifically, in each learning phase, each agent is assigned a policy, executes it, and observes the resulting trajectory. Our primary goal is to characterize the tradeoff between the number of learning phases and the number of agents, especially when the number of learning phases is small. Our results identify a sharp transition governed by the horizon $H$. When the number of learning phases equals $H$, we present a computationally efficient algorithm that uses only $\tilde{O}(S^6 H^6 A / ε^2)$ agents to obtain an $ε$ approximation of the dynamics (i.e., yields an $ε$-optimal policy for any reward function). We complement our algorithm with a lower bound showing that any algorithm restricted to $ρ< H$ phases requires at least $A^{H/ρ}$ agents to achieve constant accuracy. Thus, we show that it is essential to have an order of $H$ learning phases if we limit the number of agents to be polynomial.

[1201] Modeling Topological Impact on Node Attribute Distributions in Attributed Graphs

Amirreza Shiralinasab Langari, Leila Yeganeh, Kim Khoa Nguyen

Main category: cs.LG

TL;DR: The paper introduces an algebraic framework to study how graph topology influences node attribute distributions, treating topology and attributes as interacting components, with applications to graph anomaly detection.

Details

Motivation: To understand how the topological structure of attributed graphs affects the distribution of node attributes, treating topology and attributes as distinct but interacting components rather than as a single unified entity.

Method: Develops a categorical framework to formalize how nodes perceive graph topology, quantifies these perspectives, and integrates them with node attribute distributions to create topology-influenced distributions. Introduces a simple testbed model (ID) and uses unsupervised graph anomaly detection as an evaluation task.

Result: The approach produces topology-conditioned distributions that approximate posteriors P(·|v) and P(·|G), and establishes a sufficiency condition showing that on complete graphs (with no informative structure), the construction recovers the original attribute distribution.

Conclusion: The paper provides a novel algebraic perspective on the interaction between graph topology and node attributes, offering a principled framework for understanding topological influences on attribute distributions with applications to graph analysis tasks.

Abstract: We investigate how the topology of attributed graphs influences the distribution of node attributes. This work offers a novel perspective by treating topology and attributes as structurally distinct but interacting components. We introduce an algebraic approach that combines a graph’s topology with the probability distribution of node attributes, resulting in topology-influenced distributions. First, we develop a categorical framework to formalize how a node perceives the graph’s topology. We then quantify this point of view and integrate it with the distribution of node attributes to capture topological effects. We interpret these topology-conditioned distributions as approximations of the posteriors $P(\cdot \mid v)$ and $P(\cdot \mid \mathcal{G})$. We further establish a principled sufficiency condition by showing that, on complete graphs, where topology carries no informative structure, our construction recovers the original attribute distribution. To evaluate our approach, we introduce an intentionally simple testbed model, $\textbf{ID}$, and use unsupervised graph anomaly detection as a probing task.

[1202] Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun

Main category: cs.LG

TL;DR: Rectified LpJEPA introduces a new regularization method (RDMReg) that enforces sparse, non-negative representations in joint-embedding predictive architectures by aligning representations to Rectified Generalized Gaussian distributions instead of isotropic Gaussians.

Details

Motivation: Existing JEPA approaches regularize representations towards isotropic Gaussian distributions, which favor dense representations and fail to capture the sparsity property observed in efficient biological and artificial representations. There's a need for methods that can explicitly control sparsity while preserving task-relevant information.

Method: Introduces Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected ℓ₀ norm through rectification while preserving maximum-entropy under expected ℓₚ norm constraints. This yields Rectified LpJEPA, which generalizes prior Gaussian-based JEPAs.

Result: Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and achieves competitive downstream performance on image classification benchmarks. The method effectively enforces sparsity while preserving task-relevant information.

Conclusion: RDMReg successfully addresses the limitation of existing JEPA approaches by enabling explicit sparsity control through rectified distribution matching, leading to more efficient representations that maintain competitive performance on vision tasks.

Abstract: Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while preserving maximum-entropy up to rescaling under expected $\ell_p$ norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.

[1203] A Statistical Theory of Gated Attention through the Lens of Hierarchical Mixture of Experts

Viet Nguyen, Tuan Minh Pham, Thinh Cao, Tan Dinh, Huy Nguyen, Nhat Ho, Alessandro Rinaldo

Main category: cs.LG

TL;DR: Gated attention improves Transformer performance by using hierarchical mixture of experts formulation, showing polynomial vs exponential sample efficiency compared to standard multi-head attention.

Details

Motivation: While gated attention has been empirically shown to improve Transformer performance and address attention sink issues, there's a lack of theoretical understanding about why it works better than standard multi-head self-attention.

Method: The authors mathematically reformulate both gated attention and multi-head self-attention as hierarchical mixture of experts models, then analyze them as expert estimation problems to compare their sample efficiency.

Result: Gated attention requires only polynomial number of data points to estimate experts, while multi-head self-attention needs exponentially many data points for the same estimation error, explaining gated attention’s superior sample efficiency.

Conclusion: The theoretical analysis provides justification for gated attention’s empirical benefits and explains why placing gates at specific positions (output of scaled dot product attention or value map) yields better performance.

Abstract: Self-attention has greatly contributed to the success of the widely used Transformer architecture by enabling learning from data with long-range dependencies. In an effort to improve performance, a gated attention model that leverages a gating mechanism within the multi-head self-attention has recently been proposed as a promising alternative. Gated attention has been empirically demonstrated to increase the expressiveness of low-rank mapping in standard attention and even to eliminate the attention sink phenomenon. Despite its efficacy, a clear theoretical understanding of gated attention’s benefits remains lacking in the literature. To close this gap, we rigorously show that each entry in a gated attention matrix or a multi-head self-attention matrix can be written as a hierarchical mixture of experts. By recasting learning as an expert estimation problem, we demonstrate that gated attention is more sample-efficient than multi-head self-attention. In particular, while the former needs only a polynomial number of data points to estimate an expert, the latter requires exponentially many data points to achieve the same estimation error. Furthermore, our analysis also provides a theoretical justification for why gated attention yields higher performance when a gate is placed at the output of the scaled dot product attention or the value map rather than at other positions in the multi-head self-attention architecture.

[1204] P-EAGLE: Parallel-Drafting EAGLE with Scalable Training

Mude Hui, Xin Huang, Jaime Campos Salas, Yue Sun, Nathan Pemberton, Xiang Song, Ashish Khetan, George Karypis

Main category: cs.LG

TL;DR: P-EAGLE transforms EAGLE from autoregressive to parallel multi-token prediction using learnable shared hidden states, enabling faster speculative decoding for long-context reasoning LLMs.

Details

Motivation: Reasoning LLMs produce longer outputs requiring speculative decoding, but parallel drafting training complexity scales quadratically with sequence length and parallel positions, making long-context training impractical.

Method: P-EAGLE transforms EAGLE from autoregressive to parallel multi-token prediction via learnable shared hidden states. Uses attention mask pre-computation and sequence partitioning techniques to enable gradient accumulation within sequences for parallel-prediction training.

Result: Implemented in vLLM with speedups of 1.10-1.36x over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B models.

Conclusion: P-EAGLE enables efficient parallel speculative decoding for long-context reasoning LLMs with practical training scalability.

Abstract: Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting - predicting multiple tokens per forward pass - offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10-1.36x over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.

[1205] Rod Flow: A Continuous-Time Model for Gradient Descent at the Edge of Stability

Eric Regis, Sinho Chewi

Main category: cs.LG

TL;DR: Rod Flow: A new ODE approximation for gradient descent dynamics based on a physical “rod” model that captures edge of stability phenomena better than previous approaches.

Details

Motivation: To better understand gradient-based training over non-convex landscapes, particularly the edge of stability phenomenon where gradient descent with large step sizes diverges from gradient flow. Existing approximations like Central Flow have limitations, so a more principled, accurate, and computationally efficient alternative is needed.

Method: Proposes Rod Flow, an ODE approximation derived from a physical picture of GD iterates as an extended one-dimensional object (a “rod”). This approach provides an explicit, cheap-to-compute model that captures GD dynamics in the edge of stability regime.

Result: Rod Flow better captures GD dynamics for simple toy examples and matches the accuracy of Central Flow for representative neural network architectures. Theoretically proven to correctly predict critical sharpness threshold and explain self-stabilization in quartic potentials, validated through numerical experiments.

Conclusion: Rod Flow provides a principled, accurate, and computationally efficient alternative to Central Flow for understanding gradient descent dynamics in non-convex optimization, particularly in the edge of stability regime.

Abstract: How can we understand gradient-based training over non-convex landscapes? The edge of stability phenomenon, introduced in Cohen et al. (2021), indicates that the answer is not so simple: namely, gradient descent (GD) with large step sizes often diverges away from the gradient flow. In this regime, the “Central Flow”, recently proposed in Cohen et al. (2025), provides an accurate ODE approximation to the GD dynamics over many architectures. In this work, we propose Rod Flow, an alternative ODE approximation, which carries the following advantages: (1) it rests on a principled derivation stemming from a physical picture of GD iterates as an extended one-dimensional object – a “rod”; (2) it better captures GD dynamics for simple toy examples and matches the accuracy of Central Flow for representative neural network architectures, and (3) is explicit and cheap to compute. Theoretically, we prove that Rod Flow correctly predicts the critical sharpness threshold and explains self-stabilization in quartic potentials. We validate our theory with a range of numerical experiments.

[1206] Causal Preference Elicitation

Edwin V. Bonilla, He Zhao, Daniel M. Steinberg

Main category: cs.LG

TL;DR: Bayesian framework for active causal discovery that queries experts about edge relations to concentrate posterior over DAGs

Details

Motivation: Causal discovery from observational data is challenging, and expert knowledge can help but is often underutilized. The paper aims to develop an active learning framework that efficiently queries experts about local edge relations to improve causal structure learning.

Method: Proposes causal preference elicitation: a Bayesian framework that models expert judgments with three-way likelihood (edge existence/direction), uses particle approximation for posterior inference, and selects queries via expected information gain criterion on categorical expert responses.

Result: Experiments on synthetic graphs, protein signaling data, and human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under limited query budgets.

Conclusion: The framework effectively integrates expert knowledge through active querying to accelerate causal discovery and improve accuracy of directed causal effects.

Abstract: We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge relations to concentrate a posterior over directed acyclic graphs (DAGs). From any black-box observational posterior, we model noisy expert judgments with a three-way likelihood over edge existence and direction. Posterior inference uses a flexible particle approximation, and queries are selected by an efficient expected information gain criterion on the expert’s categorical response. Experiments on synthetic graphs, protein signaling data, and a human gene perturbation benchmark show faster posterior concentration and improved recovery of directed effects under tight query budgets.

[1207] Predicting and improving test-time scaling laws via reward tail-guided search

Muheng Li, Jian Qian, Wenlong Mou

Main category: cs.LG

TL;DR: A test-time scaling method called Scaling-Law Guided (SLG) Search that uses tail distribution estimation to predict LLM scaling laws and dynamically allocate compute for better reasoning performance than Best-of-N.

Details

Motivation: Best-of-N strategy for test-time scaling lacks principled guidance on N selection, budget allocation, and multi-stage decision-making, leaving room for optimization with limited theoretical guarantees.

Method: Proposes tail-guided search that estimates reward tail distributions to predict scaling laws without exhaustive evaluations, then uses Scaling-Law Guided (SLG) Search to dynamically allocate compute to identify high-potential intermediate states.

Result: Theoretically proves SLG achieves vanishing regret compared to perfect-information oracles and achieves expected rewards requiring polynomially larger compute with Best-of-N. Empirically validates across LLMs and reward models showing higher reward yields than Best-of-N under identical budgets.

Conclusion: Tail-guided allocation consistently outperforms Best-of-N, providing a principled approach to test-time scaling with theoretical guarantees and practical improvements in LLM reasoning capabilities.

Abstract: Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straight-forward ‘‘best-of-$N$’’ (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget required when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.

[1208] OpInf-LLM: Parametric PDE Solving with LLMs via Operator Inference

Zhuoyuan Wang, Hanjiang Hu, Xiyu Deng, Saviz Mowlavi, Yorie Nakahira

Main category: cs.LG

TL;DR: OpInf-LLM: An LLM-based framework that combines operator inference with large language models to solve parametric PDEs using minimal solution data, enabling accurate predictions for unseen parameters and configurations.

Details

Motivation: While LLMs show strong capabilities in code generation and symbolic reasoning, reliably solving diverse PDEs across heterogeneous settings remains challenging. There's a persistent trade-off between execution success rate and numerical accuracy, especially when generalizing to unseen parameters and boundary conditions.

Method: Proposes OpInf-LLM, an LLM parametric PDE solving framework based on operator inference. The framework leverages small amounts of solution data to enable accurate prediction of diverse PDE instances, provides natural language specification of PDE solving tasks, and offers low computational demands with unified tool interface.

Result: The framework enables accurate prediction of diverse PDE instances including unseen parameters and configurations, achieves high execution success rate across heterogeneous settings, and opens new possibilities for generalizable reduced-order modeling in LLM-based PDE solving.

Conclusion: OpInf-LLM combines operator inference with LLM capabilities to address the trade-off between execution success rate and numerical accuracy in PDE solving, offering a promising approach for generalizable reduced-order modeling with natural language interfaces.

Abstract: Solving diverse partial differential equations (PDEs) is fundamental in science and engineering. Large language models (LLMs) have demonstrated strong capabilities in code generation, symbolic reasoning, and tool use, but reliably solving PDEs across heterogeneous settings remains challenging. Prior work on LLM-based code generation and transformer-based foundation models for PDE learning has shown promising advances. However, a persistent trade-off between execution success rate and numerical accuracy arises, particularly when generalization to unseen parameters and boundary conditions is required. In this work, we propose OpInf-LLM, an LLM parametric PDE solving framework based on operator inference. The proposed framework leverages a small amount of solution data to enable accurate prediction of diverse PDE instances, including unseen parameters and configurations, and provides seamless integration with LLMs for natural language specification of PDE solving tasks. Its low computational demands and unified tool interface further enable a high execution success rate across heterogeneous settings. By combining operator inference with LLM capabilities, OpInf-LLM opens new possibilities for generalizable reduced-order modeling in LLM-based PDE solving.

[1209] Multi-Scale Wavelet Transformers for Operator Learning of Dynamical Systems

Xuesong Wang, Michael Groom, Rafael Oliveira, He Zhao, Terence O’Kane, Edwin V. Bonilla

Main category: cs.LG

TL;DR: MSWTs (multi-scale wavelet transformers) address spectral bias in neural operators for dynamical systems by learning dynamics in tokenized wavelet domain, improving high-frequency retention and long-horizon stability for weather forecasting.

Details

Motivation: Neural operators for dynamical systems suffer from spectral bias that attenuates high-frequency components, which is particularly damaging for applications like weather forecasting where misrepresented high frequencies cause long-horizon instability.

Method: Propose multi-scale wavelet transformers (MSWTs) that learn system dynamics in a tokenized wavelet domain. Use wavelet transform to separate low- and high-frequency content across scales, employ wavelet-preserving downsampling to retain high-frequency features, and utilize wavelet-based attention to capture dependencies across scales and frequency bands.

Result: Experiments on chaotic dynamical systems show substantial error reductions and improved long horizon spectral fidelity. On ERA5 climate reanalysis, MSWTs further reduce climatological bias, demonstrating effectiveness in real-world forecasting.

Conclusion: MSWTs effectively address spectral bias in neural operators for dynamical systems, particularly improving high-frequency retention and long-horizon stability for weather forecasting applications.

Abstract: Recent years have seen a surge in data-driven surrogates for dynamical systems that can be orders of magnitude faster than numerical solvers. However, many machine learning-based models such as neural operators exhibit spectral bias, attenuating high-frequency components that often encode small-scale structure. This limitation is particularly damaging in applications such as weather forecasting, where misrepresented high frequencies can induce long-horizon instability. To address this issue, we propose multi-scale wavelet transformers (MSWTs), which learn system dynamics in a tokenized wavelet domain. The wavelet transform explicitly separates low- and high-frequency content across scales. MSWTs leverage a wavelet-preserving downsampling scheme that retains high-frequency features and employ wavelet-based attention to capture dependencies across scales and frequency bands. Experiments on chaotic dynamical systems show substantial error reductions and improved long horizon spectral fidelity. On the ERA5 climate reanalysis, MSWTs further reduce climatological bias, demonstrating their effectiveness in a real-world forecasting setting.

[1210] Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum

Navdeep Kumar, Tehila Dahan, Lior Cohen, Ananyabrata Barua, Giorgia Ramponi, Kfir Yehuda Levy, Shie Mannor

Main category: cs.LG

TL;DR: Single-timescale actor-critic algorithm achieves optimal O(ε⁻²) sample complexity for ε-optimal policy in discounted MDPs, improving from previous O(ε⁻³), using STORM variance reduction and buffer sampling.

Details

Motivation: Prior actor-critic algorithms for infinite-horizon discounted MDPs had suboptimal O(ε⁻³) sample complexity. The authors aim to achieve the optimal O(ε⁻²) rate using a single-timescale approach while maintaining practical applicability with deep learning architectures.

Method: Combines STORM (STOchastic Recursive Momentum) for variance reduction in critic updates with a buffer mechanism that stores recent samples from the evolving policy’s nonstationary occupancy measure. Critic updates uniformly sample from this buffer to address the challenge of nonstationary data distribution.

Result: Achieves optimal O(ε⁻²) sample complexity for obtaining ε-optimal global policy in discounted MDPs with finite state-action spaces, improving upon prior O(ε⁻³) state-of-the-art. The approach maintains compatibility with deep learning architectures.

Conclusion: The proposed single-timescale actor-critic algorithm with STORM variance reduction and buffer sampling achieves optimal sample complexity while requiring only minor modifications to existing architectures, preserving practical applicability.

Abstract: We establish an optimal sample complexity of $O(ε^{-2})$ for obtaining an $ε$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(ε^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.

[1211] Enhancing Generalization in Evolutionary Feature Construction for Symbolic Regression through Vicinal Jensen Gap Minimization

Hengzhe Zhang, Qi Chen, Bing Xue, Wolfgang Banzhaf, Mengjie Zhang

Main category: cs.LG

TL;DR: Genetic programming feature construction framework with vicinal risk regularization using Jensen gap minimization and manifold intrusion detection to control overfitting.

Details

Motivation: Genetic programming-based feature construction has shown success but suffers from overfitting, limiting broader applicability. The paper aims to improve generalization by controlling overfitting through theoretical analysis of vicinal risk bounds.

Method: Proves vicinal risk is bounded by empirical risk plus regularization term (finite difference or vicinal Jensen gap). Develops evolutionary feature construction framework jointly optimizing empirical risk and Jensen gap. Includes noise estimation for dynamic regularization adjustment and manifold intrusion detection to avoid unrealistic augmented samples.

Result: Experimental results on 58 datasets show Jensen gap minimization outperforms other complexity measures. Comparisons with 15 ML algorithms demonstrate superior performance of genetic programming with the proposed overfitting control strategy.

Conclusion: The proposed framework effectively controls overfitting in genetic programming-based feature construction through vicinal Jensen gap regularization and manifold intrusion detection, achieving better generalization performance.

Abstract: Genetic programming-based feature construction has achieved significant success in recent years as an automated machine learning technique to enhance learning performance. However, overfitting remains a challenge that limits its broader applicability. To improve generalization, we prove that vicinal risk, estimated through noise perturbation or mixup-based data augmentation, is bounded by the sum of empirical risk and a regularization term-either finite difference or the vicinal Jensen gap. Leveraging this decomposition, we propose an evolutionary feature construction framework that jointly optimizes empirical risk and the vicinal Jensen gap to control overfitting. Since datasets may vary in noise levels, we develop a noise estimation strategy to dynamically adjust regularization strength. Furthermore, to mitigate manifold intrusion-where data augmentation may generate unrealistic samples that fall outside the data manifold-we propose a manifold intrusion detection mechanism. Experimental results on 58 datasets demonstrate the effectiveness of Jensen gap minimization compared to other complexity measures. Comparisons with 15 machine learning algorithms further indicate that genetic programming with the proposed overfitting control strategy achieves superior performance.

[1212] White-Box Neural Ensemble for Vehicular Plasticity: Quantifying the Efficiency Cost of Symbolic Auditability in Adaptive NMPC

Enzo Nicolas Spotorno, Matheus Wagner, Antonio Augusto Medeiros Frohlich

Main category: cs.LG

TL;DR: White-box adaptive NMPC architecture using Modular Sovereignty paradigm to arbitrate among frozen neural specialists for vehicular plasticity, with symbolic graph representation for auditability.

Details

Motivation: To address vehicular plasticity - adaptation to varying operating regimes without retraining - while maintaining white-box auditability and transparency in neural model predictive control systems.

Method: Uses Modular Sovereignty paradigm to arbitrate among frozen, regime-specific neural specialists. Ensemble dynamics are maintained as a fully traversable symbolic graph in CasADi for maximal runtime auditability.

Result: Validates rapid adaptation (~7.3 ms) and near-ideal tracking fidelity under compound regime shifts (friction, mass, drag) where non-adaptive baselines fail. Shows transparency cost: symbolic graph maintenance increases solver latency by 72-102X versus compiled parametric physics models.

Conclusion: The architecture successfully resolves vehicular plasticity with white-box auditability, but at significant computational cost, establishing the efficiency price of strict white-box implementation.

Abstract: We present a white-box adaptive NMPC architecture that resolves vehicular plasticity (adaptation to varying operating regimes without retraining) by arbitrating among frozen, regime-specific neural specialists using a Modular Sovereignty paradigm. The ensemble dynamics are maintained as a fully traversable symbolic graph in CasADi, enabling maximal runtime auditability. Synchronous simulation validates rapid adaptation (~7.3 ms) and near-ideal tracking fidelity under compound regime shifts (friction, mass, drag) where non-adaptive baselines fail. Empirical benchmarking quantifies the transparency cost: symbolic graph maintenance increases solver latency by 72-102X versus compiled parametric physics models, establishing the efficiency price of strict white-box implementation.

[1213] You Need an Encoder for Native Position-Independent Caching

Shiju Zhao, Junhao Hu, Jiaqi Zheng, Guihai Chen

Main category: cs.LG

TL;DR: COMB introduces native Position-Independent Caching (PIC) by adding an encoder to decoder-only LLMs, enabling efficient KV cache reuse without positional constraints, reducing latency and increasing throughput.

Details

Motivation: Current KV cache in LLMs is prefix-based and inefficient for processing contexts retrieved in arbitrary order. Existing PIC approaches suffer from accuracy degradation, limiting practical adoption.

Method: Proposes native PIC by reintroducing an encoder to decoder-only LLMs and explicitly training it to support PIC. Develops COMB, a PIC-aware caching system that integrates with existing inference frameworks.

Result: COMB reduces Time-to-First-Token by 51-94% and increases throughput by 3× with comparable accuracy. Quality improvement demonstrated with DeepSeek-V2-Lite-Chat shows applicability to other decoder-only LLMs.

Conclusion: COMB enables efficient position-independent KV caching with minimal accuracy loss, making it practical for real-world deployment and applicable to various decoder-only LLM architectures.

Abstract: The Key-Value (KV) cache of Large Language Models (LLMs) is prefix-based, making it highly inefficient for processing contexts retrieved in arbitrary order. Position-Independent Caching (PIC) has been proposed to enable KV reuse without positional constraints; however, existing approaches often incur substantial accuracy degradation, limiting their practical adoption. To address this issue, we propose native PIC by reintroducing the encoder to prevalent decoder-only LLMs and explicitly training it to support PIC. We further develop COMB, a PIC-aware caching system that integrates seamlessly with existing inference frameworks. Experimental results show that COMB reduces Time-to-First-Token (TTFT) by 51-94% and increases throughput by 3$\times$ with comparable accuracy. Furthermore, the quality improvement when using DeepSeek-V2-Lite-Chat demonstrates the applicability of COMB to other types of decoder-only LLMs. Our code is available at https://github.com/shijuzhao/Comb.

[1214] When Is Rank-1 Enough? Geometry-Guided Initialization for Parameter-Efficient Fine-Tuning

Haoran Zhao, Soyeon Caren Han, Eduard Hovy

Main category: cs.LG

TL;DR: Gap-Init: A geometry-aware initialization method that stabilizes rank-1 LoRA fine-tuning for multimodal LLMs by aligning with the modality-gap direction.

Details

Motivation: Parameter-efficient fine-tuning (PEFT) with extremely low-rank LoRA (especially rank-1) is often unstable for multimodal large language models, and this instability is not just due to limited capacity but rather optimization sensitivity to update direction.

Method: Analyzes pretrained vision and text features to identify a modality-gap axis that dominates early gradient flow. Proposes Gap-Init initialization that aligns rank-1 LoRA direction with estimated modality-gap vector from a small calibration set while keeping initial LoRA update zero.

Result: Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines.

Conclusion: At the extreme low-rank limit, initial alignment can matter as much as rank itself, and geometry-aware initialization can effectively stabilize rank-1 LoRA training for multimodal LLMs.

Abstract: Parameter-efficient fine-tuning (PEFT) is a standard way to adapt multimodal large language models, yet extremely low-rank settings – especially rank-1 LoRA – are often unstable. We show that this instability is not solely due to limited capacity: in the rank-1 regime, optimization is highly sensitive to the update direction. Concretely, pretrained vision and text features form mismatched anisotropic regions, yielding a dominant “gap” direction that acts like a translation component and disproportionately steers early gradients under rank-1 constraints. Analyzing pretrained representations, we identify a modality-gap axis that dominates early gradient flow, while a random rank-1 initialization is unlikely to align with it, leading to weak gradients and training collapse. We propose Gap-Init, a geometry-aware initialization that aligns the rank-1 LoRA direction with an estimated modality-gap vector from a small calibration set, while keeping the initial LoRA update zero. Across multiple vision-language tasks and backbones, Gap-Init consistently stabilizes rank-1 training and can match or outperform strong rank-8 baselines. Our results suggest that at the extreme low-rank limit, initial alignment can matter as much as rank itself.

[1215] The Inlet Rank Collapse in Implicit Neural Representations: Diagnosis and Unified Remedy

Jianqiao Zheng, Hemanth Saratchandran, Simon Lucey

Main category: cs.LG

TL;DR: The paper introduces a structural diagnostic framework analyzing Implicit Neural Representations’ limitations through layer-wise NTK decomposition, identifying “Inlet Rank Collapse” as the core issue, and proposes Rank-Expanding Initialization as a principled solution.

Details

Motivation: INRs struggle with fine-grained detail recovery within finite training budgets. While empirical techniques like positional encoding, sinusoidal activations, and batch normalization help, their theoretical justifications are post hoc. The authors aim to provide a unified theoretical framework to understand and address INR limitations.

Method: The authors develop a structural diagnostic framework using layer-wise decomposition of the Neural Tangent Kernel (NTK). They mathematically identify “Inlet Rank Collapse” - a phenomenon where low-dimensional input coordinates fail to span high-dimensional embedding space, creating a rank deficiency bottleneck. This framework reinterprets existing techniques as different forms of rank restoration.

Result: The framework provides a unified perspective on existing techniques and leads to the derivation of Rank-Expanding Initialization, a minimalist remedy that ensures representation rank scales with layer width without architectural changes or computational overhead.

Conclusion: The key to empowering INRs lies in structural optimization of initial rank propagation to effectively populate latent space. The proposed principled remedy enables standard MLPs to achieve high-fidelity reconstructions.

Abstract: Implicit Neural Representations (INRs) have revolutionized continuous signal modeling, yet they struggle to recover fine-grained details within finite training budgets. While empirical techniques, such as positional encoding (PE), sinusoidal activations (SIREN), and batch normalization (BN), effectively mitigate this, their theoretical justifications are predominantly post hoc, focusing on the global NTK spectrum only after modifications are applied. In this work, we reverse this paradigm by introducing a structural diagnostic framework. By performing a layer-wise decomposition of the NTK, we mathematically identify the ``Inlet Rank Collapse’’: a phenomenon where the low-dimensional input coordinates fail to span the high-dimensional embedding space, creating a fundamental rank deficiency at the first layer that acts as an expressive bottleneck for the entire network. This framework provides a unified perspective to re-interpret PE, SIREN, and BN as different forms of rank restoration. Guided by this diagnosis, we derive a Rank-Expanding Initialization, a minimalist remedy that ensures the representation rank scales with the layer width without architectural modifications or computational overhead. Our results demonstrate that this principled remedy enables standard MLPs to achieve high-fidelity reconstructions, proving that the key to empowering INRs lies in the structural optimization of the initial rank propagation to effectively populate the latent space.

[1216] Plain Transformers are Surprisingly Powerful Link Predictors

Quang Truong, Yu Song, Donald Loveland, Mingxuan Ju, Tong Zhao, Neil Shah, Jiliang Tang

Main category: cs.LG

TL;DR: PENCIL is a plain Transformer encoder for link prediction that uses attention over sampled local subgraphs instead of complex structural encodings, achieving competitive performance with better scalability and parameter efficiency.

Details

Motivation: Current link prediction methods rely on GNNs with explicit structural heuristics or memory-intensive node embeddings, which struggle with generalization and scalability. Graph Transformers offer an alternative but have significant overhead from complex structural encodings.

Method: PENCIL uses an encoder-only plain Transformer architecture that replaces hand-crafted priors with attention over sampled local subgraphs. It maintains the scalability and hardware efficiency of standard Transformers while implicitly capturing structural signals.

Result: PENCIL outperforms heuristic-informed GNNs and is more parameter-efficient than ID-embedding-based alternatives. It remains competitive across diverse benchmarks even without node features, demonstrating that simple design choices can achieve strong capabilities.

Conclusion: The results challenge the prevailing reliance on complex engineering techniques for link prediction, showing that plain Transformers with attention over local subgraphs can effectively capture structural dependencies while maintaining scalability and efficiency.

Abstract: Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. While Graph Neural Networks (GNNs) are the standard solution, state-of-the-art pipelines often rely on explicit structural heuristics or memory-intensive node embeddings – approaches that struggle to generalize or scale to massive graphs. Emerging Graph Transformers (GTs) offer a potential alternative but often incur significant overhead due to complex structural encodings, hindering their applications to large-scale link prediction. We challenge these sophisticated paradigms with PENCIL, an encoder-only plain Transformer that replaces hand-crafted priors with attention over sampled local subgraphs, retaining the scalability and hardware efficiency of standard Transformers. Through experimental and theoretical analysis, we show that PENCIL extracts richer structural signals than GNNs, implicitly generalizing a broad class of heuristics and subgraph-based expressivity. Empirically, PENCIL outperforms heuristic-informed GNNs and is far more parameter-efficient than ID-embedding–based alternatives, while remaining competitive across diverse benchmarks – even without node features. Our results challenge the prevailing reliance on complex engineering techniques, demonstrating that simple design choices are potentially sufficient to achieve the same capabilities.

[1217] InfoTok: Regulating Information Flow for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li

Main category: cs.LG

TL;DR: InfoTok introduces an information-regularized visual tokenization mechanism based on Information Bottleneck principle for unified multimodal LLMs, prioritizing reusable structure over high-entropy variations to improve both understanding and generation tasks.

Details

Motivation: Existing shared-token designs in unified MLLMs lack explicit criteria for what information tokens should preserve to support both understanding and generation. The paper introduces a capacity-constrained perspective, viewing the visual tokenizer as a compute-bounded learner that should prioritize reusable structure over hard-to-exploit high-entropy variations.

Method: Proposes InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck principle. It formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, using mutual-information regularization to achieve a principled trade-off between compression and task relevance.

Result: Experiments integrating InfoTok into three representative unified MLLMs without additional training data show consistent improvements on both understanding and generation tasks.

Conclusion: Information-regularized tokenization serves as a principled foundation for learning a shared token space in unified MLLMs, supporting both visual understanding and generation capabilities.

Abstract: Unified multimodal large language models (MLLMs) integrate image understanding and generation in a single framework, with the visual tokenizer acting as the sole interface that maps visual inputs into tokens for downstream tasks. However, existing shared-token designs are mostly architecture-driven and lack an explicit criterion for what information tokens should preserve to support both understanding and generation. Therefore, we introduce a capacity-constrained perspective, highlighting that in shared-token unified MLLMs the visual tokenizer behaves as a compute-bounded learner, so the token budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this perspective, we propose InfoTok, an information-regularized visual tokenization mechanism grounded in the Information Bottleneck (IB) principle. InfoTok formulates tokenization as controlling information flow from images to shared tokens to multimodal outputs, yielding a principled trade-off between compression and task relevance via mutual-information regularization. We integrate InfoTok into three representative unified MLLMs without introducing any additional training data. Experiments show consistent improvements on both understanding and generation, supporting information-regularized tokenization as a principled foundation for learning a shared token space in unified MLLMs.

[1218] How Implicit Bias Accumulates and Propagates in LLM Long-term Memory

Yiming Ma, Lixu Wang, Lionel Z. Wang, Hongkun Yang, Haoming Sun, Xin Xu, Jiaqi Wu, Bin Chen, Wei Dong

Main category: cs.LG

TL;DR: The paper studies how implicit bias accumulates and propagates in LLMs with long-term memory, introduces a benchmark for evaluation, and proposes a dynamic memory tagging intervention to mitigate bias.

Details

Motivation: Long-term memory mechanisms in LLMs enable continuity and personalization but introduce underexplored fairness risks, particularly how implicit bias accumulates and propagates over extended interactions.

Method: Introduces Decision-based Implicit Bias (DIB) Benchmark with 3,776 decision-making scenarios across nine social domains; evaluates six LLMs with three memory architectures using long-horizon simulation; proposes Dynamic Memory Tagging (DMT) intervention that enforces fairness constraints at memory write time.

Result: LLMs’ implicit bias intensifies over time and propagates across unrelated domains; static system-level prompting provides limited debiasing; Dynamic Memory Tagging substantially reduces bias accumulation and curtails cross-domain bias propagation.

Conclusion: Long-term memory in LLMs poses significant fairness risks through bias accumulation and propagation, requiring dynamic interventions like DMT rather than static approaches for effective mitigation.

Abstract: Long-term memory mechanisms enable Large Language Models (LLMs) to maintain continuity and personalization across extended interaction lifecycles, but they also introduce new and underexplored risks related to fairness. In this work, we study how implicit bias, defined as subtle statistical prejudice, accumulates and propagates within LLMs equipped with long-term memory. To support systematic analysis, we introduce the Decision-based Implicit Bias (DIB) Benchmark, a large-scale dataset comprising 3,776 decision-making scenarios across nine social domains, designed to quantify implicit bias in long-term decision processes. Using a realistic long-horizon simulation framework, we evaluate six state-of-the-art LLMs integrated with three representative memory architectures on DIB and demonstrate that LLMs’ implicit bias does not remain static but intensifies over time and propagates across unrelated domains. We further analyze mitigation strategies and show that a static system-level prompting baseline provides limited and short-lived debiasing effects. To address this limitation, we propose Dynamic Memory Tagging (DMT), an agentic intervention that enforces fairness constraints at memory write time. Extensive experimental results show that DMT substantially reduces bias accumulation and effectively curtails cross-domain bias propagation.

[1219] Generative Visual Code Mobile World Models

Woosung Koh, Sungjun Han, Segyu Lee, Se-Young Yun, Jamin Shin

Main category: cs.LG

TL;DR: gWorld introduces visual world modeling via renderable code generation for mobile GUI agents, where a VLM predicts next GUI state as executable web code rather than pixels, combining precise text rendering with high-fidelity visual generation.

Details

Motivation: Current mobile GUI world models face trade-offs: text-based models sacrifice visual fidelity, while visual models struggle with precise text rendering and rely on slow, complex pipelines with external models. There's a need for a solution that combines precise text rendering with high-fidelity visual generation.

Method: Proposes visual world modeling via renderable code generation where a single Vision-Language Model predicts the next GUI state as executable web code that renders to pixels. Introduces gWorld models (8B, 32B) and a data generation framework that automatically synthesizes code-based training data.

Result: gWorld sets new pareto frontier in accuracy vs model size, outperforming 8 frontier open-weight models over 50.25x larger across 4 in-distribution and 2 out-of-distribution benchmarks. Scaling training data yields meaningful gains, each pipeline component improves data quality, and stronger world modeling improves downstream mobile GUI policy performance.

Conclusion: Renderable code generation paradigm successfully combines precise text rendering with high-fidelity visual generation for mobile GUI world models, offering a promising direction for improving GUI agent performance through visual world modeling.

Abstract: Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and inference-time. However, current approaches face a critical trade-off: text-based WMs sacrifice visual fidelity, while the inability of visual WMs in precise text rendering led to their reliance on slow, complex pipelines dependent on numerous external models. We propose a novel paradigm: visual world modeling via renderable code generation, where a single Vision-Language Model (VLM) predicts the next GUI state as executable web code that renders to pixels, rather than generating pixels directly. This combines the strengths of both approaches: VLMs retain their linguistic priors for precise text rendering while their pre-training on structured web code enables high-fidelity visual generation. We introduce gWorld (8B, 32B), the first open-weight visual mobile GUI WMs built on this paradigm, along with a data generation framework (gWorld) that automatically synthesizes code-based training data. In extensive evaluation across 4 in- and 2 out-of-distribution benchmarks, gWorld sets a new pareto frontier in accuracy versus model size, outperforming 8 frontier open-weight models over 50.25x larger. Further analyses show that (1) scaling training data via gWorld yields meaningful gains, (2) each component of our pipeline improves data quality, and (3) stronger world modeling improves downstream mobile GUI policy performance.

[1220] Local Exponential Stability of Mean-Field Langevin Descent-Ascent in Wasserstein Space

Geuntaek Seo, Minseop Shin, Pierre Monmarché, Beomjun Choi

Main category: cs.LG

TL;DR: MFL-DA dynamics for entropically regularized two-player zero-sum games are proven to have local exponential stability near the unique mixed Nash equilibrium via spectral analysis and coercivity estimates.

Details

Motivation: The paper addresses an open question about the long-time behavior of mean-field Langevin descent-ascent (MFL-DA) dynamics for nonconvex-nonconcave payoffs in two-player zero-sum games, which had remained largely unresolved despite the existence of a unique mixed Nash equilibrium.

Method: The authors establish a coercivity estimate for the entropy near equilibrium through spectral analysis of the linearized operator. This reveals a local displacement convex-concave structure that drives contraction, proving local exponential stability when initialization is sufficiently close in Wasserstein metric.

Result: The paper proves that the unique mixed Nash equilibrium is locally exponentially stable for MFL-DA dynamics: if initialized sufficiently close in Wasserstein metric, the dynamics converge to equilibrium at an exponential rate. This settles the local stability and quantitative rate questions posed by Wang and Chizat.

Conclusion: The work provides a partial resolution to the open question about MFL-DA dynamics, establishing local exponential stability through spectral analysis and coercivity estimates, while leaving global convergence as a remaining open challenge.

Abstract: We study the mean-field Langevin descent-ascent (MFL-DA), a coupled optimization dynamics on the space of probability measures for entropically regularized two-player zero-sum games. Although the associated mean-field objective admits a unique mixed Nash equilibrium, the long-time behavior of the original MFL-DA for general nonconvex-nonconcave payoffs has remained largely open. Answering an open question posed by Wang and Chizat (COLT 2024), we provide a partial resolution by proving that this equilibrium is locally exponentially stable: if the initialization is sufficiently close in Wasserstein metric, the dynamics trends to the equilibrium at an exponential rate. The key to our analysis is to establish a coercivity estimate for the entropy near equilibrium via spectral analysis of the linearized operator. We show that this coercivity effectively reveals a local displacement convex-concave structure, thereby driving contraction. This result settles the local stability and quantitative rate questions of Wang and Chizat, leaving global convergence as a remaining open challenge.

[1221] Nearly Optimal Active Preference Learning and Its Application to LLM Alignment

Yao Zhao, Kwang-Sung Jun

Main category: cs.LG

TL;DR: Active learning algorithms for preference learning that improve sample efficiency over classical experimental design methods by leveraging problem-specific structure.

Details

Motivation: High-quality human preference datasets for aligning LLMs are costly to collect, and existing active learning approaches use classical experimental design criteria not tailored to preference learning structure, leaving room for problem-specific algorithm design.

Method: Identifies a preference-learning-specific intuition that questions existing design objectives, then proposes two active learning algorithms: one with instance-dependent label complexity guarantees, and a practical greedy method.

Result: Evaluation on real-world preference datasets shows improved sample efficiency compared to existing methods.

Conclusion: Problem-specific active learning algorithms for preference learning can significantly improve sample efficiency over classical experimental design approaches when collecting human preference data for LLM alignment.

Abstract: Aligning large language models (LLMs) depends on high-quality datasets of human preference labels, which are costly to collect. Although active learning has been studied to improve sample efficiency relative to passive collection, many existing approaches adopt classical experimental design criteria such as G- or D-optimality. These objectives are not tailored to the structure of preference learning, leaving open the design of problem-specific algorithms. In this work, we identify a simple intuition specific to preference learning that calls into question the suitability of these existing design objectives. Motivated by this insight, we propose two active learning algorithms. The first provides the first instance-dependent label complexity guarantee for this setting, and the second is a simple, practical greedy method. We evaluate our algorithm on real-world preference datasets and observe improved sample efficiency compared to existing methods.

[1222] A Lightweight Sparse Interaction Network for Time Series Forecasting

Xu Zhang, Qitong Wang, Peng Wang, Wei Wang

Main category: cs.LG

TL;DR: LSINet is a lightweight sparse interaction network for time-series forecasting that uses multihead sparse interaction mechanisms with Bernoulli distributions to capture temporal dependencies more effectively than linear or transformer models.

Details

Motivation: Linear models outperform transformers in long-term time-series forecasting but lack explicit temporal interaction mechanisms, relying instead on stacked MLP structures which may be insufficient for capturing complex temporal dependencies.

Method: Proposes LSINet with Multihead Sparse Interaction Mechanism (MSIM) that learns important connections between time steps through sparsity-induced Bernoulli distribution, using self-adaptive regularization loss for sparsity and Shared Interaction Learning (SIL) for efficiency.

Result: Extensive experiments show LSINet achieves higher accuracy and better efficiency than advanced linear models and transformer models in time-series forecasting tasks.

Conclusion: LSINet successfully combines the efficiency of linear models with explicit temporal interaction mechanisms, outperforming both linear and transformer baselines in time-series forecasting.

Abstract: Recent work shows that linear models can outperform several transformer models in long-term time-series forecasting (TSF). However, instead of explicitly performing temporal interaction through self-attention, linear models implicitly perform it based on stacked MLP structures, which may be insufficient in capturing the complex temporal dependencies and their performance still has potential for improvement. To this end, we propose a Lightweight Sparse Interaction Network (LSINet) for TSF task. Inspired by the sparsity of self-attention, we propose a Multihead Sparse Interaction Mechanism (MSIM). Different from self-attention, MSIM learns the important connections between time steps through sparsity-induced Bernoulli distribution to capture temporal dependencies for TSF. The sparsity is ensured by the proposed self-adaptive regularization loss. Moreover, we observe the shareability of temporal interactions and propose to perform Shared Interaction Learning (SIL) for MSIM to further enhance efficiency and improve convergence. LSINet is a linear model comprising only MLP structures with low overhead and equipped with explicit temporal interaction mechanisms. Extensive experiments on public datasets show that LSINet achieves both higher accuracy and better efficiency than advanced linear models and transformer models in TSF tasks. The code is available at the link https://github.com/Meteor-Stars/LSINet.

[1223] Spectral Text Fusion: A Frequency-Aware Approach to Multimodal Time-Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

Main category: cs.LG

TL;DR: SpecTF integrates textual context with time series forecasting via frequency domain fusion, using spectral decomposition and cross-attention to adaptively weight frequency bands based on textual relevance.

Details

Motivation: Existing multimodal time series forecasting methods align textual features with time-series patterns locally but neglect multiscale temporal influences (cycles, dynamic shifts). There's a mismatch between local alignment and global textual context that spectral decomposition can address.

Method: Extracts textual embeddings, projects them into frequency domain, fuses with time series’ spectral components using lightweight cross-attention to adaptively reweight frequency bands based on textual relevance, then maps back to temporal domain for predictions.

Result: SpecTF significantly outperforms state-of-the-art models across diverse multimodal time series datasets while using considerably fewer parameters.

Conclusion: Frequency domain integration of textual context with time series via spectral decomposition and adaptive reweighting is an effective approach for multimodal forecasting with better performance and efficiency.

Abstract: Multimodal time series forecasting is crucial in real-world applications, where decisions depend on both numerical data and contextual signals. The core challenge is to effectively combine temporal numerical patterns with the context embedded in other modalities, such as text. While most existing methods align textual features with time-series patterns one step at a time, they neglect the multiscale temporal influences of contextual information such as time-series cycles and dynamic shifts. This mismatch between local alignment and global textual context can be addressed by spectral decomposition, which separates time series into frequency components capturing both short-term changes and long-term trends. In this paper, we propose SpecTF, a simple yet effective framework that integrates the effect of textual data on time series in the frequency domain. Our method extracts textual embeddings, projects them into the frequency domain, and fuses them with the time series’ spectral components using a lightweight cross-attention mechanism. This adaptively reweights frequency bands based on textual relevance before mapping the results back to the temporal domain for predictions. Experimental results demonstrate that SpecTF significantly outperforms state-of-the-art models across diverse multi-modal time series datasets while utilizing considerably fewer parameters. Code is available at https://github.com/hiepnh137/SpecTF.

[1224] The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR

Israel Adewuyi, Solomon Okibe, Vladmir Ivanov

Main category: cs.LG

TL;DR: Training only 1% of randomly selected parameters matches full-parameter RLVR finetuning, suggesting pretrained models contain many viable sparse subnetworks rather than one privileged set.

Details

Motivation: The Lottery Ticket Hypothesis shows sparse subnetworks can match full-model performance, and RLVR work shows updates concentrate on sparse parameter subsets. This suggests parameter redundancy that could be exploited through extreme sparsity training.

Method: Train only a randomly selected subset of parameters at extreme sparsities (1%) in Reinforcement Learning with Verifiable Rewards (RLVR) settings across 3 models and 2 task domains.

Result: Training just 1% of parameters matches or exceeds full-parameter RLVR finetuning. Different random masks show minimal overlap (≤0.005 Jaccard similarity) yet all succeed, suggesting many viable sparse subnetworks exist.

Conclusion: Pretrained models contain many viable sparse subnetworks rather than one privileged set (Multiple Ticket Hypothesis), enabled by RLVR’s implicit per-step KL constraint restricting updates to low-dimensional subspace.

Abstract: The Lottery Ticket Hypothesis demonstrated that sparse subnetworks can match full-model performance, suggesting parameter redundancy. Meanwhile, in Reinforcement Learning with Verifiable Rewards (RLVR), recent work has shown that updates concentrate on a sparse subset of parameters, which further lends evidence to this underlying redundancy. We study the simplest possible way to exploit this redundancy: training only a randomly selected subset of parameters at extreme sparsities. Empirically, we find that training just 1% of parameters matches or exceeds full-parameter RLVR finetuning across 3 models and 2 task domains. Moreover, different random masks show minimal overlap ($\leq 0.005$ Jaccard similarity) and yet all succeed, suggesting pretrained models contain many viable sparse subnetworks rather than one privileged set. We term this the Multiple Ticket Hypothesis. We explain this phenomenon through the implicit per-step KL constraint in RLVR, which restricts updates to a low-dimensional subspace, enabling arbitrary sparse masks to succeed.

[1225] Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching

Zeqiao Li, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo

Main category: cs.LG

TL;DR: FLAME is a flow-based reinforcement learning framework that enables efficient one-step policy generation while maintaining maximum entropy exploration, outperforming Gaussian baselines and matching diffusion policies with lower inference cost.

Details

Motivation: Diffusion policies offer expressiveness but have high inference latency, while Flow Matching enables one-step generation but is challenging to integrate with Maximum Entropy RL due to intractable optimal policy distributions and discretization bias in log-likelihood estimation.

Method: Proposes FLAME framework with three key components: 1) Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting, 2) decoupled entropy estimator that corrects bias for efficient exploration, and 3) MeanFlow formulation for expressive one-step control.

Result: Empirical results on MuJoCo show FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost.

Conclusion: FLAME provides a principled framework for integrating flow matching with maximum entropy RL, enabling efficient one-step policy generation while maintaining exploration-exploitation balance.

Abstract: Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation required to balance exploration and exploitation suffers from severe discretization bias. We propose \textbf{F}low-based \textbf{L}og-likelihood-\textbf{A}ware \textbf{M}aximum \textbf{E}ntropy RL (\textbf{FLAME}), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects bias, which enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies with significantly lower inference cost. Code is available at https://github.com/lzqw/FLAME.

[1226] Universal Redundancies in Time Series Foundation Models

Anthony Bao, Venkata Hasith Vattikuti, Jeffrey Lai, William Gilpin

Main category: cs.LG

TL;DR: TSFMs show redundant components; mechanistic interpretability tools reveal robustness to layer ablations and identify heads causing degenerate behaviors like motif parroting and seasonality bias.

Details

Motivation: Time Series Foundation Models (TSFMs) are emerging as powerful tools for time series prediction, but their internal mechanisms are poorly understood. The paper aims to analyze the redundancy and interpretability of transformer-based TSFMs to understand their universal properties and identify components responsible for degenerate behaviors.

Method: Developed mechanistic interpretability tools including component ablations and direct logit attribution on residual streams. Applied these to multiple leading TSFMs with diverse architectures across real-world and synthetic datasets. Introduced theoretical framework framing transformers as kernel regressors, enabling intrinsic ablation strategy based on stable rank of per-head projection matrices.

Result: Found that all studied TSFMs are robust to ablations of entire layers. Identified specific attention heads responsible for degenerate phenomena like parroting of motifs from context and seasonality bias. Discovered consistent patterns across diverse models and datasets, revealing universal properties of transformer-based TSFMs.

Conclusion: TSFMs exhibit significant redundancy in intermediate layers, and mechanistic interpretability tools can successfully identify components responsible for specific behaviors. The theoretical framework provides principled approach for analyzing transformer architectures in time series modeling, shedding light on universal properties of this emerging class.

Abstract: Time Series Foundation Models (TSFMs) leverage extensive pretraining to accurately predict unseen time series during inference, without the need for task-specific fine-tuning. Through large-scale evaluations on standard benchmarks, we find that leading transformer-based TSFMs exhibit redundant components in their intermediate layers. We introduce a set of tools for mechanistic interpretability of TSFMs, including ablations of specific components and direct logit attribution on the residual stream. Our findings are consistent across several leading TSFMs with diverse architectures, and across a diverse set of real-world and synthetic time-series datasets. We discover that all models in our study are robust to ablations of entire layers. Furthermore, we develop a theoretical framework framing transformers as kernel regressors, motivating a purely intrinsic strategy for ablating heads based on the stable rank of the per-head projection matrices. Using this approach, we uncover the specific heads responsible for degenerate phenomena widely observed in TSFMs, such as parroting of motifs from the context and seasonality bias. Our study sheds light on the universal properties of this emerging class of architectures for continuous-time sequence modeling.

[1227] A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models

Sergii Kozyrev, Davyd Maiboroda

Main category: cs.LG

TL;DR: Minima: A production compression pipeline for Transformers that uses sensitivity prediction and tensor decompositions to reduce memory footprint and improve inference latency, enabling speculative decoding.

Details

Motivation: Large language models face deployment challenges due to GPU memory constraints and inference latency. Existing compression methods need to be more practical for production serving with real throughput gains.

Method: Trains convolutional predictor to estimate layer/patch sensitivity, applies Tucker/tensor-train/tensor-ring decompositions to low-sensitivity regions, performs healing fine-tune, and implements custom Triton/CUDA kernels. Enables speculative decoding with reduced memory footprint.

Result: On Qwen3-32B with 8k context: reduces VRAM from 64GB to 40GB; throughput increases from 40 to 50 tokens/sec (single request) and 34 to 44 tokens/sec (50 parallel requests); with speculative decoding reaches 75 tokens/sec (single) and 53 tokens/sec (parallel).

Conclusion: Minima provides practical structural compression for production deployment, enabling memory reduction and latency improvements while maintaining effectiveness under high concurrency, paving way for more aggressive compression via shared tensor backbones.

Abstract: Large language models are limited in deployment by GPU memory and inference latency. We present Minima, a production compression pipeline that learns where and how to structurally compress a Transformer and turns that compression into real serving gains. Minima trains a lightweight convolutional predictor to estimate layer- and patch-level sensitivity, applies a mixture of Tucker, tensor-train, and tensor-ring decompositions to low-sensitivity regions, performs a short healing fine-tune, and executes the resulting operators with custom Triton and CUDA kernels. The reduced memory footprint enables speculative decoding with a small draft model and a larger verifier. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens per second (Minima with speculative decoding). Under 50 parallel requests, throughput is 34, 44, and 53 tokens per second respectively, showing that Minima remains effective under high concurrency even when speculative decoding gains compress. We position Minima relative to recent tensor-network, low-rank plus quantization, and cross-layer sharing methods, and argue that it is a practical step toward more aggressive structural compression via shared tensor backbones with tiny per-layer adapters.

[1228] What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

Weizheng Gu, Chengze Li, Zhuohao Yu, Mengyuan Sun, Zhibang Yang, Wei Wang, Hongrui Jia, Shikun Zhang, Wei Ye

Main category: cs.LG

TL;DR: PIPE protocol evaluates agent reliance on interface shortcuts vs. semantic understanding by minimally rewriting environment interfaces while preserving task semantics.

Details

Motivation: Current agent benchmarks conflate semantic tool-use with interface-specific pattern memorization, making it impossible to distinguish true environment-invariant capabilities from interface shortcutting.

Method: PIPE (Protocol-level Interface Perturbation Evaluation) minimally rewrites environment interfaces while preserving task semantics and execution behavior, then measures performance degradation. Introduces Interface Reliance (IR) metric to quantify preference for training-time interfaces.

Result: Trajectory-SFT trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain stable. Interface shortcutting exhibits environment-dependent, non-monotonic training dynamics invisible under standard evaluation.

Conclusion: Standard benchmarks are insufficient for evaluating true agent capabilities; PIPE reveals interface reliance patterns, especially for trajectory-SFT trained agents, highlighting the need for more robust evaluation protocols.

Abstract: Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool-use and interface-specific interaction pattern memorization. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation for diagnosing interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at https://anonymous.4open.science/r/What-Do-Agents-Learn-from-Trajectory-SFT-Semantics-or-Interfaces--0831/.

[1229] AgroFlux: A Spatial-Temporal Benchmark for Carbon and Nitrogen Flux Prediction in Agricultural Ecosystems

Qi Cheng, Licheng Liu, Yao Zhang, Mu Hong, Yiqun Xie, Xiaowei Jia

Main category: cs.LG

TL;DR: First spatial-temporal agroecosystem GHG benchmark dataset integrating physics-based simulations with real-world observations, evaluated with sequential deep learning models for carbon/nitrogen flux prediction.

Details

Motivation: Agroecosystems are crucial for climate mitigation but require accurate quantification of carbon/nutrient/water fluxes. Current approaches face data sparsity, heterogeneity, and complex subsurface processes, lacking AI-ready benchmark datasets.

Method: Created benchmark dataset integrating physics-based model simulations (Ecosys, DayCent) with real-world observations (eddy covariance flux towers, controlled-environment facilities). Evaluated LSTM, temporal CNN, and Transformer models on carbon/nitrogen flux prediction, and explored transfer learning from simulated to real data.

Result: Developed first-of-its-kind spatial-temporal agroecosystem GHG benchmark dataset. Evaluated performance of various sequential deep learning models on flux prediction tasks.

Conclusion: The benchmark dataset and evaluation framework contribute to developing more accurate and scalable AI-driven agroecosystem models, advancing understanding of ecosystem-climate interactions.

Abstract: Agroecosystem, which heavily influenced by human actions and accounts for a quarter of global greenhouse gas emissions (GHGs), plays a crucial role in mitigating global climate change and securing environmental sustainability. However, we can’t manage what we can’t measure. Accurately quantifying the pools and fluxes in the carbon, nutrient, and water nexus of the agroecosystem is therefore essential for understanding the underlying drivers of GHG and developing effective mitigation strategies. Conventional approaches like soil sampling, process-based models, and black-box machine learning models are facing challenges such as data sparsity, high spatiotemporal heterogeneity, and complex subsurface biogeochemical and physical processes. Developing new trustworthy approaches such as AI-empowered models, will require the AI-ready benchmark dataset and outlined protocols, which unfortunately do not exist. In this work, we introduce a first-of-its-kind spatial-temporal agroecosystem GHG benchmark dataset that integrates physics-based model simulations from Ecosys and DayCent with real-world observations from eddy covariance flux towers and controlled-environment facilities. We evaluate the performance of various sequential deep learning models on carbon and nitrogen flux prediction, including LSTM-based models, temporal CNN-based model, and Transformer-based models. Furthermore, we explored transfer learning to leverage simulated data to improve the generalization of deep learning models on real-world observations. Our benchmark dataset and evaluation framework contribute to the development of more accurate and scalable AI-driven agroecosystem models, advancing our understanding of ecosystem-climate interactions.

[1230] SUSD: Structured Unsupervised Skill Discovery through State Factorization

Seyed Mohammad Hadi Hosseini, Mahdieh Soleymani Baghshah

Main category: cs.LG

TL;DR: SUSD introduces a framework for unsupervised skill discovery that factorizes state space into independent components (objects/entities) and allocates distinct skill variables to different factors, enabling fine-grained control and more diverse skill discovery.

Details

Motivation: Existing unsupervised skill discovery methods like MI-based approaches tend to favor simple, static skills due to invariance properties, while distance-maximizing methods still fall short in encouraging comprehensive skill sets that engage all controllable factors in the environment.

Method: SUSD factorizes the state space into independent components (objects or controllable entities), allocates distinct skill variables to different factors, and uses a dynamic model to track learning across factors, adaptively steering the agent’s focus toward underexplored factors.

Result: Experimental results across three environments with factors ranging from 1 to 10 demonstrate that SUSD discovers diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments.

Conclusion: SUSD’s structured approach promotes discovery of richer, more diverse skills and yields factorized skill representations that enable fine-grained disentangled control over individual entities, facilitating efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning.

Abstract: Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still fall short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control on the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent’s focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi-hosseini/SUSD.

[1231] Toward Enhancing Representation Learning in Federated Multi-Task Settings

Mehdi Setayesh, Mahdi Beitollahi, Yasser H. Khalil, Hongliang Li

Main category: cs.LG

TL;DR: FedMuscle: A federated multi-task learning framework using Muscle loss, a novel contrastive learning objective that aligns representations across heterogeneous models by maximizing mutual information among all models’ representations.

Details

Motivation: Existing federated multi-task learning approaches assume model congruity (homogeneous models), limiting applicability in realistic settings where users have different tasks and heterogeneous models.

Method: Proposes Muscle loss, a contrastive learning objective that simultaneously aligns representations from all participating models by maximizing mutual information among all models’ representations. Develops FedMuscle as a practical, communication-efficient algorithm that handles both model and task heterogeneity.

Result: Experiments on diverse image and language tasks show FedMuscle consistently outperforms state-of-the-art baselines, delivering substantial improvements and robust performance across heterogeneous settings.

Conclusion: FedMuscle effectively addresses model and task heterogeneity in federated multi-task learning by learning shared representation spaces through mutual information maximization, enabling practical applications with diverse user requirements.

Abstract: Federated multi-task learning (FMTL) seeks to collaboratively train customized models for users with different tasks while preserving data privacy. Most existing approaches assume model congruity (i.e., the use of fully or partially homogeneous models) across users, which limits their applicability in realistic settings. To overcome this limitation, we aim to learn a shared representation space across tasks rather than shared model parameters. To this end, we propose Muscle loss, a novel contrastive learning objective that simultaneously aligns representations from all participating models. Unlike existing multi-view or multi-model contrastive methods, which typically align models pairwise, Muscle loss can effectively capture dependencies across tasks because its minimization is equivalent to the maximization of mutual information among all the models’ representations. Building on this principle, we develop FedMuscle, a practical and communication-efficient FMTL algorithm that naturally handles both model and task heterogeneity. Experiments on diverse image and language tasks demonstrate that FedMuscle consistently outperforms state-of-the-art baselines, delivering substantial improvements and robust performance across heterogeneous settings.

[1232] AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments

Renukanandan Tumu, Aditya Singh, Rahul Mangharam

Main category: cs.LG

TL;DR: AdaptNC: Joint online adaptation of nonconformity score parameters and conformal threshold for better uncertainty quantification in robotics under distribution shifts.

Details

Motivation: Standard conformal prediction methods assume exchangeability, which is violated by real-world robotics distribution shifts. Existing online CP methods only adapt thresholds but keep nonconformity scores static, leading to conservative, volume-inefficient prediction regions during structural shifts.

Method: Proposes AdaptNC framework that jointly adapts both nonconformity score parameters and conformal threshold online. Uses adaptive reweighting scheme to optimize score functions and replay buffer mechanism to mitigate coverage instability during score transitions.

Result: Evaluated on diverse robotic benchmarks involving multi-agent policy changes, environmental changes, and sensor degradation. Significantly reduces prediction region volume compared to state-of-the-art threshold-only baselines while maintaining target coverage levels.

Conclusion: AdaptNC provides more efficient uncertainty quantification for autonomous systems in dynamic environments by jointly adapting both score functions and thresholds, overcoming limitations of existing online CP methods.

Abstract: Rigorous uncertainty quantification is essential for the safe deployment of autonomous systems in unconstrained environments. Conformal Prediction (CP) provides a distribution-free framework for this task, yet its standard formulations rely on exchangeability assumptions that are violated by the distribution shifts inherent in real-world robotics. Existing online CP methods maintain target coverage by adaptively scaling the conformal threshold, but typically employ a static nonconformity score function. We show that this fixed geometry leads to highly conservative, volume-inefficient prediction regions when environments undergo structural shifts. To address this, we propose \textbf{AdaptNC}, a framework for the joint online adaptation of both the nonconformity score parameters and the conformal threshold. AdaptNC leverages an adaptive reweighting scheme to optimize score functions, and introduces a replay buffer mechanism to mitigate the coverage instability that occurs during score transitions. We evaluate AdaptNC on diverse robotic benchmarks involving multi-agent policy changes, environmental changes and sensor degradation. Our results demonstrate that AdaptNC significantly reduces prediction region volume compared to state-of-the-art threshold-only baselines while maintaining target coverage levels.

[1233] The Effect of Mini-Batch Noise on the Implicit Bias of Adam

Matias D. Cattaneo, Boris Shigida

Main category: cs.LG

TL;DR: Theoretical analysis of how Adam optimizer’s momentum hyperparameters (β₁, β₂) interact with batch size to influence implicit bias toward sharper/flatter loss regions, affecting generalization in multi-epoch training.

Details

Motivation: With limited high-quality data and increasing compute, multi-epoch training is becoming more important. Adam(W) is widely used but its hyperparameters (β₁, β₂) and batch size interact in complex ways affecting generalization. Understanding how mini-batch noise influences Adam's implicit bias toward sharper or flatter loss regions is crucial for improving multi-epoch training.

Method: Developed a theoretical framework to analyze how mini-batch noise influences the implicit bias of Adam’s momentum memory. Examined the interaction between batch size and momentum hyperparameters (β₁, β₂) in controlling regularization effects. Connected batch size scale to critical batch size theory. Validated findings with experiments on small-scale data in about-to-overfit regimes.

Result: Found that for large batch sizes, higher β₂ increases anti-regularization (hurting generalization), but this dependence reverses for smaller batches. Similar monotonicity shift occurs in β₁ in the opposite direction. The default (0.9, 0.999) works well for small batches; for larger batches, moving β₁ closer to β₂ improves validation accuracy. Connected batch size scale shift to critical batch size scale.

Conclusion: Batch size significantly influences how Adam’s momentum hyperparameters affect regularization and generalization. Optimal hyperparameter settings depend on batch size, with different regimes requiring different (β₁, β₂) configurations. This provides guidance for tuning Adam in multi-epoch training scenarios.

Abstract: With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(β_1, β_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $β_1$, $β_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $β_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regulariation on $β_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $β_1$. In particular, the commonly “default” pair $(β_1, β_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $β_1$ closer to $β_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.

[1234] COMET: Codebook-based Online-adaptive Multi-scale Embedding for Time-series Anomaly Detection

Jinwoo Park, Hyeongwon Kang, Seung Hun Han, Pilsung Kang

Main category: cs.LG

TL;DR: COMET: Codebook-based Online-adaptive Multi-scale Embedding for Time-series anomaly detection using multi-scale patch encoding, vector-quantized coreset, and online adaptation.

Details

Motivation: Time series anomaly detection needs better temporal dependency capture, multi-scale pattern handling, and robustness to distribution shifts at inference time.

Method: Three components: 1) Multi-scale Patch Encoding for temporal dependencies and correlations, 2) Vector-Quantized Coreset for normal pattern learning with dual-score detection, 3) Online Codebook Adaptation with pseudo-labeling and contrastive learning.

Result: Achieves best performance in 36 out of 45 evaluation metrics across five benchmark datasets.

Conclusion: COMET effectively addresses limitations in time series anomaly detection through multi-scale representation learning and online adaptation.

Abstract: Time series anomaly detection is a critical task across various industrial domains. However, capturing temporal dependencies and multivariate correlations within patch-level representation learning remains underexplored, and reliance on single-scale patterns limits the detection of anomalies across different temporal ranges. Furthermore, focusing on normal data representations makes models vulnerable to distribution shifts at inference time. To address these limitations, we propose Codebook-based Online-adaptive Multi-scale Embedding for Time-series anomaly detection (COMET), which consists of three key components: (1) Multi-scale Patch Encoding captures temporal dependencies and inter-variable correlations across multiple patch scales. (2) Vector-Quantized Coreset learns representative normal patterns via codebook and detects anomalies with a dual-score combining quantization error and memory distance. (3) Online Codebook Adaptation generates pseudo-labels based on codebook entries and dynamically adapts the model at inference through contrastive learning. Experiments on five benchmark datasets demonstrate that COMET achieves the best performance in 36 out of 45 evaluation metrics, validating its effectiveness across diverse environments.

[1235] De Novo Molecular Generation from Mass Spectra via Many-Body Enhanced Diffusion

Xichen Sun, Wentao Wei, Jiahua Rao, Jiancong Xie, Yuedong Yang

Main category: cs.LG

TL;DR: MBGen: A many-body enhanced diffusion framework for de novo molecular structure generation from mass spectrometry data, focusing on capturing higher-order interactions in MS/MS spectra for accurate molecular generation and isomer differentiation.

Details

Motivation: Existing methods for molecular structure generation from mass spectrometry data use atom-centric and pairwise interaction modeling, which overlooks higher-order edge interactions and many-body characteristics crucial for resolving complex isomers and non-local fragmentation mechanisms in MS/MS spectra.

Method: MBGen integrates a many-body attention mechanism and higher-order edge modeling within a diffusion framework to comprehensively leverage structural information encoded in MS/MS spectra for de novo molecular structure generation.

Result: MBGen achieves superior performance on NPLIB1 and MassSpecGym benchmarks with improvements up to 230% over state-of-the-art methods, effectively capturing higher-order interactions and showing enhanced sensitivity to complex isomeric and non-local fragmentation information.

Conclusion: The many-body modeling approach demonstrates scientific value and practical utility for mass spectrometry-based molecular generation, enabling accurate de novo generation and isomer differentiation for novel molecules.

Abstract: Molecular structure generation from mass spectrometry is fundamental for understanding cellular metabolism and discovering novel compounds. Although tandem mass spectrometry (MS/MS) enables the high-throughput acquisition of fragment fingerprints, these spectra often reflect higher-order interactions involving the concerted cleavage of multiple atoms and bonds-crucial for resolving complex isomers and non-local fragmentation mechanisms. However, most existing methods adopt atom-centric and pairwise interaction modeling, overlooking higher-order edge interactions and lacking the capacity to systematically capture essential many-body characteristics for structure generation. To overcome these limitations, we present MBGen, a Many-Body enhanced diffusion framework for de novo molecular structure Generation from mass spectra. By integrating a many-body attention mechanism and higher-order edge modeling, MBGen comprehensively leverages the rich structural information encoded in MS/MS spectra, enabling accurate de novo generation and isomer differentiation for novel molecules. Experimental results on the NPLIB1 and MassSpecGym benchmarks demonstrate that MBGen achieves superior performance, with improvements of up to 230% over state-of-the-art methods, highlighting the scientific value and practical utility of many-body modeling for mass spectrometry-based molecular generation. Further analysis and ablation studies show that our approach effectively captures higher-order interactions and exhibits enhanced sensitivity to complex isomeric and non-local fragmentation information.

[1236] Chance-Constrained Inference for Hallucination Risk Control in Large Language Models

Sreenivasan Mohandas

Main category: cs.LG

TL;DR: Chance-constrained inference provides probabilistic guarantees against hallucinations in LLM outputs by treating hallucinations as stochastic constraint violations and using sequential testing for risk control.

Details

Motivation: LLMs produce stochastic outputs that may include factual hallucinations. Existing methods reduce average error rates but don't provide explicit control over hallucination frequency under repeated use, lacking probabilistic guarantees for deployment safety.

Method: Formulates inference as deployment-time risk control problem with chance constraints bounding hallucination probability. Uses sequential, anytime-valid inference procedure that adaptively certifies feasibility/infeasibility using finite samples, avoiding conservative fixed-sample bounds.

Result: Experiments on NaturalQuestions-style and multi-hop QA show reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use. Confidence-based baselines fail to provide consistent guarantees.

Conclusion: Chance-constrained inference provides formal probabilistic guarantees against hallucinations in LLM outputs, enabling safe deployment with controlled risk, outperforming confidence-based approaches that lack consistent guarantees.

Abstract: Large language models generate outputs stochastically and may produce fluent but invalid responses, including factual hallucinations. Existing mitigation strategies reduce average error rates but do not provide explicit control over the \emph{frequency} of such failures under repeated use. We formulate inference as a deployment-time risk control problem and introduce \emph{chance-constrained inference}, which directly bounds the probability of hallucinations among accepted generations. Hallucinations are modeled as stochastic constraint violations, and we show that confidence-based selective prediction does not, in general, imply probabilistic risk guarantees. To enforce chance constraints efficiently, we propose a sequential, anytime-valid inference procedure that adaptively certifies feasibility or infeasibility using finite samples, avoiding conservative fixed-sample bounds. Experiments on questions inspired by NaturalQuestions and controlled multi-hop question answering demonstrate reliable risk control, early detection of intrinsically infeasible inputs, and safe composition under repeated use, while confidence-based baselines fail to provide consistent guarantees.

[1237] On the Spatiotemporal Dynamics of Generalization in Neural Networks

Zichao Wei

Main category: cs.LG

TL;DR: The paper proposes a physics-inspired neural architecture (SEAD) that achieves perfect length generalization on arithmetic tasks by enforcing locality, symmetry, and stability constraints derived from physical postulates.

Details

Motivation: Neural networks fail to generalize arithmetic operations (like addition) from short to long sequences, unlike humans who learn rules. The authors argue this is not an engineering problem but a violation of fundamental physical principles that govern generalizing systems.

Method: Derived from three physical postulates (locality, symmetry, stability), the authors propose Spatiotemporal Evolution with Attractor Dynamics (SEAD) - a neural cellular automaton architecture with local convolutional rules iterated until convergence to discrete attractors.

Result: SEAD achieves: 1) Perfect length generalization on parity via light-cone propagation; 2) 100% accuracy on addition from length 16 to 1 million digits; 3) Learning Turing-complete Rule 110 cellular automaton without trajectory divergence.

Conclusion: The gap between statistical learning and logical reasoning can be bridged by respecting the physics of computation rather than scaling parameters, suggesting fundamental architectural constraints are needed for true generalization.

Abstract: Why do neural networks fail to generalize addition from 16-digit to 32-digit numbers, while a child who learns the rule can apply it to arbitrarily long sequences? We argue that this failure is not an engineering problem but a violation of physical postulates. Drawing inspiration from physics, we identify three constraints that any generalizing system must satisfy: (1) Locality – information propagates at finite speed; (2) Symmetry – the laws of computation are invariant across space and time; (3) Stability – the system converges to discrete attractors that resist noise accumulation. From these postulates, we derive – rather than design – the Spatiotemporal Evolution with Attractor Dynamics (SEAD) architecture: a neural cellular automaton where local convolutional rules are iterated until convergence. Experiments on three tasks validate our theory: (1) Parity – demonstrating perfect length generalization via light-cone propagation; (2) Addition – achieving scale-invariant inference from L=16 to L=1 million with 100% accuracy, exhibiting input-adaptive computation; (3) Rule 110 – learning a Turing-complete cellular automaton without trajectory divergence. Our results suggest that the gap between statistical learning and logical reasoning can be bridged – not by scaling parameters, but by respecting the physics of computation.

[1238] Efficient Adversarial Attacks on High-dimensional Offline Bandits

Seyed Mohammad Hadi Hosseini, Amir Najafi, Mahdieh Soleymani Baghshah

Main category: cs.LG

TL;DR: Bandit algorithms for model evaluation are vulnerable to adversarial attacks on reward models, with small weight perturbations causing drastic behavior changes, especially in high-dimensional settings like image evaluation.

Details

Motivation: While bandit algorithms have become popular for efficient evaluation of ML models using reward models from platforms like Hugging Face, the adversarial robustness of offline bandit evaluation remains unexplored, particularly when attackers perturb the reward model rather than training data.

Method: The paper investigates vulnerability theoretically and empirically, introducing a threat model where attackers exploit offline data to hijack bandit behavior. It studies attacks on linear reward functions and extends to nonlinear models like ReLU neural networks, targeting two Hugging Face evaluators for generative models (aesthetic quality and compositional alignment).

Result: Results show small, imperceptible perturbations to reward model weights can drastically alter bandit behavior. Theoretically, perturbation norm required for successful attacks decreases as input dimensionality increases, making modern applications like image evaluation especially vulnerable. Experiments confirm targeted perturbations achieve near-perfect attack success rates while random perturbations are ineffective.

Conclusion: Offline bandit evaluation with reward models from public platforms is vulnerable to adversarial weight perturbations, requiring new security considerations for ML evaluation pipelines, especially in high-dimensional domains like image and generative model assessment.

Abstract: Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and large language models, by efficiently identifying top-performing candidates without exhaustive comparisons. These methods typically rely on a reward model, often distributed with public weights on platforms such as Hugging Face, to provide feedback to the bandit. While online evaluation is expensive and requires repeated trials, offline evaluation with logged data has become an attractive alternative. However, the adversarial robustness of offline bandit evaluation remains largely unexplored, particularly when an attacker perturbs the reward model (rather than the training data) prior to bandit training. In this work, we fill this gap by investigating, both theoretically and empirically, the vulnerability of offline bandit training to adversarial manipulations of the reward model. We introduce a novel threat model in which an attacker exploits offline data in high-dimensional settings to hijack the bandit’s behavior. Starting with linear reward functions and extending to nonlinear models such as ReLU neural networks, we study attacks on two Hugging Face evaluators used for generative model assessment: one measuring aesthetic quality and the other assessing compositional alignment. Our results show that even small, imperceptible perturbations to the reward model’s weights can drastically alter the bandit’s behavior. From a theoretical perspective, we prove a striking high-dimensional effect: as input dimensionality increases, the perturbation norm required for a successful attack decreases, making modern applications such as image evaluation especially vulnerable. Extensive experiments confirm that naive random perturbations are ineffective, whereas carefully targeted perturbations achieve near-perfect attack success rates …

[1239] ASGMamba: Adaptive Spectral Gating Mamba for Multivariate Time Series Forecasting

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Jia Wei, Yueqi Xing

Main category: cs.LG

TL;DR: ASGMamba: Efficient multivariate time series forecasting framework combining adaptive spectral gating with Mamba architecture for resource-constrained environments

Details

Motivation: Transformer-based models have quadratic complexity limiting scalability, while linear State Space Models struggle to distinguish valuable signals from high-frequency noise, wasting state capacity. Need efficient forecasting for resource-constrained supercomputing environments.

Method: Proposes ASGMamba with lightweight Adaptive Spectral Gating (ASG) mechanism that dynamically filters noise based on local spectral energy, enabling Mamba backbone to focus on robust temporal dynamics. Includes hierarchical multi-scale architecture with variable-specific Node Embeddings.

Result: Achieves state-of-the-art accuracy on nine benchmarks while maintaining O(L) complexity. Significantly reduces memory usage on long-horizon tasks, establishing scalability for high-throughput forecasting in resource-limited environments.

Conclusion: ASGMamba provides a scalable solution for long-term multivariate time series forecasting in resource-constrained supercomputing environments, balancing efficiency and accuracy through adaptive spectral gating and Mamba architecture.

Abstract: Long-term multivariate time series forecasting (LTSF) plays a crucial role in various high-performance computing applications, including real-time energy grid management and large-scale traffic flow simulation. However, existing solutions face a dilemma: Transformer-based models suffer from quadratic complexity, limiting their scalability on long sequences, while linear State Space Models (SSMs) often struggle to distinguish valuable signals from high-frequency noise, leading to wasted state capacity. To bridge this gap, we propose ASGMamba, an efficient forecasting framework designed for resource-constrained supercomputing environments. ASGMamba integrates a lightweight Adaptive Spectral Gating (ASG) mechanism that dynamically filters noise based on local spectral energy, enabling the Mamba backbone to focus its state evolution on robust temporal dynamics. Furthermore, we introduce a hierarchical multi-scale architecture with variable-specific Node Embeddings to capture diverse physical characteristics. Extensive experiments on nine benchmarks demonstrate that ASGMamba achieves state-of-the-art accuracy. While keeping strictly $$\mathcal{O}(L)$$ complexity we significantly reduce the memory usage on long-horizon tasks, thus establishing ASGMamba as a scalable solution for high-throughput forecasting in resource limited environments.The code is available at https://github.com/hit636/ASGMamba

[1240] Quantifying Epistemic Predictive Uncertainty in Conformal Prediction

Siu Lun Chau, Soroush H. Zargarbashi, Yusuf Sale, Michele Caprio

Main category: cs.LG

TL;DR: The paper connects conformal prediction with epistemic uncertainty quantification by showing CP procedures induce credal sets, and proposes a Maximum Mean Imprecision measure to quantify epistemic predictive uncertainty from these sets.

Details

Motivation: The paper aims to address the problem of quantifying epistemic predictive uncertainty (uncertainty due to multiple plausible models) within the conformal prediction framework, going beyond just prediction region size to provide more informative uncertainty assessments.

Method: The authors show that conformal prediction procedures induce credal sets (closed convex sets of predictive distributions), prove this holds for split CP, and propose Maximum Mean Imprecision as a computationally efficient measure to quantify epistemic uncertainty from these credal sets.

Result: Experiments on active learning and selective classification demonstrate that the proposed epistemic uncertainty measure provides substantially more informative and fine-grained uncertainty assessments than relying solely on conformal prediction region size.

Conclusion: This work establishes a principled connection between conformal prediction and epistemic uncertainty quantification, showing CP can serve as a basis for decision-making under epistemic uncertainty with practical computational efficiency.

Abstract: We study the problem of quantifying epistemic predictive uncertainty (EPU) – that is, uncertainty faced at prediction time due to the existence of multiple plausible predictive models – within the framework of conformal prediction (CP). To expose the implicit model multiplicity underlying CP, we build on recent results showing that, under a mild assumption, any full CP procedure induces a set of closed and convex predictive distributions, commonly referred to as a credal set. Importantly, the conformal prediction region (CPR) coincides exactly with the set of labels to which all distributions in the induced credal set assign probability at least $1-α$. As our first contribution, we prove that this characterisation also holds in split CP. Building on this connection, we then propose a computationally efficient and analytically tractable uncertainty measure, based on \emph{Maximum Mean Imprecision}, to quantify the EPU by measuring the degree of conflicting information within the induced credal set. Experiments on active learning and selective classification demonstrate that the quantified EPU provides substantially more informative and fine-grained uncertainty assessments than reliance on CPR size alone. More broadly, this work highlights the potential of CP serving as a principled basis for decision-making under epistemic uncertainty.

[1241] Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets

Taihei Oki, Shinsaku Sakaue

Main category: cs.LG

TL;DR: Online inverse linear optimization with M-convex feasible sets achieves O(d log d) regret, resolving open question about polynomial bounds in dimension d.

Details

Motivation: The paper addresses the open question in online inverse linear optimization (contextual recommendation) about whether finite regret bounds polynomial in dimension d are achievable, given prior work showed either O(d log T) bounds or exponential bounds.

Method: Combines structural characterization of optimal solutions on M-convex sets (which include matroids) with geometric volume arguments, and extends to adversarial corruption using directed graph monitoring for adaptive detection.

Result: Achieves finite regret bound of O(d log d) for M-convex sets, and O((C+1)d log d) for adversarially corrupted feedback without prior knowledge of corruption level C.

Conclusion: Partially resolves the open question by showing polynomial regret bounds are achievable for M-convex feasible sets, with extensions to adversarial corruption settings.

Abstract: We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent’s hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent’s true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent’s optimal values and those achieved by the learner’s recommended actions. Prior work has established a regret bound of $O(d\log T)$, as well as a finite but exponentially large bound of $\exp(O(d\log d))$, where $d$ is the dimension of the optimization problem and $T$ is the time horizon, while a regret lower bound of $Ω(d)$ is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in $d$ is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are M-convex – a broad class that includes matroids – a finite regret bound of $O(d\log d)$ is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to $C$ rounds. We obtain a regret bound of $O((C+1)d\log d)$ without prior knowledge of $C$, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.

[1242] Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment

Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon

Main category: cs.LG

TL;DR: Wasserstein Policy Regularization (WPR) introduces semantic-aware regularization for RLHF using Wasserstein distance instead of KL divergence, enabling better alignment by considering token space geometry.

Details

Motivation: Current RLHF methods use KL divergence regularization which only compares token probabilities at identical indices, failing to capture semantic similarity between tokens. This limitation motivates a more semantic-aware regularization approach.

Method: Proposes Wasserstein Policy Regularization (WPR) based on entropy-regularized Wasserstein distance that incorporates token space geometry. Uses dual formulation to express regularization as penalty terms applied to reward via optimal dual variables, creating tractable objective compatible with standard RL algorithms.

Result: Empirically outperforms KL- and f-divergence-based baselines, demonstrating benefits of semantic-aware policy distances for alignment.

Conclusion: WPR provides a semantic-aware regularization method for RLHF that better captures token relationships and improves alignment performance compared to traditional divergence measures.

Abstract: Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization of the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at https://github.com/aailab-kaist/WPR.

[1243] Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin

Main category: cs.LG

TL;DR: LaDi-RL uses latent diffusion models for RL-based LLM reasoning optimization, exploring in continuous latent space instead of discrete token space to avoid diversity collapse and improve reasoning performance.

Details

Motivation: Current RL methods for LLM reasoning optimize discrete Chain-of-Thought generation but suffer from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. The authors aim to address this limitation by enabling exploration in a more expressive continuous latent space.

Method: Proposes Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), which conducts exploration directly in a continuous latent space where latent variables encode semantic-level reasoning trajectories. Uses guided diffusion modeling for exploration, with multi-step denoising that distributes stochasticity and preserves multiple coexisting solution modes. Decouples latent-space exploration from text-space generation, combining latent diffusion optimization with complementary text policy.

Result: Experiments on code generation and mathematical reasoning benchmarks show consistent improvements over discrete RL baselines. Achieves absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning. Demonstrates better performance in both pass@1 and pass@k metrics.

Conclusion: Diffusion-based latent RL provides a principled alternative to discrete token-level RL for reasoning tasks, effectively addressing diversity collapse issues while improving reasoning performance through continuous latent space exploration.

Abstract: Recent reinforcement learning (RL) methods improve LLM reasoning by optimizing discrete Chain-of-Thought (CoT) generation; however, exploration in token space often suffers from diversity collapse as policy entropy decreases due to mode elicitation behavior in discrete RL. To mitigate this issue, we propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), a framework that conducts exploration directly in a continuous latent space, where latent variables encode semantic-level reasoning trajectories. By modeling exploration via guided diffusion, multi-step denoising distributes stochasticity and preserves multiple coexisting solution modes without mutual suppression. Furthermore, by decoupling latent-space exploration from text-space generation, we show that latent diffusion-based optimization is more effective than text-space policy optimization alone, while a complementary text policy provides additional gains when combined with latent exploration. Experiments on code generation and mathematical reasoning benchmarks demonstrate consistent improvements in both pass@1 and pass@k over discrete RL baselines, with absolute pass@1 gains of +9.4% on code generation and +5.7% on mathematical reasoning, highlighting diffusion-based latent RL as a principled alternative to discrete token-level RL for reasoning.

[1244] Revisiting Generalization Measures Beyond IID: An Empirical Study under Distributional Shift

Sora Nakai, Youssef Fadhloun, Kacem Mathlouthi, Kotaro Yoshida, Ganesh Talluri, Ioannis Mitliagkas, Hiroki Naganuma

Main category: cs.LG

TL;DR: Large-scale study benchmarking 40+ generalization measures across 10,000+ hyperparameter configurations, evaluating their robustness beyond IID settings to diverse distribution shifts

Details

Motivation: Generalization remains an unresolved challenge in deep learning, particularly predicting model performance beyond training distribution using pre-test-time available quantities. Previous studies raised concerns about instability across training configurations, motivating a broader evaluation of generalization measures' robustness.

Method: Trained small-to-medium models over 10,000 hyperparameter configurations and evaluated more than 40 generalization measures computable from trained models and training data alone. Extended evaluation beyond IID to include diverse distribution shifts, multiple architectures/training recipes, and incorporated calibration- and information-criteria-based measures.

Result: Distribution shifts substantially alter predictive performance of many generalization measures, while a smaller subset remains comparatively stable across settings. Some measures show robustness across diverse distribution shifts.

Conclusion: Generalization measures vary significantly in their robustness to distribution shifts, with only a subset maintaining stable predictive performance across different OOD settings. This highlights the importance of evaluating generalization measures beyond IID regimes.

Abstract: Generalization remains a central yet unresolved challenge in deep learning, particularly the ability to predict a model’s performance beyond its training distribution using quantities available prior to test-time evaluation. Building on the large-scale study of Jiang et al. (2020). and concerns by Dziugaite et al. (2020). about instability across training configurations, we benchmark the robustness of generalization measures beyond IID regime. We train small-to-medium models over 10,000 hyperparameter configurations and evaluate more than 40 measures computable from the trained model and the available training data alone. We significantly broaden the experimental scope along multiple axes: (i) extending the evaluation beyond the standard IID setting to include benchmarking for robustness across diverse distribution shifts, (ii) evaluating multiple architectures and training recipes, and (iii) newly incorporating calibration- and information-criteria-based measures to assess their alignment with both IID and OOD generalization. We find that distribution shifts can substantially alter the predictive performance of many generalization measures, while a smaller subset remains comparatively stable across settings.

[1245] Softmax Linear Attention: Reclaiming Global Competition

Mingwei Xu, Xuan Lin, Xinnan Guo, Wanqing Xu, Wanyun Cui

Main category: cs.LG

TL;DR: Softmax Linear Attention (SLA) restores competitive selection in linear attention by applying softmax at head level instead of token level, enabling efficient long-context understanding while maintaining linear complexity.

Details

Motivation: Linear attention reduces quadratic complexity to linear time but loses expressivity by removing softmax normalization, which eliminates global competition - a critical mechanism for focusing on relevant information in long-context scenarios with noise.

Method: SLA lifts softmax operation from token level to head level, using attention heads as coarse semantic slots and applying competitive gating to dynamically select the most relevant subspaces, reintroducing winner-take-all dynamics while maintaining efficiency.

Result: SLA consistently enhances state-of-the-art linear baselines (RetNet, GLA, GDN) across language modeling and long-context benchmarks, particularly improving robustness in challenging retrieval scenarios against noise.

Conclusion: SLA successfully restores precise focus mechanisms in linear attention while maintaining linear complexity, offering a new perspective by exploiting multi-head aggregation structure rather than refining local kernel functions.

Abstract: While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates \emph{global competition}, a critical mechanism that enables models to sharply focus on relevant information amidst long-context noise. In this work, we propose \textbf{Softmax Linear Attention (SLA)}, a framework designed to restore this competitive selection without sacrificing efficiency. By lifting the softmax operation from the token level to the head level, SLA leverages attention heads as coarse semantic slots, applying a competitive gating mechanism to dynamically select the most relevant subspaces. This reintroduces the ``winner-take-all’’ dynamics essential for precise retrieval and robust long-context understanding. Distinct from prior methods that focus on refining local kernel functions, SLA adopts a broader perspective by exploiting the higher-level multi-head aggregation structure. Extensive experiments demonstrate that SLA consistently enhances state-of-the-art linear baselines (RetNet, GLA, GDN) across language modeling and long-context benchmarks, particularly in challenging retrieval scenarios where it significantly boosts robustness against noise, validating its capability to restore precise focus while maintaining linear complexity.

[1246] MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong

Main category: cs.LG

TL;DR: MSign optimizer prevents LLM training instability by periodically applying matrix sign operations to restore stable rank when detecting gradient explosion precursors.

Details

Motivation: Training instability in large language models causes sudden gradient explosions that waste computational resources, making stable training a critical challenge for scaling LLMs.

Method: Identified two precursors to collapse: declining weight matrix stable rank and increasing alignment between adjacent layer Jacobians. Proposed MSign optimizer that periodically applies matrix sign operations to restore stable rank when these conditions are detected.

Result: MSign effectively prevents training failures in models from 5M to 3B parameters with less than 7.0% computational overhead, demonstrating scalability across model sizes.

Conclusion: Matrix sign operations can effectively break the instability mechanism in LLM training, providing a practical solution to prevent gradient explosions during large-scale pretraining.

Abstract: Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via $μ$P, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.

[1247] Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning

Wenhao Yu, Shaohang Wei, Jiahong Liu, Yifan Li, Minda Hu, Aiwei Liu, Hao Zhang, Irwin King

Main category: cs.LG

TL;DR: RankTuner introduces token-level reweighting using probability-entropy calibration to focus fine-tuning on truly under-learned tokens, improving mathematical reasoning and code generation.

Details

Motivation: Current token-level reweighting methods are one-dimensional: ground-truth probability reflects downstream alignment but ignores intrinsic uncertainty, while token entropy reflects uncertainty but ignores target-specific alignment. This leads to misidentifying noisy or replaceable tokens as learning-critical.

Method: RankTuner introduces a probability-entropy calibration signal called Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse of this indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective.

Result: Experiments show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and improved code generation performance over probability-only or entropy-only reweighting baselines across multiple backbones.

Conclusion: RankTuner’s probability-entropy calibration effectively identifies truly under-learned tokens, leading to better fine-tuning outcomes without over-penalizing intrinsically uncertain positions.

Abstract: Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-dimensional: the ground-truth probability reflects downstream alignment, while token entropy reflects intrinsic uncertainty induced by the pre-training prior. Ignoring entropy can misidentify noisy or easily replaceable tokens as learning-critical, while ignoring probability fails to reflect target-specific alignment. RankTuner introduces a probability–entropy calibration signal, the Relative Rank Indicator, which compares the rank of the ground-truth token with its expected rank under the prediction distribution. The inverse indicator is used as a token-wise Relative Scale to reweight the fine-tuning objective, focusing updates on truly under-learned tokens without over-penalizing intrinsically uncertain positions. Experiments on multiple backbones show consistent improvements on mathematical reasoning benchmarks, transfer gains on out-of-distribution reasoning, and pre code generation performance over probability-only or entropy-only reweighting baselines.

[1248] Position: The Inevitable End of One-Architecture-Fits-All-Domains in Time Series Forecasting

Qinwei Ma, Jingzhe Shi, Jiahao Qiu, Zaiwen Yang

Main category: cs.LG

TL;DR: Paper critiques current neural network architectures for time series forecasting, arguing they’re too general and don’t help domain-specific applications like Finance, Weather, or Traffic. Calls for shift to domain-specific deep learning or meta-learning approaches.

Details

Motivation: The paper is motivated by concerns about the effectiveness and robustness of neural network architectures for time series forecasting. The authors observe that current architectures designed for general domains have become overly complex, performance has saturated, and they don't translate well to real-world domain-specific applications like Finance, Weather, or Traffic where specialized methods dominate.

Method: This is a position/analysis paper rather than a technical method paper. The authors analyze the limitations of current neural network architectures for time series forecasting, identifying the fundamental conflict between achieving SOTA on general domains versus practical usefulness for specific domains. They conduct a critical analysis of the field’s current state.

Result: The analysis reveals that neural network architectures for general domain time series forecasting have become increasingly complex with diminishing returns, and their performance has nearly saturated. Domain-specific applications continue to develop their own specialized methods that rarely incorporate recent advances from the time series neural network community.

Conclusion: The paper concludes by calling for a paradigm shift in the time series community: researchers should either (1) focus on deep learning methods for specific domains (like Finance, Weather, Traffic), or (2) develop meta-learning approaches for general domains, moving away from the current saturated research on general-purpose neural network architectures.

Abstract: Recent work has questioned the effectiveness and robustness of neural network architectures for time series forecasting tasks. We summarize these concerns and analyze groundly their inherent limitations: i.e. the irreconcilable conflict between single (or few similar) domains SOTA and generalizability over general domains for time series forecasting neural network architecture designs. Moreover, neural networks architectures for general domain time series forecasting are becoming more and more complicated and their performance has almost saturated in recent years. As a result, network architectures developed aiming at fitting general time series domains are almost not inspiring for real world practices for certain single (or few similar) domains such as Finance, Weather, Traffic, etc: each specific domain develops their own methods that rarely utilize advances in neural network architectures of time series community in recent 2-3 years. As a result, we call for the time series community to shift focus away from research on time series neural network architectures for general domains: these researches have become saturated and away from domain-specific SOTAs over time. We should either (1) focus on deep learning methods for certain specific domain(s), or (2) turn to the development of meta-learning methods for general domains.

[1249] Rethinking LoRA for Data Heterogeneous Federated Learning: Subspace and State Alignment

Hongyi Peng, Han Yu, Xiaoxiao Li, Qiang Yang

Main category: cs.LG

TL;DR: FedGaLore improves federated fine-tuning with LoRA by addressing update-space and optimizer-state mismatches in non-IID settings through gradient-subspace optimization and drift-robust synchronization.

Details

Motivation: Low-Rank Adaptation (LoRA) underperforms full-parameter fine-tuning in federated learning under non-IID data distributions, creating a performance gap that needs to be addressed.

Method: FedGaLore combines client-side GaLore-style gradient-subspace optimization with server-side drift-robust synchronization of projected second-moment states using spectral shared-signal extraction.

Result: FedGaLore improves robustness and accuracy over state-of-the-art federated LoRA baselines across NLU, vision, and NLG benchmarks in non-IID settings.

Conclusion: The proposed FedGaLore effectively addresses the performance gap of LoRA in federated fine-tuning under non-IID conditions by solving coupled mismatches in update space and optimizer states.

Abstract: Low-Rank Adaptation (LoRA) is widely used for federated fine-tuning. Yet under non-IID settings, it can substantially underperform full-parameter fine-tuning. Through with-high-probability robustness analysis, we uncover that this gap can be attributed to two coupled mismatches: (i) update-space mismatch, where clients optimize in a low-rank subspace but aggregation occurs in the full space; and (ii) optimizer-state mismatch, where unsynchronized adaptive states amplify drift across rounds. We propose FedGaLore, which combines client-side GaLore-style gradient-subspace optimization with server-side drift-robust synchronization of projected second-moment states via spectral shared-signal extraction, to address this challenge. Across NLU, vision, and NLG benchmarks, FedGaLore improves robustness and accuracy over state-of-the-art federated LoRA baselines in non-IID settings.

[1250] MGKAN: Predicting Asymmetric Drug-Drug Interactions via a Multimodal Graph Kolmogorov-Arnold Network

Kunyi Fan, Mengjie Chen, Longlong Li, Cunquan Qu

Main category: cs.LG

TL;DR: MGKAN is a Graph Kolmogorov-Arnold Network for predicting asymmetric drug-drug interactions using learnable basis functions and multi-view network integration.

Details

Motivation: Existing GNN models for DDI prediction rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns in drug interactions, which are often asymmetric in nature.

Method: Proposes MGKAN that replaces conventional MLP transformations with KAN-driven basis functions for expressive nonlinear modeling. Integrates three network views (asymmetric DDI network, co-interaction network, biochemical similarity network) with role-specific embeddings, and uses a fusion module combining linear attention and nonlinear transformation.

Result: MGKAN outperforms seven state-of-the-art baselines on two benchmark datasets. Ablation studies and case studies confirm its predictive accuracy and effectiveness in modeling directional drug effects.

Conclusion: MGKAN provides a more expressive framework for asymmetric DDI prediction by leveraging learnable basis functions and multi-view network integration, effectively capturing nonlinear and heterogeneous patterns in drug interactions.

Abstract: Predicting drug-drug interactions (DDIs) is essential for safe pharmacological treatments. Previous graph neural network (GNN) models leverage molecular structures and interaction networks but mostly rely on linear aggregation and symmetric assumptions, limiting their ability to capture nonlinear and heterogeneous patterns. We propose MGKAN, a Graph Kolmogorov-Arnold Network that introduces learnable basis functions into asymmetric DDI prediction. MGKAN replaces conventional MLP transformations with KAN-driven basis functions, enabling more expressive and nonlinear modeling of drug relationships. To capture pharmacological dependencies, MGKAN integrates three network views-an asymmetric DDI network, a co-interaction network, and a biochemical similarity network-with role-specific embeddings to preserve directional semantics. A fusion module combines linear attention and nonlinear transformation to enhance representational capacity. On two benchmark datasets, MGKAN outperforms seven state-of-the-art baselines. Ablation studies and case studies confirm its predictive accuracy and effectiveness in modeling directional drug effects.

[1251] A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention

Xiaowei Ye, Xiaoyu He, Chao Liao, Chen Wu, Pinyan Lu

Main category: cs.LG

TL;DR: Theoretical analysis showing a clear expressive power hierarchy between full attention and linear/hybrid attention mechanisms, with full attention being exponentially more powerful for sequential function composition tasks.

Details

Motivation: While efficient attention mechanisms (linear, hybrid) have been developed to mitigate the quadratic complexity of full attention, there's a fundamental gap in understanding their expressive power relative to full attention. The paper aims to provide rigorous theoretical characterization of performance differences among these attention mechanisms.

Method: Theoretical analysis establishing an expressiveness hierarchy. The theory applies to all linear attention variants that can be formulated as recurrence (including Mamba, DeltaNet). The analysis focuses on sequential function composition as a multi-step reasoning task that must occur within a model’s forward pass.

Result: For sequential function composition tasks, an (L+1)-layer full attention network is sufficient, whereas any hybrid network interleaving L-1 layers of full attention with a substantially larger number (2^{3L^2}) of linear attention layers cannot solve it. This demonstrates a clear separation in expressive power between the two attention types.

Conclusion: The work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms in transformer architectures.

Abstract: Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba, DeltaNet, etc. Specifically, we establish an expressiveness hierarchy: for the sequential function composition-a multi-step reasoning task that must occur within a model’s forward pass, an ($L+1$)-layer full attention network is sufficient, whereas any hybrid network interleaving $L-1$ layers of full attention with a substantially larger number ($2^{3L^2}$) of linear attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective for understanding the fundamental capabilities and limitations of different attention mechanisms.

[1252] CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling

Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng

Main category: cs.LG

TL;DR: CoMeT enables LLMs to process arbitrarily long sequences with constant memory and linear time via a dual-memory system and pipeline parallelism.

Details

Motivation: Standard Transformers have quadratic complexity and growing KV cache that limits long-context processing; need efficient architecture for arbitrarily long sequences.

Method: Collaborative Memory Transformer (CoMeT) with dual-memory system: temporary FIFO queue for recent events and global memory with gated update for long-range dependencies; uses layer-level pipeline parallelism for efficient fine-tuning.

Result: CoMeT can retrieve passkey from any position in 1M token sequence; surpasses other efficient methods on SCROLLS benchmark; comparable to full-attention baseline on summarization; validated on real-world agent and QA tasks.

Conclusion: CoMeT enables efficient long-context processing with constant memory and linear time, achieving strong performance on various long-sequence tasks.

Abstract: The quadratic complexity and indefinitely growing key-value (KV) cache of standard Transformers pose a major barrier to long-context processing. To overcome this, we introduce the Collaborative Memory Transformer (CoMeT), a novel architecture that enables LLMs to handle arbitrarily long sequences with constant memory usage and linear time complexity. Designed as an efficient, plug-in module, CoMeT can be integrated into pre-trained models with only minimal fine-tuning. It operates on sequential data chunks, using a dual-memory system to manage context: a temporary memory on a FIFO queue for recent events, and a global memory with a gated update rule for long-range dependencies. These memories then act as a dynamic soft prompt for the next chunk. To enable efficient fine-tuning on extremely long contexts, we introduce a novel layer-level pipeline parallelism strategy. The effectiveness of our approach is remarkable: a model equipped with CoMeT and fine-tuned on 32k contexts can accurately retrieve a passkey from any position within a 1M token sequence. On the SCROLLS benchmark, CoMeT surpasses other efficient methods and achieves performance comparable to a full-attention baseline on summarization tasks. Its practical effectiveness is further validated on real-world agent and user behavior QA tasks. The code is available at: https://anonymous.4open.science/r/comet-B00B/

[1253] IRIS: Implicit Reward-Guided Internal Sifting for Mitigating Multimodal Hallucination

Yuanshuai Li, Yuping Yan, Jirui Han, Fei Ming, Lingjuan Lv, Yaochu Jin

Main category: cs.LG

TL;DR: IRIS is a novel alignment method for Multimodal LLMs that uses implicit rewards in log-probability space to address hallucinations by capturing internal modal conflicts, requiring minimal data and no external feedback.

Details

Motivation: Existing DPO approaches for MLLMs rely on costly external evaluators for scoring/rewriting, creating off-policy learnability gaps and discretization loss. External feedback overlooks fine-grained conflicts between modalities that cause hallucinations during generation.

Method: IRIS leverages continuous implicit rewards in native log-probability space to preserve full information density and capture internal modal competition. It uses an on-policy paradigm with self-generated preference pairs, sifting them based on multimodal implicit rewards to directly resolve modal conflicts.

Result: Extensive experiments show IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment.

Conclusion: IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations by addressing modal conflicts through implicit reward-guided internal sifting.

Abstract: Hallucination remains a fundamental challenge for Multimodal Large Language Models (MLLMs). While Direct Preference Optimization (DPO) is a key alignment framework, existing approaches often rely heavily on costly external evaluators for scoring or rewriting, incurring off-policy learnability gaps and discretization loss. Due to the lack of access to internal states, such feedback overlooks the fine-grained conflicts between different modalities that lead to hallucinations during generation. To address this issue, we propose IRIS (Implicit Reward-Guided Internal Sifting), which leverages continuous implicit rewards in the native log-probability space to preserve full information density and capture internal modal competition. This on-policy paradigm eliminates learnability gaps by utilizing self-generated preference pairs. By sifting these pairs based on multimodal implicit rewards, IRIS ensures that optimization is driven by signals that directly resolve modal conflicts. Extensive experiments demonstrate that IRIS achieves highly competitive performance on key hallucination benchmarks using only 5.7k samples, without requiring any external feedback during preference alignment. These results confirm that IRIS provides an efficient and principled paradigm for mitigating MLLM hallucinations.

[1254] DIA-CLIP: a universal representation learning framework for zero-shot DIA proteomics

Yucheng Liao, Han Wen, Weinan E, Weijie Zhang

Main category: cs.LG

TL;DR: DIA-CLIP is a pre-trained model that uses cross-modal representation learning for peptide-spectrum matching in mass spectrometry, eliminating the need for per-run semi-supervised training and achieving zero-shot inference with improved accuracy.

Details

Motivation: Current DIA-MS analysis frameworks require semi-supervised training within each run for peptide-spectrum match re-scoring, which is prone to overfitting and lacks generalizability across diverse species and experimental conditions.

Method: DIA-CLIP integrates dual-encoder contrastive learning framework with encoder-decoder architecture to establish unified cross-modal representations for peptides and corresponding spectral features, enabling high-precision, zero-shot PSM inference.

Result: DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to 45% increase in protein identification while achieving 12% reduction in entrapment identifications across diverse benchmarks.

Conclusion: DIA-CLIP shifts DIA analysis from semi-supervised training to universal cross-modal representation learning, with potential applications in single-cell and spatial proteomics for biomarker discovery and cellular mechanism elucidation.

Abstract: Data-independent acquisition mass spectrometry (DIA-MS) has established itself as a cornerstone of proteomic profiling and large-scale systems biology, offering unparalleled depth and reproducibility. Current DIA analysis frameworks, however, require semi-supervised training within each run for peptide-spectrum match (PSM) re-scoring. This approach is prone to overfitting and lacks generalizability across diverse species and experimental conditions. Here, we present DIA-CLIP, a pre-trained model shifting the DIA analysis paradigm from semi-supervised training to universal cross-modal representation learning. By integrating dual-encoder contrastive learning framework with encoder-decoder architecture, DIA-CLIP establishes a unified cross-modal representation for peptides and corresponding spectral features, achieving high-precision, zero-shot PSM inference. Extensive evaluations across diverse benchmarks demonstrate that DIA-CLIP consistently outperforms state-of-the-art tools, yielding up to a 45% increase in protein identification while achieving a 12% reduction in entrapment identifications. Moreover, DIA-CLIP holds immense potential for diverse practical applications, such as single-cell and spatial proteomics, where its enhanced identification depth facilitates the discovery of novel biomarkers and the elucidates of intricate cellular mechanisms.

[1255] Position: Beyond Model-Centric Prediction – Agentic Time Series Forecasting

Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen

Main category: cs.LG

TL;DR: The paper proposes Agentic Time Series Forecasting (ATSF), reframing forecasting as an agentic process with perception, planning, action, reflection, and memory components, moving beyond traditional model-centric approaches.

Details

Motivation: Traditional time series forecasting is model-centric, static, and single-pass, which is insufficient for adaptive, multi-turn settings requiring informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time.

Method: Proposes ATSF framework with three implementation paradigms: workflow-based design, agentic reinforcement learning, and a hybrid agentic workflow paradigm. Emphasizes organizing forecasting as an agentic workflow that can interact with tools, incorporate feedback, and evolve through experience.

Result: Position paper establishing agentic forecasting as a foundation for future research at the intersection of time series forecasting, outlining opportunities and challenges in shifting from model-centric to agentic approaches.

Conclusion: Agentic forecasting represents a paradigm shift that could address limitations of traditional approaches by enabling adaptive, multi-turn forecasting with reasoning, iterative refinement, and continual learning capabilities.

Abstract: Time series forecasting has traditionally been formulated as a model-centric, static, and single-pass prediction problem that maps historical observations to future values. While this paradigm has driven substantial progress, it proves insufficient in adaptive and multi-turn settings where forecasting requires informative feature extraction, reasoning-driven inference, iterative refinement, and continual adaptation over time. In this paper, we argue for agentic time series forecasting (ATSF), which reframes forecasting as an agentic process composed of perception, planning, action, reflection, and memory. Rather than focusing solely on predictive models, ATSF emphasizes organizing forecasting as an agentic workflow that can interact with tools, incorporate feedback from outcomes, and evolve through experience accumulation. We outline three representative implementation paradigms – workflow-based design, agentic reinforcement learning, and a hybrid agentic workflow paradigm – and discuss the opportunities and challenges that arise when shifting from model-centric prediction to agentic forecasting. Together, this position aims to establish agentic forecasting as a foundation for future research at the intersection of time series forecasting.

[1256] Stein-Rule Shrinkage for Stochastic Gradient Estimation in High Dimensions

M. Arashi, M. Amintoosi

Main category: cs.LG

TL;DR: Proposes a shrinkage gradient estimator using Stein-rule shrinkage to improve stochastic gradient estimation in deep learning, showing it dominates standard stochastic gradients under squared error loss and improves Adam optimizer performance.

Details

Motivation: Standard stochastic gradient methods treat mini-batch gradients as unbiased estimators, but in high-dimensional settings, unbiased estimators are generally inadmissible under quadratic loss, suggesting standard stochastic gradients may be suboptimal from a risk perspective.

Method: Formulates stochastic gradient computation as a high-dimensional estimation problem and introduces a decision-theoretic framework based on Stein-rule shrinkage. Constructs a shrinkage gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator derived from historical momentum, with shrinkage intensity determined using online estimates of gradient noise variance.

Result: The proposed estimator uniformly dominates standard stochastic gradient under squared error loss for dimension p>=3 and is minimax-optimal. When incorporated into Adam optimizer, it shows consistent improvements on CIFAR10 and CIFAR100 across multiple levels of label noise, particularly in large-batch regimes.

Conclusion: Classical shrinkage principles provide a principled and effective approach to improving stochastic gradient estimation in modern deep learning, with gains primarily from selectively applying shrinkage to high-dimensional convolutional layers.

Abstract: Stochastic gradient methods are central to large-scale learning, yet their analysis typically treats mini-batch gradients as unbiased estimators of the population gradient. In high-dimensional settings, however, classical results from statistical decision theory show that unbiased estimators are generally inadmissible under quadratic loss, suggesting that standard stochastic gradients may be suboptimal from a risk perspective. In this work, we formulate stochastic gradient computation as a high-dimensional estimation problem and introduce a decision-theoretic framework based on Stein-rule shrinkage. We construct a shrinkage gradient estimator that adaptively contracts noisy mini-batch gradients toward a stable restricted estimator derived from historical momentum. The shrinkage intensity is determined in a data-driven manner using an online estimate of gradient noise variance, leveraging second-moment statistics commonly maintained by adaptive optimization methods. Under a Gaussian noise model and for dimension p>=3, we show that the proposed estimator uniformly dominates the standard stochastic gradient under squared error loss and is minimax-optimal in the classical decision-theoretic sense. We further demonstrate how this estimator can be incorporated into the Adam optimizer, yielding a practical algorithm with negligible additional computational cost. Empirical evaluations on CIFAR10 and CIFAR100, across multiple levels of label noise, show consistent improvements over Adam in the large-batch regime. Ablation studies indicate that the gains arise primarily from selectively applying shrinkage to high-dimensional convolutional layers, while indiscriminate shrinkage across all parameters degrades performance. These results illustrate that classical shrinkage principles provide a principled and effective approach to improving stochastic gradient estimation in modern deep learning.

[1257] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning

Zheng Zhang, Ao Lu, Yuanhao Zeng, Ziwei Shan, Jinjin Guo, Lufei Li, Yexin Li, Kan Ren

Main category: cs.LG

TL;DR: Grad2Reward extracts dense process rewards from LLM Judges via gradient attribution for better RL in open-ended tasks.

Details

Motivation: Current RL with LLM-as-Judge provides sparse rewards that lack fine-grained supervision for complex long-form generation tasks, and treats the Judge as a black-box without leveraging its intermediate feedback signals.

Method: Grad2Reward framework extracts dense process rewards from the Judge’s model inference process using a single backward pass and gradient-based attribution, enabling token-level credit assignment. Includes self-judging mechanism for policy improvement without external Judges.

Result: Policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks, demonstrating effectiveness and broad generalizability.

Conclusion: Grad2Reward addresses limitations of sparse rewards in open-ended RL by extracting dense process-level feedback from LLM Judges, significantly improving training efficiency and reasoning quality.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant breakthroughs in complex LLM reasoning within verifiable domains, such as mathematics and programming. Recent efforts have sought to extend this paradigm to open-ended tasks by employing LLMs-as-a-Judge to provide sequence-level rewards for policy optimization. However, these rewards are inherently sparse, failing to provide the fine-grained supervision necessary for generating complex, long-form trajectories. Furthermore, current work treats the Judge as a black-box oracle, discarding the rich intermediate feedback signals encoded in it. To address these limitations, we introduce Grad2Reward, a novel framework that extracts dense process rewards directly from the Judge’s model inference process via a single backward pass. By leveraging gradient-based attribution, Grad2Reward enables precise token-level credit assignment, substantially enhancing training efficiency and reasoning quality. Additionally, Grad2Reward introduces a self-judging mechanism, allowing the policy to improve through its own evaluative signals without training specialized reward models or reliance on superior external Judges. The experiments demonstrate that policies optimized with Grad2Reward achieve outstanding performance across diverse open-ended tasks, affirming its effectiveness and broad generalizability.

[1258] Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, Haoyuan Li

Main category: cs.LG

TL;DR: Analysis of RL training instability in LLMs, attributing it to dynamic gradient noise and training-inference mismatch that escalates over time, with a solution using response length as an early-warning signal to trigger learning rate decay.

Details

Motivation: RL training for Large Language Models is notoriously unstable, and while recent studies attribute this to "training inference mismatch," standard remedies like Importance Sampling often fail during extended training runs. The paper aims to understand and address this instability through an optimization lens.

Method: Analyzes RL instability through optimization perspective, showing gradient noise and training-inference mismatch escalate together. Proposes a specialized Learning Rate scheduler that dynamically triggers LR decay based on response length, which serves as an early-warning signal for impending instability, rather than using pre-defined decay schedules.

Result: Empirical evidence shows that by reducing learning rate as gradient noise rises, RL training can be consistently stabilized and training-inference mismatch kept at safe levels. Response length proves to be a reliable indicator for triggering timely LR adjustments.

Conclusion: Training-inference mismatch in RL for LLMs is not just a static numerical discrepancy but a dynamic failure coupled with model optimization. A simple yet effective solution using response length as an early-warning signal for learning rate decay can effectively stabilize RL training.

Abstract: Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to “training inference mismatch stemming” from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model’s optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of pre-defined decay schedule in traditional LR scheduler, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal for impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.

[1259] DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis

Ru Zhang, Xunkai Li, Yaxin Deng, Sicheng Liu, Daohan Su, Qiangqiang Dai, Hongchao Qin, Rong-Hua Li, Guoren Wang, Jia Li

Main category: cs.LG

TL;DR: DOGMA is a data-centric framework for single-cell transcriptomics that uses biological prior knowledge (Statistical Anchors, Cell Ontology, Phylogenetic Trees, Gene Ontology) to structure raw sequencing data and enhance semantic understanding, achieving state-of-the-art performance with better robustness and efficiency.

Details

Motivation: Current single-cell transcriptomics methods either treat cells as independent entities (overlooking intercellular relationships) or use heuristic rules that neglect biological prior knowledge, leading to suboptimal graph representations and poor ML model utility.

Method: DOGMA integrates multi-level biological prior knowledge: Statistical Anchors with Cell Ontology and Phylogenetic Trees for deterministic graph construction and cross-species alignment, and Gene Ontology for feature-level semantic enhancement of raw sequencing data.

Result: DOGMA achieves state-of-the-art performance on complex multi-species and multi-organ benchmarks, demonstrating superior zero-shot robustness, sample efficiency, and significantly lower computational cost compared to existing methods.

Conclusion: Incorporating biological prior knowledge through DOGMA’s holistic framework enables more effective structural reshaping and semantic enhancement of single-cell transcriptomics data, overcoming limitations of heuristic approaches and improving ML model utility.

Abstract: Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequence data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, thereby hindering the utility of ML models. To address them, we propose DOGMA, a holistic data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on stochastic heuristics, DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees to enable deterministic structure discovery and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA achieves SOTA performance, exhibiting superior zero-shot robustness and sample efficiency while operating with significantly lower computational cost.

[1260] Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment

Dionisia Naddeo, Jonas Linkerhägner, Nicola Toschi, Geri Skenderi, Veronica Lachi

Main category: cs.LG

TL;DR: HGNNs are effective for tree-like graphs when tasks align with hyperbolic geometry, but their advantage disappears when tasks don’t preserve metric structure.

Details

Motivation: To challenge the prevailing assumption that hyperbolic geometry is always beneficial for tree-like graphs by introducing the concept of geometry-task alignment - whether the task requires preserving the metric structure of the input graph.

Method: Proposes geometry-task alignment condition, theoretically and empirically analyzes HGNNs on synthetic regression problems, and evaluates on link prediction and node classification tasks while jointly analyzing predictive performance and embedding distortion.

Result: HGNNs can recover low-distortion representations when geometry-task alignment exists, outperform Euclidean models for geometry-aligned tasks like link prediction, but show no advantage for non-aligned tasks like node classification.

Conclusion: The focus should shift from just asking “Is the graph hyperbolic?” to also questioning “Is the task aligned with hyperbolic geometry?” - HGNNs are beneficial only when both conditions are met.

Abstract: Many complex networks exhibit hyperbolic structural properties, making hyperbolic space a natural candidate for representing hierarchical and tree-like graphs with low distortion. Based on this observation, Hyperbolic Graph Neural Networks (HGNNs) have been widely adopted as a principled choice for representation learning on tree-like graphs. In this work, we question this paradigm by proposing an additional condition of geometry-task alignment, i.e., whether the metric structure of the target follows that of the input graph. We theoretically and empirically demonstrate the capability of HGNNs to recover low-distortion representations on two synthetic regression problems, and show that their geometric inductive bias becomes helpful when the problem requires preserving metric structure. Additionally, we evaluate HGNNs on the tasks of link prediction and node classification by jointly analyzing predictive performance and embedding distortion, revealing that only link prediction is geometry-aligned. Overall, our findings shift the focus from only asking “Is the graph hyperbolic?” to also questioning “Is the task aligned with hyperbolic geometry?”, showing that HGNNs consistently outperform Euclidean models under such alignment, while their advantage vanishes otherwise.

[1261] Time2Vec-Integrated Transformer for Robust Gesture Recognition from Low-Density sEMG

Blagoj Hristov, Hristijan Gjoreski, Vesna Ojleska Latkoska, Gorjan Nadzinski

Main category: cs.LG

TL;DR: A novel deep learning framework using hybrid Transformer with Time2Vec embeddings achieves precise myoelectric prosthesis control with minimal 2-channel sEMG sensors, achieving 95.7% F1-score for 10-class movements.

Details

Motivation: Current myoelectric prosthesis control relies on complex, dense multi-sensor arrays that limit consumer accessibility. The paper aims to develop a data-efficient framework that achieves precise control using minimal sensor hardware.

Method: Hybrid Transformer optimized for sparse 2-channel sEMG with Time2Vec learnable temporal embeddings to capture biological signal stochasticity. Uses normalized additive fusion to align spatial and temporal feature distributions, and two-stage curriculum learning for robust feature extraction despite data scarcity.

Result: Achieves state-of-the-art multi-subject F1-score of 95.7% ± 0.20% for 10-class movement set, statistically outperforming standard Transformer and CNN-LSTM models. Rapid calibration with just 2 trials per gesture recovers performance from 21.0% to 96.9% for unseen subjects.

Conclusion: High-fidelity temporal embeddings can compensate for low spatial resolution, challenging the necessity of high-density sensing. The framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces with rapid personalization capabilities.

Abstract: Accurate and responsive myoelectric prosthesis control typically relies on complex, dense multi-sensor arrays, which limits consumer accessibility. This paper presents a novel, data-efficient deep learning framework designed to achieve precise and accurate control using minimal sensor hardware. Leveraging an external dataset of 8 subjects, our approach implements a hybrid Transformer optimized for sparse, two-channel surface electromyography (sEMG). Unlike standard architectures that use fixed positional encodings, we integrate Time2Vec learnable temporal embeddings to capture the stochastic temporal warping inherent in biological signals. Furthermore, we employ a normalized additive fusion strategy that aligns the latent distributions of spatial and temporal features, preventing the destructive interference common in standard implementations. A two-stage curriculum learning protocol is utilized to ensure robust feature extraction despite data scarcity. The proposed architecture achieves a state-of-the-art multi-subject F1-score of 95.7% $\pm$ 0.20% for a 10-class movement set, statistically outperforming both a standard Transformer with fixed encodings and a recurrent CNN-LSTM model. Architectural optimization reveals that a balanced allocation of model capacity between spatial and temporal dimensions yields the highest stability. Furthermore, while direct transfer to a new unseen subject led to poor accuracy due to domain shifts, a rapid calibration protocol utilizing only two trials per gesture recovered performance from 21.0% $\pm$ 2.98% to 96.9% $\pm$ 0.52%. By validating that high-fidelity temporal embeddings can compensate for low spatial resolution, this work challenges the necessity of high-density sensing. The proposed framework offers a robust, cost-effective blueprint for next-generation prosthetic interfaces capable of rapid personalization.

[1262] Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models

Jinbin Bai, Yixuan Li, Yuchen Zhu, Yi Xin, Qingyu Shi, Aosong Feng, Xiaohong Liu, Molei Tao, Jianru Xue, Xiangtai Li, Ming-Hsuan Yang

Main category: cs.LG

TL;DR: Prism is an efficient test-time scaling framework for discrete diffusion language models that uses hierarchical trajectory search, local branching with partial remasking, and self-verified feedback to improve reasoning performance with fewer function evaluations.

Details

Motivation: Current test-time scaling methods rely on autoregressive decoding, which doesn't work well with discrete diffusion language models due to their parallel decoding nature. There's a need for efficient TTS methods to unlock dLLMs' full generative potential for reasoning tasks.

Method: Prism combines three techniques: (1) Hierarchical Trajectory Search that dynamically prunes and reallocates compute during early-to-mid denoising, (2) Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (3) Self-Verified Feedback using self-evaluation prompts instead of external verifiers.

Result: On four mathematical reasoning and code generation benchmarks using three dLLMs (LLaDA 8B Instruct, Dream 7B Instruct, LLaDA 2.0-mini), Prism achieves favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations.

Conclusion: Prism provides an effective and efficient test-time scaling framework for discrete diffusion language models, enabling improved reasoning capabilities while maintaining computational efficiency through intelligent compute allocation and self-verification.

Abstract: Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding, which is ill-suited to discrete diffusion language models (dLLMs) due to their parallel decoding over the entire sequence. As a result, developing effective and efficient TTS methods to unlock dLLMs’ full generative potential remains an underexplored challenge. To address this, we propose Prism (Pruning, Remasking, and Integrated Self-verification Method), an efficient TTS framework for dLLMs that (i) performs Hierarchical Trajectory Search (HTS) which dynamically prunes and reallocates compute in an early-to-mid denoising window, (ii) introduces Local branching with partial remasking to explore diverse implementations while preserving high-confidence tokens, and (iii) replaces external verifiers with Self-Verified Feedback (SVF) obtained via self-evaluation prompts on intermediate completions. Across four mathematical reasoning and code generation benchmarks on three dLLMs, including LLaDA 8B Instruct, Dream 7B Instruct, and LLaDA 2.0-mini, our Prism achieves a favorable performance-efficiency trade-off, matching best-of-N performance with substantially fewer function evaluations (NFE). The code is released at https://github.com/viiika/Prism.

[1263] No Generation without Representation: Efficient Causal Protein Language Models Enable Zero-Shot Fitness Estimation

Furkan Eris

Main category: cs.LG

TL;DR: Proust is a 309M-parameter causal protein language model that bridges the gap between masked language models (good for fitness prediction) and causal models (good for generation), achieving competitive performance on protein fitness prediction while retaining native generative capabilities.

Details

Motivation: Protein language models face a fundamental divide: masked language models excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. The authors aim to bridge this gap with a single model that can do both well.

Method: Proust uses architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. It’s a 309M-parameter causal PLM trained on 33B tokens in 40 B200 GPU-hours.

Result: Proust achieves Spearman ρ=0.390 on ProteinGym substitutions, competitive with MLMs requiring 50-200× the compute. On indels, it sets new SOTA, outperforming models up to 20× larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone.

Conclusion: Proust bridges the gap between fitness prediction and generation in protein language models, achieving strong performance on both tasks with a single architecture. The model retains native generative capabilities that MLMs lack, positioning it in a sweet spot for protein modeling.

Abstract: Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce \textbf{Proust}, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman $ρ= 0.390$ on ProteinGym substitutions, competitive with MLMs requiring 50–200$\times$ the compute. On indels, Proust sets a new state-of-the-art, outperforming models up to 20$\times$ larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These powerful representations position Proust in a sweet spot as it also retains native generative capabilities that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance predicts, to an extent, when retrieval augmentation helps and hurts. Such insights can grow in both quantity and quality at scale and inform capabilities such as test-time scaling. Code and weights are available at https://github.com/Furkan9015/proust-inference

[1264] Self-Rewarding Sequential Monte Carlo for Masked Diffusion Language Models

Ziwei Luo, Ziqi Jin, Lei Wang, Lidong Bing, Thomas B. Schön

Main category: cs.LG

TL;DR: Self-rewarding sequential Monte Carlo (SMC) improves masked diffusion language model sampling by using multiple parallel diffusion processes with trajectory-level confidence as self-rewarding signals for better sample diversity and quality.

Details

Motivation: Existing masked diffusion language models rely on greedy confidence-based sampling that restricts generation diversity and is noise-sensitive, leading to inevitable collapse in possible paths.

Method: Launches multiple interacting diffusion processes (particles) in parallel for trajectory exploration, uses trajectory-level confidence as self-rewarding signal for particle importance weights, and iteratively weights/resamples particles to steer generation toward globally confident, high-quality samples.

Result: Achieves significant improvement on various masked diffusion language models and benchmarks without extra training or reward guidance, effectively converting parallel inference capacity into improved sampling quality.

Conclusion: Self-rewarding SMC provides an effective inference-time scaling algorithm that addresses diversity collapse in masked diffusion language models through parallel particle exploration with trajectory-level confidence rewards.

Abstract: This work presents self-rewarding sequential Monte Carlo (SMC), an inference-time scaling algorithm enabling effective sampling of masked diffusion language models (MDLMs). Our algorithm stems from the observation that most existing MDLMs rely on a confidence-based sampling strategy, where only tokens with the highest prediction confidence are preserved at each step. This restricts the generation to a noise-sensitive, greedy decoding paradigm, resulting in an inevitable collapse in the diversity of possible paths. We address this problem by launching multiple interacting diffusion processes in parallel, referred to as particles, for trajectory exploration. Importantly, we introduce the trajectory-level confidence as a self-rewarding signal for assigning particle importance weights. During sampling, particles are iteratively weighted and resampled to systematically steer generation towards globally confident, high-quality samples. Our self-rewarding SMC is verified on various masked diffusion language models and benchmarks, achieving significant improvement without extra training or reward guidance, while effectively converting parallel inference capacity into improved sampling quality. Our code is available at https://github.com/Algolzw/self-rewarding-smc.

[1265] VLM-Guided Experience Replay

Elad Sharony, Tom Jurgenson, Orr Krupnik, Dotan Di Castro, Shie Mannor

Main category: cs.LG

TL;DR: Using frozen pre-trained Vision-Language Models to prioritize experiences in reinforcement learning replay buffers, improving sample efficiency and success rates across game and robotics domains.

Details

Motivation: While LLMs and VLMs have been integrated into various RL components, the replay buffer remains unexplored. The paper aims to leverage VLMs' semantic reasoning to intelligently prioritize experiences for more efficient learning.

Method: Uses a frozen, pre-trained VLM as an automated evaluator to identify and prioritize promising sub-trajectories from agent experiences. No fine-tuning required. The VLM guides replay buffer prioritization across discrete and continuous domains.

Result: Agents achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches across game-playing and robotics scenarios.

Conclusion: VLMs can effectively guide replay buffer prioritization, demonstrating significant improvements in RL performance and sample efficiency without requiring model fine-tuning.

Abstract: Recent advances in Large Language Models (LLMs) and Vision-Language Models (VLMs) have enabled powerful semantic and multimodal reasoning capabilities, creating new opportunities to enhance sample efficiency, high-level planning, and interpretability in reinforcement learning (RL). While prior work has integrated LLMs and VLMs into various components of RL, the replay buffer, a core component for storing and reusing experiences, remains unexplored. We propose addressing this gap by leveraging VLMs to guide the prioritization of experiences in the replay buffer. Our key idea is to use a frozen, pre-trained VLM (requiring no fine-tuning) as an automated evaluator to identify and prioritize promising sub-trajectories from the agent’s experiences. Across scenarios, including game-playing and robotics, spanning both discrete and continuous domains, agents trained with our proposed prioritization method achieve 11-52% higher average success rates and improve sample efficiency by 19-45% compared to previous approaches. https://esharony.me/projects/vlm-rb/

[1266] FUPareto: Bridging the Forgetting-Utility Gap in Federated Unlearning via Pareto Augmented Optimization

Zeyan Wang, Zhengmao Liu, Yongxin Cai, Chi Li, Xiaoying Tang, Jingchao Chen, Zibin Pan, Jing Qiu

Main category: cs.LG

TL;DR: FUPareto: A federated unlearning framework using Pareto-augmented optimization with Minimum Boundary Shift Loss and Null-Space Projected MGDA for efficient multi-client data removal while preserving model utility.

Details

Motivation: Address three key challenges in Federated Unlearning: (1) existing methods compromise utility or increase vulnerability to membership inference attacks, (2) persistent conflict between forgetting and utility, and (3) poor support for concurrent multi-client unlearning due to gradient conflicts.

Method: Proposes FUPareto framework with Minimum Boundary Shift Loss to suppress target class logits below non-target class logits. Uses Pareto improvement steps to preserve utility and Pareto expansion with Null-Space Projected Multiple Gradient Descent Algorithm to decouple gradient conflicts for concurrent multi-client unlearning.

Result: Extensive experiments show FUPareto consistently outperforms state-of-the-art federated unlearning methods in both unlearning efficacy and retained utility across diverse scenarios.

Conclusion: FUPareto provides an effective solution for federated unlearning that balances forgetting and utility while supporting concurrent multi-client unlearning through Pareto optimization and gradient conflict resolution.

Abstract: Federated Unlearning (FU) aims to efficiently remove the influence of specific client data from a federated model while preserving utility for the remaining clients. However, three key challenges remain: (1) existing unlearning objectives often compromise model utility or increase vulnerability to Membership Inference Attacks (MIA); (2) there is a persistent conflict between forgetting and utility, where further unlearning inevitably harms retained performance; and (3) support for concurrent multi-client unlearning is poor, as gradient conflicts among clients degrade the quality of forgetting. To address these issues, we propose FUPareto, an efficient unlearning framework via Pareto-augmented optimization. We first introduce the Minimum Boundary Shift (MBS) Loss, which enforces unlearning by suppressing the target class logit below the highest non-target class logit; this can improve the unlearning efficiency and mitigate MIA risks. During the unlearning process, FUPareto performs Pareto improvement steps to preserve model utility and executes Pareto expansion to guarantee forgetting. Specifically, during Pareto expansion, the framework integrates a Null-Space Projected Multiple Gradient Descent Algorithm (MGDA) to decouple gradient conflicts. This enables effective, fair, and concurrent unlearning for multiple clients while minimizing utility degradation. Extensive experiments across diverse scenarios demonstrate that FUPareto consistently outperforms state-of-the-art FU methods in both unlearning efficacy and retained utility.

[1267] PIMPC-GNN: Physics-Informed Multi-Phase Consensus Learning for Enhancing Imbalanced Node Classification in Graph Neural Networks

Abdul Joseph Fofanah, Lian Wen, David Chen

Main category: cs.LG

TL;DR: PIMPC-GNN is a physics-informed multi-phase consensus framework for imbalanced node classification in graph neural networks, combining thermodynamic diffusion, Kuramoto synchronization, and spectral embedding to improve minority class performance.

Details

Motivation: GNNs struggle with class-imbalanced settings where minority classes are under-represented, leading to biased predictions toward majority classes. Existing methods often fail to effectively handle severe imbalance ratios and capture long-range dependencies for minority nodes.

Method: Proposes a physics-informed multi-phase consensus framework integrating three complementary dynamics: (1) thermodynamic diffusion for spreading minority labels and capturing long-range dependencies, (2) Kuramoto synchronization for aligning minority nodes through oscillatory consensus, and (3) spectral embedding for separating classes via structural regularization. These are combined through class-adaptive ensemble weighting and trained with an imbalance-aware loss that couples balanced cross-entropy with physics-based constraints.

Result: Outperforms 16 state-of-the-art baselines across five benchmark datasets with imbalance ratios from 5-100, achieving notable gains in minority-class recall (up to +12.7%) and balanced accuracy (up to +8.3%). The framework also provides interpretable insights into consensus dynamics in graph learning.

Conclusion: PIMPC-GNN effectively addresses class imbalance in GNNs through a physics-informed multi-phase consensus approach, demonstrating significant improvements in minority class performance while offering interpretable insights into the underlying dynamics.

Abstract: Graph neural networks (GNNs) often struggle in class-imbalanced settings, where minority classes are under-represented and predictions are biased toward majorities. We propose \textbf{PIMPC-GNN}, a physics-informed multi-phase consensus framework for imbalanced node classification. Our method integrates three complementary dynamics: (i) thermodynamic diffusion, which spreads minority labels to capture long-range dependencies, (ii) Kuramoto synchronisation, which aligns minority nodes through oscillatory consensus, and (iii) spectral embedding, which separates classes via structural regularisation. These perspectives are combined through class-adaptive ensemble weighting and trained with an imbalance-aware loss that couples balanced cross-entropy with physics-based constraints. Across five benchmark datasets and imbalance ratios from 5-100, PIMPC-GNN outperforms 16 state-of-the-art baselines, achieving notable gains in minority-class recall (up to +12.7%) and balanced accuracy (up to +8.3%). Beyond empirical improvements, the framework also provides interpretable insights into consensus dynamics in graph learning. The code is available at \texttt{https://github.com/afofanah/PIMPC-GNN}.

[1268] Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

Xiangkun Wu, Qianglin Wen, Yingying Zhang, Hongtu Zhu, Ting Li, Chengchun Shi

Main category: cs.LG

TL;DR: Transformer-based RL approach for optimal A/B testing in time series experiments that conditions on full history and directly optimizes MSE without restrictive assumptions

Details

Motivation: A/B testing for time series experiments is challenging due to sequential policy assignments over time. Existing designs don't leverage full history and rely on strong assumptions to approximate objective functions like MSE.

Method: Proposes a transformer reinforcement learning approach that uses transformers to condition treatment allocation on entire history and employs RL to directly optimize mean squared error without restrictive assumptions.

Result: Empirical evaluations on synthetic data, a dispatch simulator, and real-world ridesharing data show the proposed method consistently outperforms existing designs.

Conclusion: The transformer RL approach effectively addresses limitations of existing time series A/B testing designs by leveraging full history and direct MSE optimization.

Abstract: A/B testing has become a gold standard for modern technological companies to conduct policy evaluation. Yet, its application to time series experiments, where policies are sequentially assigned over time, remains challenging. Existing designs suffer from two limitations: (i) they do not fully leverage the entire history for treatment allocation; (ii) they rely on strong assumptions to approximate the objective function (e.g., the mean squared error of the estimated treatment effect) for optimizing the design. We first establish an impossibility theorem showing that failure to condition on the full history leads to suboptimal designs, due to the dynamic dependencies in time series experiments. To address both limitations simultaneously, we next propose a transformer reinforcement learning (RL) approach which leverages transformers to condition allocation on the entire history and employs RL to directly optimize the MSE without relying on restrictive assumptions. Empirical evaluations on synthetic data, a publicly available dispatch simulator, and a real-world ridesharing dataset demonstrate that our proposal consistently outperforms existing designs.

[1269] COLT: Lightweight Multi-LLM Collaboration through Shared MCTS Reasoning for Model Compilation

Annabelle Sujun Tang, Christopher Priebe, Lianhui Qin, Hadi Esmaeilzadeh

Main category: cs.LG

TL;DR: COLT: A lightweight collaborative multi-LLM framework for compiler optimization that coordinates multiple language models within a single Monte Carlo tree search process, using smaller models primarily while escalating to larger models when needed.

Details

Motivation: Model serving costs dominate AI systems, making compiler optimization essential. While LLMs can guide compiler search, using a single large model is expensive, and smaller models alone are less reliable. The paper investigates whether multi-LLM collaborative reasoning with primarily small LLMs can match or exceed single large model performance.

Method: Proposes COLT framework that enables coordinated reasoning across multiple LLMs within a single Monte Carlo tree search (MCTS) process. Uses a shared MCTS tree as collaboration substrate, allowing reuse of transformation prefixes and cross-model value propagation. Each iteration, the acting LLM proposes a joint action: (compiler transformation, model to be queried next). Includes model-aware tree policy that biases search toward smaller models while preserving exploration, and course-alteration mechanism that escalates to largest model when search shows persistent regressions.

Result: Not specified in the abstract, but the framework aims to achieve performance comparable to or better than single large models while reducing costs through collaborative use of smaller models.

Conclusion: COLT provides a lightweight collaborative approach to compiler optimization that circumvents heavy reasoning mechanisms and conventional agentic machinery by endogenizing model selection within the MCTS optimization loop, potentially enabling cost-effective optimization with multiple LLMs.

Abstract: Model serving costs dominate AI systems, making compiler optimization essential for scalable deployment. Recent works show that a large language model (LLM) can guide compiler search by reasoning over program structure and optimization history. However, using a single large model throughout the search is expensive, while smaller models are less reliable when used alone. Thus, this paper seeks to answer whether multi-LLM collaborative reasoning relying primarily on small LLMs can match or exceed the performance of a single large model. As such, we propose a lightweight collaborative multi-LLM framework, dubbed COLT, for compiler optimization that enables coordinated reasoning across multiple models within a single Monte Carlo tree search (MCTS) process. A key contribution is the use of a single shared MCTS tree as the collaboration substrate across LLMs, enabling the reuse of transformation prefixes and cross-model value propagation. Hence, we circumvent both heavy internal reasoning mechanisms and conventional agentic machinery that relies on external planners, multiple concurrent LLMs, databases, external memory/versioning of intermediate results, and controllers by simply endogenizing model selection within the lightweight MCTS optimization loop. Every iteration, the acting LLM proposes a joint action: (compiler transformation, model to be queried next). We also introduce a model-aware tree policy that biases search toward smaller models while preserving exploration, and a course-alteration mechanism that escalates to the largest model when the search exhibits persistent regressions attributable to smaller models.

[1270] Autocorrelated Optimize-via-Estimate: Predict-then-Optimize versus Finite-sample Optimal

Zichun Wang, Gar Goei Loke, Ruiting Zuo

Main category: cs.LG

TL;DR: A-OVE model for data-driven optimization under autocorrelated uncertainties using VARMA processes, applied to portfolio optimization with trading costs, achieving low regret compared to perfect information oracle.

Details

Motivation: Traditional estimate-then-optimize approaches may not perform well in finite-sample regimes with autocorrelated uncertainties. There's a need for models that directly optimize for out-of-sample performance in such settings, particularly for financial applications like portfolio optimization.

Method: Proposes an autocorrelated Optimize-via-Estimate (A-OVE) model that obtains out-of-sample optimal solutions as functions of sufficient statistics under VARMA(p,q) processes. Develops a recursive form for computing sufficient statistics and evaluates on portfolio optimization with trading costs.

Result: A-OVE achieves low regret relative to perfect information oracle and outperforms predict-then-optimize machine learning benchmarks. Notably shows that machine learning models with higher accuracy can have poorer decision quality. Performance is retained under small model mis-specification.

Conclusion: Direct optimization for out-of-sample performance under autocorrelated uncertainties is effective, with A-OVE demonstrating superior decision quality over traditional ML approaches in portfolio optimization contexts.

Abstract: Models that directly optimize for out-of-sample performance in the finite-sample regime have emerged as a promising alternative to traditional estimate-then-optimize approaches in data-driven optimization. In this work, we compare their performance in the context of autocorrelated uncertainties, specifically, under a Vector Autoregressive Moving Average VARMA(p,q) process. We propose an autocorrelated Optimize-via-Estimate (A-OVE) model that obtains an out-of-sample optimal solution as a function of sufficient statistics, and propose a recursive form for computing its sufficient statistics. We evaluate these models on a portfolio optimization problem with trading costs. A-OVE achieves low regret relative to a perfect information oracle, outperforming predict-then-optimize machine learning benchmarks. Notably, machine learning models with higher accuracy can have poorer decision quality, echoing the growing literature in data-driven optimization. Performance is retained under small mis-specification.

[1271] PIMCST: Physics-Informed Multi-Phase Consensus and Spatio-Temporal Few-Shot Learning for Traffic Flow Forecasting

Abdul Joseph Fofanah, Lian Wen, David Chen

Main category: cs.LG

TL;DR: MCPST is a multi-phase consensus spatio-temporal framework for few-shot traffic forecasting that models traffic dynamics through diffusion, synchronization, and spectral embeddings, with adaptive consensus fusion and meta-learning for cross-city adaptation.

Details

Motivation: Traffic flow prediction is challenging in cross-domain, data-scarce scenarios where limited historical data hinders model training and generalization. Complex spatio-temporal dependencies and nonlinear dynamics of urban mobility networks complicate few-shot learning across different cities.

Method: Proposes MCPST with three innovations: (1) multi-phase engine modeling traffic dynamics through diffusion, synchronization, and spectral embeddings; (2) adaptive consensus mechanism that dynamically fuses phase-specific predictions with consistency enforcement; (3) structured meta-learning strategy for rapid adaptation to new cities with minimal data.

Result: Outperforms fourteen state-of-the-art methods across four real-world datasets in spatio-temporal graph learning, dynamic graph transfer learning, prompt-based spatio-temporal prediction, and cross-domain few-shot settings, improving accuracy while reducing required training data.

Conclusion: MCPST effectively addresses few-shot traffic forecasting through multi-phase consensus learning, providing theoretical guarantees and interpretable insights while demonstrating superior performance in cross-domain scenarios.

Abstract: Accurate traffic flow prediction remains a fundamental challenge in intelligent transportation systems, particularly in cross-domain, data-scarce scenarios where limited historical data hinders model training and generalisation. The complex spatio-temporal dependencies and nonlinear dynamics of urban mobility networks further complicate few-shot learning across different cities. This paper proposes MCPST, a novel Multi-phase Consensus Spatio-Temporal framework for few-shot traffic forecasting that reconceptualises traffic prediction as a multi-phase consensus learning problem. Our framework introduces three core innovations: (1) a multi-phase engine that models traffic dynamics through diffusion, synchronisation, and spectral embeddings for comprehensive dynamic characterisation; (2) an adaptive consensus mechanism that dynamically fuses phase-specific predictions while enforcing consistency; and (3) a structured meta-learning strategy for rapid adaptation to new cities with minimal data. We establish extensive theoretical guarantees, including representation theorems with bounded approximation errors and generalisation bounds for few-shot adaptation. Through experiments on four real-world datasets, MCPST outperforms fourteen state-of-the-art methods in spatio-temporal graph learning methods, dynamic graph transfer learning methods, prompt-based spatio-temporal prediction methods and cross-domain few-shot settings, improving prediction accuracy while reducing required training data and providing interpretable insights. The implementation code is available at https://github.com/afofanah/MCPST.

Sungheon Jeong, Sanggeon Yun, Ryozo Masukawa, Wenjun Haung, Hanning Chen, Mohsen Imani

Main category: cs.LG

TL;DR: A method using internal flow signatures to detect and localize unfaithful generation in LLMs through depthwise dynamics monitoring and targeted refinement.

Details

Motivation: LLMs can generate fluent but unfaithful answers to provided context, and existing safeguards often require external verification or separate judges after generation, lacking internal self-checking mechanisms.

Method: Introduces internal flow signatures that audit decision formation via depthwise dynamics at inter-block boundaries. Uses bias-centered monitoring to stabilize token-wise motion, summarizes trajectories in moving readout-aligned subspaces, aligns neighboring windows via orthogonal transport, and trains a lightweight GRU validator on these signatures for self-checking without modifying the base model.

Result: The method enables detection of unfaithful generation, localizes culprit depth events, and allows targeted refinement by rolling back to culprit tokens and clamping abnormal transported steps while preserving orthogonal residuals.

Conclusion: Provides actionable localization and low-overhead self-checking from internal decision dynamics, offering a pipeline for detecting and refining unfaithful LLM generations without external verification.

Abstract: Large language models can generate fluent answers that are unfaithful to the provided context, while many safeguards rely on external verification or a separate judge after generation. We introduce \emph{internal flow signatures} that audit decision formation from depthwise dynamics at a fixed inter-block monitoring boundary. The method stabilizes token-wise motion via bias-centered monitoring, then summarizes trajectories in compact \emph{moving} readout-aligned subspaces constructed from the top token and its close competitors within each depth window. Neighboring window frames are aligned by an orthogonal transport, yielding depth-comparable transported step lengths, turning angles, and subspace drift summaries that are invariant to within-window basis choices. A lightweight GRU validator trained on these signatures performs self-checking without modifying the base model. Beyond detection, the validator localizes a culprit depth event and enables a targeted refinement: the model rolls back to the culprit token and clamps an abnormal transported step at the identified block while preserving the orthogonal residual. The resulting pipeline provides actionable localization and low-overhead self-checking from internal decision dynamics. \emph{Code is available at} \texttt{github.com/EavnJeong/Internal-Flow-Signatures-for-Self-Checking-and-Refinement-in-LLMs}.

[1273] T-LLM: Teaching Large Language Models to Forecast Time Series via Temporal Distillation

Suhan Guo, Bingxu Wang, Shaodan Zhang, Furao Shen

Main category: cs.LG

TL;DR: T-LLM is a temporal distillation framework that transfers time series forecasting capabilities from a lightweight temporal teacher to general-purpose LLMs, enabling them to perform forecasting without specialized temporal modules at inference.

Details

Motivation: Time series data differs from vision/language data as it's tied to temporal evolution and accumulates only as real-world time progresses. Existing LLM approaches for forecasting rely on representation-level alignment or inference-time temporal modules rather than explicitly teaching forecasting behavior to LLMs.

Method: Proposes T-LLM, a temporal distillation framework where a lightweight temporal teacher (combining trend modeling and frequency-domain analysis) provides structured temporal supervision during training. The teacher is removed entirely at inference, leaving only the LLM as the forecasting model.

Result: T-LLM consistently outperforms existing LLM-based forecasting methods on benchmark datasets and infectious disease forecasting tasks under full-shot, few-shot, and zero-shot settings, while enabling simple and efficient deployment.

Conclusion: The framework successfully equips general-purpose LLMs with time series forecasting capability through temporal distillation, addressing the unique challenges of time-bound data while maintaining deployment efficiency.

Abstract: Time series forecasting plays a critical role in decision-making across many real-world applications. Unlike data in vision and language domains, time series data is inherently tied to the evolution of underlying processes and can only accumulate as real-world time progresses, limiting the effectiveness of scale-driven pretraining alone. This time-bound constraint poses a challenge for enabling large language models (LLMs) to acquire forecasting capability, as existing approaches primarily rely on representation-level alignment or inference-time temporal modules rather than explicitly teaching forecasting behavior to the LLM. We propose T-LLM, a temporal distillation framework that equips general-purpose LLMs with time series forecasting capability by transferring predictive behavior from a lightweight temporal teacher during training. The teacher combines trend modeling and frequency-domain analysis to provide structured temporal supervision, and is removed entirely at inference, leaving the LLM as the sole forecasting model. Experiments on benchmark datasets and infectious disease forecasting tasks demonstrate that T-LLM consistently outperforms existing LLM-based forecasting methods under full-shot, few-shot, and zero-shot settings, while enabling a simple and efficient deployment pipeline.

[1274] Observation-dependent Bayesian active learning via input-warped Gaussian processes

Sanna Jarl, Maria Bånkestad, Jonathan J. S. Scragg, Jens Sjölund

Main category: cs.LG

TL;DR: Bayesian active learning method using input space warping to make exploration sensitive to observed measurements, improving sample efficiency especially for non-stationary functions.

Details

Motivation: Traditional Gaussian process surrogates in Bayesian active learning have posterior variance that depends only on hyperparameters, not actual measurements, making exploration insensitive to observed data. This limits efficiency in exploring unknown function landscapes.

Method: Proposes warping the input space with a learned monotone reparameterization that expands or compresses regions based on observed variability. Uses a novel self-supervised objective (rather than marginal likelihood) to train the warping functions, making variance-based acquisition functions responsive to actual measurements.

Result: The approach improves sample efficiency across various active learning benchmarks, particularly in non-stationary regimes where traditional methods struggle. The self-supervised training objective yields substantially better performance than marginal likelihood training.

Conclusion: Input space warping with learned monotone reparameterization successfully injects observation-dependent feedback into Bayesian active learning, making exploration sensitive to actual measurements and improving performance in challenging non-stationary scenarios.

Abstract: Bayesian active learning relies on the precise quantification of predictive uncertainty to explore unknown function landscapes. While Gaussian process surrogates are the standard for such tasks, an underappreciated fact is that their posterior variance depends on the observed outputs only through the hyperparameters, rendering exploration largely insensitive to the actual measurements. We propose to inject observation-dependent feedback by warping the input space with a learned, monotone reparameterization. This mechanism allows the design policy to expand or compress regions of the input space in response to observed variability, thereby shaping the behavior of variance-based acquisition functions. We demonstrate that while such warps can be trained via marginal likelihood, a novel self-supervised objective yields substantially better performance. Our approach improves sample efficiency across a range of active learning benchmarks, particularly in regimes where non-stationarity challenges traditional methods.

[1275] Data- and Variance-dependent Regret Bounds for Online Tabular MDPs

Mingyi Li, Taira Tsuchiya, Kenji Yamanishi

Main category: cs.LG

TL;DR: Best-of-both-worlds algorithms for tabular MDPs with known transitions that achieve refined data-dependent regret bounds in adversarial regimes and variance-dependent bounds in stochastic regimes.

Details

Motivation: To develop algorithms that perform well in both adversarial and stochastic environments for tabular MDPs, adapting to different complexity measures and achieving near-optimal regret bounds.

Method: Two approaches: global optimization and policy optimization, both using optimistic follow-the-regularized-leader with log-barrier regularization. Global optimization achieves first-order, second-order, and path-length bounds; policy optimization achieves similar adaptivity using a new optimistic Q-function estimator.

Result: Algorithms achieve refined data-dependent regret bounds in adversarial regimes and variance-dependent bounds in stochastic regimes. Global optimization achieves near-optimal bounds, and policy optimization achieves similar adaptivity up to horizon factor.

Conclusion: The paper presents best-of-both-worlds algorithms for tabular MDPs that adapt to various complexity measures, achieving near-optimal regret bounds and establishing lower bounds that validate the upper bounds.

Abstract: This work studies online episodic tabular Markov decision processes (MDPs) with known transitions and develops best-of-both-worlds algorithms that achieve refined data-dependent regret bounds in the adversarial regime and variance-dependent regret bounds in the stochastic regime. We quantify MDP complexity using a first-order quantity and several new data-dependent measures for the adversarial regime, including a second-order quantity and a path-length measure, as well as variance-based measures for the stochastic regime. To adapt to these measures, we develop algorithms based on global optimization and policy optimization, both built on optimistic follow-the-regularized-leader with log-barrier regularization. For global optimization, our algorithms achieve first-order, second-order, and path-length regret bounds in the adversarial regime, and in the stochastic regime, they achieve a variance-aware gap-independent bound and a variance-aware gap-dependent bound that is polylogarithmic in the number of episodes. For policy optimization, our algorithms achieve the same data- and variance-dependent adaptivity, up to a factor of the episode horizon, by exploiting a new optimistic $Q$-function estimator. Finally, we establish regret lower bounds in terms of data-dependent complexity measures for the adversarial regime and a variance measure for the stochastic regime, implying that the regret upper bounds achieved by the global-optimization approach are nearly optimal.

[1276] Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

Wenbo Pan, Zhichao Liu, Xianlong Wang, Haining Yu, Xiaohua Jia

Main category: cs.LG

TL;DR: FlashTrace: Efficient multi-token attribution method for LLMs with span-wise aggregation and recursive attribution through reasoning chains

Details

Motivation: Existing token attribution methods face efficiency bottlenecks (O(M*N) operations) and faithfulness drops when dealing with modern LLMs that rely on extended reasoning chains, where intermediate reasoning tokens absorb attribution mass and prevent importance from propagating back to original inputs.

Method: Introduces FlashTrace with two key innovations: (1) span-wise aggregation to compute attribution over multi-token targets in a single pass, and (2) recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs.

Result: Achieves over 130x speedup over existing baselines on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks while maintaining superior faithfulness. Recursive attribution analysis shows even a single recursive hop improves faithfulness by tracing importance through reasoning chains.

Conclusion: FlashTrace addresses critical challenges in LLM attribution by providing efficient and faithful multi-token attribution that can handle extended reasoning chains, making long-context attribution practical while maintaining interpretability.

Abstract: Token attribution methods provide intuitive explanations for language model outputs by identifying causally important input tokens. However, as modern LLMs increasingly rely on extended reasoning chains, existing schemes face two critical challenges: (1) efficiency bottleneck, where attributing a target span of M tokens within a context of length N requires O(M*N) operations, making long-context attribution prohibitively slow; and (2) faithfulness drop, where intermediate reasoning tokens absorb attribution mass, preventing importance from propagating back to the original input. To address these, we introduce FlashTrace, an efficient multi-token attribution method that employs span-wise aggregation to compute attribution over multi-token targets in a single pass, while maintaining faithfulness. Moreover, we design a recursive attribution mechanism that traces importance through intermediate reasoning chains back to source inputs. Extensive experiments on long-context retrieval (RULER) and multi-step reasoning (MATH, MorehopQA) tasks demonstrate that FlashTrace achieves over 130x speedup over existing baselines while maintaining superior faithfulness. We further analyze the dynamics of recursive attribution, showing that even a single recursive hop improves faithfulness by tracing importance through the reasoning chain.

[1277] Efficient Epistemic Uncertainty Estimation for Large Language Models via Knowledge Distillation

Seonghyeon Park, Jewon Yeom, Jaewon Sok, Jeongjae Park, Heejun Kim, Taesup Kim

Main category: cs.LG

TL;DR: A framework using small draft models to efficiently estimate token-level epistemic uncertainty in LLMs, avoiding costly deep ensembles, with theoretical grounding in bias-variance decomposition and practical strategies for accuracy.

Details

Motivation: Estimating epistemic uncertainty in LLMs is crucial for mitigating hallucinations and enabling risk-aware deployment, but traditional deep ensemble methods are computationally prohibitive at modern LLM scales.

Method: Proposes using small draft models to estimate token-level epistemic uncertainty via Jensen-Shannon divergence among drafts (variance proxy) and KL divergence between draft mixture and target (bias proxy). Introduces Online Stochastic Distillation (OSD) for efficient target aggregation and Data-Diverse Drafts (DDD) strategy to enhance draft diversity.

Result: Reduces estimation error (RMSE) by up to 37% compared to baselines on GSM8K. Achieves hallucination detection performance competitive with heavy perturbation-based methods like TokUR with negligible inference costs.

Conclusion: Provides a practical, efficient solution for uncertainty-aware LLM deployment that avoids the computational burden of traditional ensemble methods while maintaining competitive performance for hallucination detection.

Abstract: Quantifying uncertainty in Large Language Models (LLMs) is essential for mitigating hallucinations and enabling risk-aware deployment in safety-critical tasks. However, estimating Epistemic Uncertainty(EU) via Deep Ensembles is computationally prohibitive at the scale of modern models. We propose a framework that leverages the small draft models to efficiently estimate token-level EU, bypassing the need for full-scale ensembling. Theoretically grounded in a Bias-Variance Decomposition, our approach approximates EU via Jensen-Shannon divergence among drafts (variance proxy) and KL divergence between the draft mixture and the target (bias proxy). To further ensure accuracy without significant overhead, we introduce Online Stochastic Distillation (OSD) to efficiently approximate target aggregation and the Data-Diverse Drafts (DDD) strategy to enhance draft diversity for better target approximation. Extensive experiments on GSM8K demonstrate that our method reduces the estimation error (RMSE) by up to 37% compared to baselines. Crucially, our approach achieves Hallucination Detection performance competitive with heavy perturbation-based methods like TokUR while incurring negligible inference costs, offering a practical solution for uncertainty-aware LLM deployment.

[1278] Embedding Learning on Multiplex Networks for Link Prediction

Orell Trautmann, Olaf Wolkenhauer, Clémence Réda

Main category: cs.LG

TL;DR: A comprehensive review of embedding learning models for multiplex networks specifically for link prediction tasks, covering taxonomies, evaluation methodologies, and addressing fairness in evaluation procedures.

Details

Motivation: Network embedding learning has shown strong results for link prediction in complex systems, but as network complexity grows (multiple connection types in multiplex networks), embedding learning becomes increasingly challenging. There's a need to systematically review and classify existing models, address reproducibility and fairness in evaluation, and provide guidelines for proper assessment.

Method: The paper conducts a systematic review of published embedding learning models for multiplex networks. It proposes refined taxonomies to classify models based on embedding types and techniques, reviews evaluation methodologies, and introduces a novel fair testing procedure specifically for directed multiplex networks.

Result: Provides comprehensive taxonomies for classifying multiplex network embedding models, identifies issues with current evaluation practices, and proposes solutions for fair evaluation including a new testing procedure for directed multiplex networks.

Conclusion: This review represents a crucial step toward developing more performant and tractable embedding learning approaches for multiplex networks with fair evaluation for link prediction tasks. It provides guidelines for model evaluation and offers perspective on current challenges and available tools for downstream analyses.

Abstract: Over the past years, embedding learning on networks has shown tremendous results in link prediction tasks for complex systems, with a wide range of real-life applications. Learning a representation for each node in a knowledge graph allows us to capture topological and semantic information, which can be processed in downstream analyses later. In the link prediction task, high-dimensional network information is encoded into low-dimensional vectors, which are then fed to a predictor to infer new connections between nodes in the network. As the network complexity (that is, the numbers of connections and types of interactions) grows, embedding learning turns out increasingly challenging. This review covers published models on embedding learning on multiplex networks for link prediction. First, we propose refined taxonomies to classify and compare models, depending on the type of embeddings and embedding techniques. Second, we review and address the problem of reproducible and fair evaluation of embedding learning on multiplex networks for the link prediction task. Finally, we tackle evaluation on directed multiplex networks by proposing a novel and fair testing procedure. This review constitutes a crucial step towards the development of more performant and tractable embedding learning approaches for multiplex networks and their fair evaluation for the link prediction task. We also suggest guidelines on the evaluation of models, and provide an informed perspective on the challenges and tools currently available to address downstream analyses applied to multiplex networks.

[1279] Zero-Shot Off-Policy Learning

Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac

Main category: cs.LG

TL;DR: Novel method connects successor measures to stationary density ratios for off-policy learning in zero-shot RL, enabling optimal policy adaptation without training

Details

Motivation: Address challenges in off-policy learning (distributional shift, value overestimation) and zero-shot RL where agents must adapt to new tasks without additional training

Method: Discover theoretical connection between successor measures and stationary density ratios, enabling inference of optimal importance sampling ratios for stationary distribution correction

Result: Benchmarked on motion tracking (SMPL Humanoid), continuous control (ExoRL), and long-horizon OGBench tasks; integrates with forward-backward representation frameworks

Conclusion: Bridges off-policy learning and zero-shot adaptation, enabling fast task adaptation without training and benefiting both research areas

Abstract: Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.

[1280] Boundary-Constrained Diffusion Models for Floorplan Generation: Balancing Realism and Diversity

Leonardo Stoppani, Davide Bacciu, Shahab Mokarizadeh

Main category: cs.LG

TL;DR: Proposes Diversity Score metric and Boundary Cross-Attention module for diffusion-based floorplan generation to address diversity collapse and improve geometric consistency.

Details

Motivation: Current diffusion models for floorplan generation optimize for perceptual metrics like FID, which leads to limited design diversity. There's a need to balance realism with diversity and improve geometric consistency with building boundaries.

Method: 1) Introduces Diversity Score (DS) metric to quantify layout diversity under fixed constraints; 2) Proposes Boundary Cross-Attention (BCA) module for conditioning on building boundaries to improve geometric consistency.

Result: BCA significantly improves boundary adherence. Prolonged training causes diversity collapse undetected by FID, revealing trade-off between realism and diversity. Models show reliance on dataset priors in OOD evaluations.

Conclusion: Generative systems for architectural design need to explicitly balance fidelity, diversity, and generalization. The proposed DS metric and BCA module help address these challenges in floorplan generation.

Abstract: Diffusion models have become widely popular for automated floorplan generation, producing highly realistic layouts conditioned on user-defined constraints. However, optimizing for perceptual metrics such as the Fréchet Inception Distance (FID) causes limited design diversity. To address this, we propose the Diversity Score (DS), a metric that quantifies layout diversity under fixed constraints. Moreover, to improve geometric consistency, we introduce a Boundary Cross-Attention (BCA) module that enables conditioning on building boundaries. Our experiments show that BCA significantly improves boundary adherence, while prolonged training drives diversity collapse undiagnosed by FID, revealing a critical trade-off between realism and diversity. Out-Of-Distribution evaluations further demonstrate the models’ reliance on dataset priors, emphasizing the need for generative systems that explicitly balance fidelity, diversity, and generalization in architectural design tasks.

[1281] Bayesian Integration of Nonlinear Incomplete Clinical Data

Lucía González-Zamorano, Nuria Balbás-Esteban, Vanessa Gómez-Verdejo, Albert Belenguer-Llorens, Carlos Sevilla-Salcedo

Main category: cs.LG

TL;DR: BIONIC is a Bayesian framework for integrating heterogeneous multimodal clinical data with structured missingness, using pretrained embeddings for complex modalities and enabling robust learning in partially observed settings.

Details

Motivation: Multimodal clinical data presents challenges due to high dimensionality, heterogeneous representations, and structured missingness, making predictive modeling, data integration, and interpretability difficult.

Method: A unified probabilistic framework using Bayesian integration of nonlinear incomplete clinical data, with pretrained embeddings for medical images and clinical text, structured clinical variables in Bayesian multimodal formulation, and explicit modeling of modality-level and variable-level missingness.

Result: Demonstrated strong and consistent discriminative performance on three multimodal clinical/biomedical datasets compared to baselines, particularly under incomplete data scenarios.

Conclusion: BIONIC provides robust multimodal integration with intrinsic interpretability through latent structure, enabling population-level analysis of modality relevance and clinically meaningful insights.

Abstract: Multimodal clinical data are characterized by high dimensionality, heterogeneous representations, and structured missingness, posing significant challenges for predictive modeling, data integration, and interpretability. We propose BIONIC (Bayesian Integration of Nonlinear Incomplete Clinical data), a unified probabilistic framework that integrates heterogeneous multimodal data under missingness through a joint generative-discriminative latent architecture. BIONIC uses pretrained embeddings for complex modalities such as medical images and clinical text, while incorporating structured clinical variables directly within a Bayesian multimodal formulation. The proposed framework enables robust learning in partially observed and semi-supervised settings by explicitly modeling modality-level and variable-level missingness, as well as missing labels. We evaluate BIONIC on three multimodal clinical and biomedical datasets, demonstrating strong and consistent discriminative performance compared to representative multimodal baselines, particularly under incomplete data scenarios. Beyond predictive accuracy, BIONIC provides intrinsic interpretability through its latent structure, enabling population-level analysis of modality relevance and supporting clinically meaningful insight.

[1282] IntraSlice: Towards High-Performance Structural Pruning with Block-Intra PCA for LLMs

Meng Li, Peisong Wang, Yuantian Shao, Qinghao Hu, Hongjian Fang, Yifan Zhang, Zhihui Wei, Jian Cheng

Main category: cs.LG

TL;DR: IntraSlice: A framework for efficient LLM pruning using block-wise module-intra PCA compression that preserves performance while accelerating inference.

Details

Motivation: Large Language Models face deployment challenges due to massive size. Structured pruning accelerates models but causes performance degradation. Existing PCA-based methods only work between modules, introducing extra parameters and disrupting activation distributions due to residual connections.

Method: Proposes IntraSlice framework with block-wise module-intra PCA compression pruning. Leverages Transformer structural characteristics to design approximate PCA method with fully fusible transformation matrices (no extra parameters). Introduces PCA-based global pruning ratio estimator that considers compressed activation distributions beyond conventional module importance.

Result: Validated on Llama2, Llama3, and Phi series across various language benchmarks. Achieves superior compression performance compared to recent baselines at same compression ratio or inference speed.

Conclusion: IntraSlice provides an effective framework for LLM compression that maintains performance while enabling efficient deployment through improved PCA-based pruning techniques.

Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks but face deployment challenges due to their massive size. Structured pruning offers acceleration benefits but leads to significant performance degradation. Recent PCA-based pruning methods have alleviated this issue by retaining key activation components, but are only applied between modules in order to fuse the transformation matrix, which introduces extra parameters and severely disrupts activation distributions due to residual connections. To address these issues, we propose IntraSlice, a framework that applies block-wise module-intra PCA compression pruning. By leveraging the structural characteristics of Transformer modules, we design an approximate PCA method whose transformation matrices can be fully fused into the model without additional parameters. We also introduce a PCA-based global pruning ratio estimator that further considers the distribution of compressed activations, building on conventional module importance. We validate our method on Llama2, Llama3, and Phi series across various language benchmarks. Experimental results demonstrate that our approach achieves superior compression performance compared to recent baselines at the same compression ratio or inference speed.

[1283] FlyPrompt: Brain-Inspired Random-Expanded Routing with Temporal-Ensemble Experts for General Continual Learning

Hongwei Yan, Guanglong Sun, Kanglei Zhou, Qian Li, Liyuan Wang, Yi Zhong

Main category: cs.LG

TL;DR: FlyPrompt is a brain-inspired continual learning framework that uses sparse expansion and modular ensembles to handle general continual learning without clear task boundaries, achieving significant performance gains on image datasets.

Details

Motivation: Current continual parameter-efficient tuning methods rely on multiple training epochs and explicit task cues, limiting their effectiveness in general continual learning scenarios with single-pass, non-stationary data streams. Existing methods lack targeted design for two key challenges: allocating expert parameters to evolving data distributions and improving representational capacity under limited supervision.

Method: Inspired by the fruit fly’s hierarchical memory system, FlyPrompt decomposes GCL into expert routing and expert competence improvement. It uses a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time.

Result: FlyPrompt achieves up to 11.23%, 12.43%, and 7.62% performance gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200 datasets respectively.

Conclusion: The brain-inspired FlyPrompt framework effectively addresses fundamental challenges in continual parameter-efficient tuning for general continual learning scenarios, demonstrating superior performance through its hierarchical memory-inspired architecture.

Abstract: General continual learning (GCL) challenges intelligent systems to learn from single-pass, non-stationary data streams without clear task boundaries. While recent advances in continual parameter-efficient tuning (PET) of pretrained models show promise, they typically rely on multiple training epochs and explicit task cues, limiting their effectiveness in GCL scenarios. Moreover, existing methods often lack targeted design and fail to address two fundamental challenges in continual PET: how to allocate expert parameters to evolving data distributions, and how to improve their representational capacity under limited supervision. Inspired by the fruit fly’s hierarchical memory system characterized by sparse expansion and modular ensembles, we propose FlyPrompt, a brain-inspired framework that decomposes GCL into two subproblems: expert routing and expert competence improvement. FlyPrompt introduces a randomly expanded analytic router for instance-level expert activation and a temporal ensemble of output heads to dynamically adapt decision boundaries over time. Extensive theoretical and empirical evaluations demonstrate FlyPrompt’s superior performance, achieving up to 11.23%, 12.43%, and 7.62% gains over state-of-the-art baselines on CIFAR-100, ImageNet-R, and CUB-200, respectively. Our source code is available at https://github.com/AnAppleCore/FlyGCL.

[1284] Deep Multivariate Models with Parametric Conditionals

Dmitrij Schlesinger, Boris Flach, Alexander Shekhovtsov

Main category: cs.LG

TL;DR: Proposes a deep multivariate model for heterogeneous data using conditional probability distributions for each variable group, enabling flexible downstream tasks via Markov chain kernel training.

Details

Motivation: Existing models for heterogeneous collections (images, segmentations, attributes, latent variables) are designed for specific tasks, limiting their applicability to other downstream tasks. Need a more flexible approach.

Method: Represent joint probability distribution using conditional probability distributions for each variable group conditioned on the rest. Train parametrized Markov chain kernel by maximizing data likelihood of its limiting distribution.

Result: Enables models to be used for practically any possible downstream task and supports wide range of semi-supervised learning scenarios.

Conclusion: Proposed approach provides more flexible and generalizable modeling framework for heterogeneous data collections compared to task-specific designs.

Abstract: We consider deep multivariate models for heterogeneous collections of random variables. In the context of computer vision, such collections may e.g. consist of images, segmentations, image attributes, and latent variables. When developing such models, most existing works start from an application task and design the model components and their dependencies to meet the needs of the chosen task. This has the disadvantage of limiting the applicability of the resulting model for other downstream tasks. Here, instead, we propose to represent the joint probability distribution by means of conditional probability distributions for each group of variables conditioned on the rest. Such models can then be used for practically any possible downstream task. Their learning can be approached as training a parametrised Markov chain kernel by maximising the data likelihood of its limiting distribution. This has the additional advantage of allowing a wide range of semi-supervised learning scenarios.

[1285] SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning

Zhen-Hao Xie, Jun-Tao Tang, Yu-Cheng Shi, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

Main category: cs.LG

TL;DR: SAME addresses router and expert drift in multimodal continual instruction tuning by stabilizing expert selection through orthogonal subspace decomposition and regulating expert updates via curvature-aware scaling.

Details

Motivation: Multimodal LLMs need continual instruction tuning for real-world deployment, but current sparse expert routing methods suffer from router drift (inconsistent expert selection over time) and expert drift (shared experts being overwritten by new tasks), which degrade performance.

Method: Proposes StAbilized Mixture-of-Experts (SAME) with three key components: 1) Stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions, 2) Regulates expert updates via curvature-aware scaling using historical input covariance without rehearsal, and 3) Introduces adaptive expert activation to freeze selected experts during training to reduce computation and cross-task interference.

Result: Extensive experiments demonstrate state-of-the-art performance in multimodal continual instruction tuning benchmarks.

Conclusion: SAME effectively addresses router and expert drift problems in multimodal continual instruction tuning, enabling more stable and efficient expansion of MLLM capabilities over time.

Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually expand their capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. Recent methods leverage sparse expert routing to promote task specialization, but we find that the expert routing process suffers from drift as the data distribution evolves. For example, a grounding query that previously activated localization experts may instead be routed to irrelevant experts after learning OCR tasks. Meanwhile, the grounding-related experts can be overwritten by new tasks and lose their original functionality. Such failure reflects two problems: router drift, where expert selection becomes inconsistent over time, and expert drift, where shared experts are overwritten across tasks. Therefore, we propose StAbilized Mixture-of-Experts (SAME) for MCIT. To address router drift, SAME stabilizes expert selection by decomposing routing dynamics into orthogonal subspaces and updating only task-relevant directions. To mitigate expert drift, we regulate expert updates via curvature-aware scaling using historical input covariance in a rehearsal-free manner. SAME also introduces adaptive expert activation to freeze selected experts during training, reducing redundant computation and cross-task interference. Extensive experiments demonstrate its SOTA performance.

[1286] An Empirical Study of World Model Quantization

Zhongqian Fu, Tianyi Zhao, Kai Han, Hang Zhou, Xinghao Chen, Yunhe Wang

Main category: cs.LG

TL;DR: Systematic study of post-training quantization effects on world models using DINO-WM, revealing unique failure modes in visual planning tasks beyond standard accuracy-bitwidth tradeoffs.

Details

Motivation: World models enable efficient planning but require heavy computational resources. While quantization is essential for deployment, the effects of post-training quantization on world models remain largely unexamined, particularly for visual planning tasks.

Method: Empirical study using DINO-WM as representative case, evaluating diverse PTQ methods under weight-only and joint weight-activation settings across various bit-widths, quantization granularities, and planning horizons up to 50 iterations.

Result: Quantization effects in world models show unique patterns: group-wise weight quantization stabilizes low-bit rollouts, activation quantization granularity yields inconsistent benefits, encoder and predictor modules show asymmetric sensitivity, and aggressive quantization degrades planning-task success alignment.

Conclusion: World model quantization reveals distinct failure modes in planning tasks that differ from standard quantization effects, providing practical guidance for deploying quantized world models under computational constraints.

Abstract: World models learn an internal representation of environment dynamics, enabling agents to simulate and reason about future states within a compact latent space for tasks such as planning, prediction, and inference. However, running world models rely on hevay computational cost and memory footprint, making model quantization essential for efficient deployment. To date, the effects of post-training quantization (PTQ) on world models remain largely unexamined. In this work, we present a systematic empirical study of world model quantization using DINO-WM as a representative case, evaluating diverse PTQ methods under both weight-only and joint weight-activation settings. We conduct extensive experiments on different visual planning tasks across a wide range of bit-widths, quantization granularities, and planning horizons up to 50 iterations. Our results show that quantization effects in world models extend beyond standard accuracy and bit-width trade-offs: group-wise weight quantization can stabilize low-bit rollouts, activation quantization granularity yields inconsistent benefits, and quantization sensitivity is highly asymmetric between encoder and predictor modules. Moreover, aggressive low-bit quantization significantly degrades the alignment between the planning objective and task success, leading to failures that cannot be remedied by additional optimization. These findings reveal distinct quantization-induced failure modes in world model-based planning and provide practical guidance for deploying quantized world models under strict computational constraints. The code will be available at https://github.com/huawei-noah/noah-research/tree/master/QuantWM.

[1287] Grounding Generated Videos in Feasible Plans via World Models

Christos Ziakas, Amir Bar, Alessandra Russo

Main category: cs.LG

TL;DR: GVP-WM grounds video-generated plans into feasible action sequences using world models and latent-space trajectory optimization to ensure physical constraints and temporal consistency.

Details

Motivation: Video generative models can serve as zero-shot visual planners but often produce plans that violate temporal consistency and physical constraints, making them fail when mapped to executable actions.

Method: Proposes GVP-WM: generates video plan from initial/goal observations, then projects video guidance onto feasible latent trajectories via video-guided latent collocation. Formulates grounding as goal-conditioned latent-space trajectory optimization that jointly optimizes latent states and actions under world-model dynamics while preserving semantic alignment with video plan.

Result: GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video-generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.

Conclusion: The method successfully grounds video-generated plans into executable action sequences by leveraging world models and trajectory optimization, addressing key limitations of current video generative models for planning.

Abstract: Large-scale video generative models have shown emerging capabilities as zero-shot visual planners, yet video-generated plans often violate temporal consistency and physical constraints, leading to failures when mapped to executable actions. To address this, we propose Grounding Video Plans with World Models (GVP-WM), a planning method that grounds video-generated plans into feasible action sequences using a learned action-conditioned world model. At test-time, GVP-WM first generates a video plan from initial and goal observations, then projects the video guidance onto the manifold of dynamically feasible latent trajectories via video-guided latent collocation. In particular, we formulate grounding as a goal-conditioned latent-space trajectory optimization problem that jointly optimizes latent states and actions under world-model dynamics, while preserving semantic alignment with the video-generated plan. Empirically, GVP-WM recovers feasible long-horizon plans from zero-shot image-to-video-generated and motion-blurred videos that violate physical constraints, across navigation and manipulation simulation tasks.

[1288] Optimizing Tensor Train Decomposition in DNNs for RISC-V Architectures Using Design Space Exploration and Compiler Optimizations

Theologos Anthimopoulos, Milad Kokhazadeh, Vasilios Kelefouras, Benjamin Himpel, Georgios Keramidas

Main category: cs.LG

TL;DR: An end-to-end methodology for optimizing fully connected layers in DNNs on RISC-V processors using Tensor Train Decomposition, achieving 3-8x speedup over existing methods.

Details

Motivation: Deploying DNNs on resource-constrained RISC-V devices is challenging due to high computational and memory demands of fully connected layers. Low-rank factorization can compress these layers but involves complex trade-offs in design space exploration.

Method: Uses Tensor Train Decomposition (TTD) from TensorFlow T3F library to prune inefficient decomposition shapes and poor-performing solutions for RISC-V architectures. Applies compiler optimizations to enhance custom T3F layer performance.

Result: TT-decomposed layers run 3x faster than IREE and 8x faster than Pluto on the same compressed model, providing efficient DNN deployment on RISC-V edge devices.

Conclusion: The proposed methodology offers an efficient solution for deploying DNNs on resource-constrained RISC-V platforms by optimizing fully connected layers through systematic design space exploration and compiler optimizations.

Abstract: Deep neural networks (DNNs) have become indispensable in many real-life applications like natural language processing, and autonomous systems. However, deploying DNNs on resource-constrained devices, e.g., in RISC-V platforms, remains challenging due to the high computational and memory demands of fully connected (FC) layers, which dominate resource consumption. Low-rank factorization (LRF) offers an effective approach to compressing FC layers, but the vast design space of LRF solutions involves complex trade-offs among FLOPs, memory size, inference time, and accuracy, making the LRF process complex and time-consuming. This paper introduces an end-to-end LRF design space exploration methodology and a specialized design tool for optimizing FC layers on RISC-V processors. Using Tensor Train Decomposition (TTD) offered by TensorFlow T3F library, the proposed work prunes the LRF design space by excluding first, inefficient decomposition shapes and second, solutions with poor inference performance on RISC-V architectures. Compiler optimizations are then applied to enhance custom T3F layer performance, minimizing inference time and boosting computational efficiency. On average, our TT-decomposed layers run 3x faster than IREE and 8x faster than Pluto on the same compressed model. This work provides an efficient solution for deploying DNNs on edge and embedded devices powered by RISC-V architectures.

[1289] Self-Consolidation for Self-Evolving Agents

Hongzhuo Yu, Fei Zhu, Guo-Sen Xie, Ling Shao

Main category: cs.LG

TL;DR: A self-evolving LLM agent framework that learns from both successes and failures through contrastive reflection and parameter consolidation, enabling lifelong improvement without context window limitations.

Details

Motivation: Current LLM agents are static and rely on retrieving successful trajectories, which overlooks valuable learning from failures and faces context window limitations when accumulating textual experiences.

Method: Proposes two mechanisms: 1) contrastive reflection to summarize error-prone patterns and capture reusable insights from both successes and failures, and 2) self-consolidation that distills textual experience into compact learnable parameters to internalize historical experience.

Result: Extensive experiments demonstrate advantages in long-term agent evolution, showing improved learning from both successes and failures while overcoming context window limitations.

Conclusion: The framework enables LLM agents to evolve through lifelong interaction by learning from failures and consolidating experience into parameters, addressing key limitations of current static agent systems.

Abstract: While large language model (LLM) agents have demonstrated impressive problem-solving capabilities, they typically operate as static systems, lacking the ability to evolve through lifelong interaction. Existing attempts to bridge this gap primarily rely on retrieving successful past trajectories as demonstrations. However, this paradigm faces two critical limitations. First, by focusing solely on success, agents overlook the rich pedagogical value embedded in failed attempts, preventing them from identifying and avoiding recurrent pitfalls. Second, continually accumulating textual experiences not only increases the time consumption during retrieval but also inevitably introduces noise and exhausts the largest context window of current LLMs. To address these challenges, we propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism: First, a contrastive reflection strategy is introduced to explicitly summarize error-prone patterns and capture reusable insights. Second, we propose a self-consolidation mechanism that distills non-parametric textual experience into compact learnable parameters. This enables the agent to internalize extensive historical experience directly into its latent space. Extensive experiments demonstrate the advantages of our method in long-term agent evolution.

[1290] On the Limits of Layer Pruning for Generative Reasoning in LLMs

Safal Shrestha, Anubhav Shrestha, Aadim Nepal, Minwu Kim, Keith Ross

Main category: cs.LG

TL;DR: Layer pruning for LLMs retains classification performance but severely degrades generative reasoning tasks; supervised finetuning with self-generated responses helps but recovery remains limited, especially at high pruning ratios.

Details

Motivation: Existing layer pruning techniques for compressing large language models work well for classification tasks but suffer severe degradation on generative reasoning tasks, particularly those requiring multi-step reasoning like mathematical computation and code synthesis.

Method: Systematic study across multiple model families to analyze depth reduction effects, followed by evaluation of a mitigation strategy using supervised finetuning with Self-Generated Responses under post-training constraints without pretraining-scale data/compute.

Result: The approach achieves strong recovery on classification tasks (up to 90% baseline performance) and substantial gains (20-30 percentage points) on generative benchmarks compared to prior techniques, but recovery for generative reasoning remains fundamentally limited and viable mainly at lower pruning ratios.

Conclusion: Layer pruning has practical limits for generative reasoning; depth reduction can be effective under constrained post-training regimes primarily at lower pruning ratios, with classification tasks being more resilient than generative reasoning tasks.

Abstract: Recent works have shown that layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. However, existing pruning techniques often suffer severe degradation on generative reasoning tasks. Through a systematic study across multiple model families, we find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Beyond surface-level text degeneration, we observe degradation of critical algorithmic capabilities, including arithmetic computation for mathematical reasoning and balanced parenthesis generation for code synthesis. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a simple mitigation strategy based on supervised finetuning with Self-Generated Responses. This approach achieves strong recovery on classification tasks, retaining up to 90% of baseline performance, and yields substantial gains of up to 20–30 percentage points on generative benchmarks compared to prior post-pruning techniques. Crucially, despite these gains, recovery for generative reasoning remains fundamentally limited relative to classification tasks and is viable primarily at lower pruning ratios. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction can be applied effectively under constrained post-training regimes.

[1291] Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs

Yoonjun Cho, Dongjae Jeon, Soeun Kim, Moongyu Jeon, Albert No

Main category: cs.LG

TL;DR: SRR improves PTQ by preserving top-k singular subspace of weights before quantization, using remaining rank for error reconstruction, and enabling stable QPEFT.

Details

Motivation: Existing quantization error reconstruction methods use full rank budget for error correction, which is suboptimal when weights have intrinsic low-rank structure and quantization corrupts dominant directions.

Method: Proposes Structured Residual Reconstruction (SRR) framework that: 1) preserves top-k singular subspace of activation-scaled weights before quantization, 2) quantizes only the residual, 3) uses remaining rank (r-k) for error reconstruction, with theory-guided criterion for selecting k.

Result: Consistent perplexity reductions across diverse models and quantization settings in PTQ, and 5.9 percentage-point average gain on GLUE under 2-bit QPEFT.

Conclusion: SRR provides optimal rank allocation for quantization error reconstruction, improves PTQ performance, and enables stable quantized parameter-efficient fine-tuning through gradient scaling along preserved directions.

Abstract: Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$, using a rank-$r$ correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when $\mathbf{W}$ has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-$k$ singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank $r-k$ for error reconstruction. We derive a theory-guided criterion for selecting $k$ by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that resulting $\mathbf{Q} + \mathbf{L}\mathbf{R}$ parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT.

[1292] FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification

Nan Qiao, Sheng Yue

Main category: cs.LG

TL;DR: FORLER: Federated Offline Reinforcement Learning with Ensemble Aggregation and Actor Rectification to address policy pollution in heterogeneous data environments

Details

Motivation: Offline federated reinforcement learning (FRL) faces challenges with low-quality, heterogeneous data across devices, leading to policy pollution where one device's suboptimal policy can degrade the aggregated model. Real environment interaction is risky/costly, motivating offline approaches.

Method: Combines Q-ensemble aggregation on server (robust merging of device Q-functions) with actor rectification on devices (enriches policy gradients via zeroth-order search for high-Q actions plus custom regularizer). Uses δ-periodic strategy to reduce local computation.

Result: Extensive experiments show FORLER consistently outperforms strong baselines under varying data quality and heterogeneity. Provides theoretical safe policy improvement performance guarantees.

Conclusion: FORLER effectively addresses policy pollution in offline FRL, enabling robust learning from heterogeneous datasets while maintaining privacy and reducing computational burden on resource-constrained devices.

Abstract: In Internet-of-Things systems, federated learning has advanced online reinforcement learning (RL) by enabling parallel policy training without sharing raw data. However, interacting with real environments online can be risky and costly, motivating offline federated RL (FRL), where local devices learn from fixed datasets. Despite its promise, offline FRL may break down under low-quality, heterogeneous data. Offline RL tends to get stuck in local optima, and in FRL, one device’s suboptimal policy can degrade the aggregated model, i.e., policy pollution. We present FORLER, combining Q-ensemble aggregation on the server with actor rectification on devices. The server robustly merges device Q-functions to curb policy pollution and shift heavy computation off resource-constrained hardware without compromising privacy. Locally, actor rectification enriches policy gradients via a zeroth-order search for high-Q actions plus a bespoke regularizer that nudges the policy toward them. A $δ$-periodic strategy further reduces local computation. We theoretically provide safe policy improvement performance guarantees. Extensive experiments show FORLER consistently outperforms strong baselines under varying data quality and heterogeneity.

[1293] Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

Hamza Adnan, Matthew T. Jackson, Alexey Zakharov

Main category: cs.LG

TL;DR: MaskLAM improves Latent Action Models by using segmentation masks to filter action-correlated background noise, achieving better reinforcement learning performance from unlabeled videos.

Details

Motivation: Latent Action Models (LAMs) learn action representations from raw observations but struggle with action-correlated background noise that causes spurious correlations and suboptimal latent spaces.

Method: MaskLAM incorporates visual agent segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, prioritizing salient information over background elements without architectural changes.

Result: On MuJoCo tasks with action-correlated background noise, MaskLAM yields up to 4x increase in accrued rewards and 3x improvement in latent action quality compared to baselines.

Conclusion: Using segmentation masks to filter background noise significantly improves LAM performance, enabling better reinforcement learning from unlabeled videos by focusing on action-relevant features.

Abstract: Latent Action Models (LAMs) learn to extract action-relevant representations solely from raw observations, enabling reinforcement learning from unlabelled videos and significantly scaling available training data. However, LAMs face a critical challenge in disentangling action-relevant features from action-correlated noise (e.g., background motion). Failing to filter these distractors causes LAMs to capture spurious correlations and build sub-optimal latent action spaces. In this paper, we introduce MaskLAM – a lightweight modification to LAM training to mitigate this issue by incorporating visual agent segmentation. MaskLAM utilises segmentation masks from pretrained foundation models to weight the LAM reconstruction loss, thereby prioritising salient information over background elements while requiring no architectural modifications. We demonstrate the effectiveness of our method on continuous-control MuJoCo tasks, modified with action-correlated background noise. Our approach yields up to a 4x increase in accrued rewards compared to standard baselines and a 3x improvement in the latent action quality, as evidenced by linear probe evaluation.

[1294] Logic-Guided Vector Fields for Constrained Generative Modeling

Ali Baheri

Main category: cs.LG

TL;DR: LGVF is a neuro-symbolic framework that injects symbolic logic constraints into flow matching generative models through training-time logic loss and inference-time gradient steering, reducing constraint violations by 59-82% compared to standard flow matching.

Details

Motivation: Current generative models lack mechanisms to enforce declarative constraints at generation time, limiting their ability to incorporate symbolic knowledge. The authors aim to combine the expressive structure of symbolic logic with neural learning flexibility.

Method: LGVF uses two complementary mechanisms: (1) training-time logic loss that penalizes constraint violations along continuous flow trajectories with emphasis near target distribution, and (2) inference-time adjustment that steers sampling using constraint gradients as logic-informed correction to learned dynamics.

Result: LGVF reduces constraint violations by 59-82% compared to standard flow matching across three case studies (linear, nonlinear, and multi-region feasibility constraints). In linear and ring settings, it improves distributional fidelity (MMD), while in multi-obstacle setting shows a satisfaction-fidelity trade-off with improved feasibility but increased MMD.

Conclusion: LGVF successfully integrates symbolic constraints into generative models, enabling constraint-aware generation with emergent behaviors like obstacle avoidance without explicit path planning, demonstrating effective neuro-symbolic combination.

Abstract: Neuro-symbolic systems aim to combine the expressive structure of symbolic logic with the flexibility of neural learning; yet, generative models typically lack mechanisms to enforce declarative constraints at generation time. We propose Logic-Guided Vector Fields (LGVF), a neuro-symbolic framework that injects symbolic knowledge, specified as differentiable relaxations of logical constraints, into flow matching generative models. LGVF couples two complementary mechanisms: (1) a training-time logic loss that penalizes constraint violations along continuous flow trajectories, with weights that emphasize correctness near the target distribution; and (2) an inference-time adjustment that steers sampling using constraint gradients, acting as a lightweight, logic-informed correction to the learned dynamics. We evaluate LGVF on three constrained generation case studies spanning linear, nonlinear, and multi-region feasibility constraints. Across all settings, LGVF reduces constraint violations by 59-82% compared to standard flow matching and achieves the lowest violation rates in each case. In the linear and ring settings, LGVF also improves distributional fidelity as measured by MMD, while in the multi-obstacle setting, we observe a satisfaction-fidelity trade-off, with improved feasibility but increased MMD. Beyond quantitative gains, LGVF yields constraint-aware vector fields exhibiting emergent obstacle-avoidance behavior, routing samples around forbidden regions without explicit path planning.

[1295] FiLoRA: Focus-and-Ignore LoRA for Controllable Feature Reliance

Hyunsuk Chung, Caren Han, Yerin Choi, Seungyeon Ji, Jinwoo Kim, Eun-Jung Holden, Kyungreem Han

Main category: cs.LG

TL;DR: FiLoRA is a parameter-efficient adaptation framework that uses instruction-conditioned gating to explicitly control which internal feature groups multimodal foundation models rely on for predictions, without changing the task semantics.

Details

Motivation: Current multimodal foundation models lack understanding of how their predictions depend on specific internal feature groups, and existing methods for analyzing shortcut/spurious behavior offer limited control over feature reliance without altering task semantics.

Method: FiLoRA decomposes adaptation into feature group-aligned LoRA modules with instruction-conditioned gating, allowing natural language instructions to act as computation-level control signals that selectively amplify or suppress core/spurious feature groups.

Result: Across text-image and audio-visual benchmarks, FiLoRA induces consistent and causal shifts in internal computation, improves robustness under spurious feature interventions, and reveals mechanisms to regulate feature reliance beyond correlation-driven learning.

Conclusion: FiLoRA provides a principled framework for explicit control over internal feature reliance in multimodal models while keeping predictive objectives fixed, enabling better understanding and regulation of model behavior.

Abstract: Multimodal foundation models integrate heterogeneous signals across modalities, yet it remains poorly understood how their predictions depend on specific internal feature groups and whether such reliance can be deliberately controlled. Existing studies of shortcut and spurious behavior largely rely on post hoc analyses or feature removal, offering limited insight into whether reliance can be modulated without altering task semantics. We introduce FiLoRA (Focus-and-Ignore LoRA), an instruction-conditioned, parameter-efficient adaptation framework that enables explicit control over internal feature reliance while keeping the predictive objective fixed. FiLoRA decomposes adaptation into feature group-aligned LoRA modules and applies instruction-conditioned gating, allowing natural language instructions to act as computation-level control signals rather than task redefinitions. Across text–image and audio–visual benchmarks, we show that instruction-conditioned gating induces consistent and causal shifts in internal computation, selectively amplifying or suppressing core and spurious feature groups without modifying the label space or training objective. Further analyses demonstrate that FiLoRA yields improved robustness under spurious feature interventions, revealing a principled mechanism to regulate reliance beyond correlation-driven learning.

[1296] SNAP: A Self-Consistent Agreement Principle with Application to Robust Computation

Xiaoyi Jiang, Andreas Nienkötter

Main category: cs.LG

TL;DR: SNAP is a self-supervised framework for robust computation based on mutual agreement between data points, using agreement weights to suppress outliers without supervision.

Details

Motivation: The paper addresses the need for robust computational methods that can handle outliers in data without requiring supervision or prior knowledge about the data distribution.

Method: SNAP uses a Self-coNsistent Agreement Principle based on an Agreement-Reliability Hypothesis to assign weights that quantify mutual agreement between data points, emphasizing trustworthy items and downweighting outliers through exponential suppression of outlier weights.

Result: SNAP demonstrates superior performance over the iterative Weiszfeld algorithm and multivariate median of means variants in vector averaging and subspace estimation tasks, showing effective outlier suppression even in high-dimensional settings.

Conclusion: SNAP provides a flexible, easy-to-use, broadly applicable approach to robust computation that doesn’t require supervision or prior knowledge about data distributions.

Abstract: We introduce SNAP (Self-coNsistent Agreement Principle), a self-supervised framework for robust computation based on mutual agreement. Based on an Agreement-Reliability Hypothesis SNAP assigns weights that quantify agreement, emphasizing trustworthy items and downweighting outliers without supervision or prior knowledge. A key result is the Exponential Suppression of Outlier Weights, ensuring that outliers contribute negligibly to computations, even in high-dimensional settings. We study properties of SNAP weighting scheme and show its practical benefits on vector averaging and subspace estimation. Particularly, we demonstrate that non-iterative SNAP outperforms the iterative Weiszfeld algorithm and two variants of multivariate median of means. SNAP thus provides a flexible, easy-to-use, broadly applicable approach to robust computation.

[1297] Robust Domain Generalization under Divergent Marginal and Conditional Distributions

Jewon Yeom, Kyubyung Chae, Hyunggyu Lim, Yoonna Oh, Dongyoon Yang, Taesup Kim

Main category: cs.LG

TL;DR: A unified framework for domain generalization addressing both marginal and conditional distribution shifts through novel risk bound decomposition and meta-learning.

Details

Motivation: Most domain generalization methods only address conditional distribution shifts (P(X|Y)) while assuming stable label distributions (P(Y)), but real-world scenarios often involve compound shifts where both marginal and conditional distributions vary simultaneously.

Method: Proposes a unified framework with novel risk bound decomposition that explicitly separates marginal and conditional distribution components, then uses meta-learning to minimize and validate this bound across seen domains for generalization to unseen ones.

Result: Achieves state-of-the-art performance on conventional DG benchmarks and challenging multi-domain long-tailed recognition settings where both marginal and conditional shifts are pronounced.

Conclusion: The framework effectively handles compound distribution shifts in domain generalization by jointly addressing both marginal and conditional distribution variations through theoretical risk bound decomposition and practical meta-learning optimization.

Abstract: Domain generalization (DG) aims to learn predictive models that can generalize to unseen domains. Most existing DG approaches focus on learning domain-invariant representations under the assumption of conditional distribution shift (i.e., primarily addressing changes in $P(X\mid Y)$ while assuming $P(Y)$ remains stable). However, real-world scenarios with multiple domains often involve compound distribution shifts where both the marginal label distribution $P(Y)$ and the conditional distribution $P(X\mid Y)$ vary simultaneously. To address this, we propose a unified framework for robust domain generalization under divergent marginal and conditional distributions. We derive a novel risk bound for unseen domains by explicitly decomposing the joint distribution into marginal and conditional components and characterizing risk gaps arising from both sources of divergence. To operationalize this bound, we design a meta-learning procedure that minimizes and validates the proposed risk bound across seen domains, ensuring strong generalization to unseen ones. Empirical evaluations demonstrate that our method achieves state-of-the-art performance not only on conventional DG benchmarks but also in challenging multi-domain long-tailed recognition settings where both marginal and conditional shifts are pronounced.

[1298] DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

Main category: cs.LG

TL;DR: DASH is a faster implementation of Distributed Shampoo optimizer using 3D tensor stacking for GPU efficiency and novel Newton-DB iteration with Chebyshev approximations for matrix root computation.

Details

Motivation: Shampoo is an effective second-order optimizer that produces models with lower activation outliers and better compressibility, but its computational cost is prohibitive due to expensive internal operations, limiting practical adoption.

Method: Proposes DASH with two main techniques: 1) stacking preconditioner blocks into 3D tensors to improve GPU utilization, and 2) introducing Newton-DB iteration and Chebyshev polynomial approximations for faster computation of inverse matrix roots required by Shampoo.

Result: Achieves up to 4.83× faster optimizer steps compared to well-optimized Distributed Shampoo, with Newton-DB attaining the lowest validation perplexity per iteration among all tested methods.

Conclusion: DASH significantly accelerates Shampoo optimization while maintaining convergence quality, making second-order optimization more practical for large-scale training.

Abstract: Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying Shampoo currently comes at the cost of significant computational slowdown, due to its expensive internal operations. In this paper, we take a significant step to address this shortcoming by proposing \method (for \textbf{D}istributed \textbf{A}ccelerated \textbf{SH}ampoo), a faster implementation of Distributed Shampoo based on two main new techniques: First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization; second, we introduce the Newton-DB iteration and the Chebyshev polynomial approximations as novel and faster approaches for computing the inverse matrix roots required by Shampoo. Along with these algorithmic contributions, we provide a first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to $4.83\times$ faster optimizer steps compared to the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.

[1299] Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker

Main category: cs.LG

TL;DR: A method for computing high-confidence performance guarantees for multi-task RL policies on unseen tasks, combining per-task confidence bounds with task-level generalization.

Details

Motivation: Current multi-task RL approaches lack formal performance guarantees, which are crucial for safety-critical deployments. The paper aims to provide provable guarantees for policy performance on unseen tasks.

Method: Introduces a generalization bound that composes (1) per-task lower confidence bounds from finite rollouts with (2) task-level generalization from finite sampled tasks, yielding high-confidence guarantees for new tasks from the same distribution.

Result: The guarantees are shown to be theoretically sound and informative at realistic sample sizes across state-of-the-art multi-task RL methods.

Conclusion: Provides a framework for computing formal performance guarantees in multi-task RL, addressing the safety-critical deployment gap in existing approaches.

Abstract: Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.

[1300] On Stability and Robustness of Diffusion Posterior Sampling for Bayesian Inverse Problems

Yiming Yang, Xiaoyuan Cheng, Yi He, Kaiyu Li, Wenxuan Yuan, Zhuo Sun

Main category: cs.LG

TL;DR: The paper proposes a robust diffusion posterior sampling method to address the lack of robustness in diffusion-based Bayesian inverse problem solvers when likelihood specifications are mismatched.

Details

Motivation: Diffusion models have become powerful priors for Bayesian inverse problems, but existing diffusion-based solvers rely on presumed likelihoods for observations. The link between likelihood and recovery quality is unclear, and these methods lack robustness when the presumed likelihood mismatches the true data generation process.

Method: The authors propose “robust diffusion posterior sampling” - a simple yet effective solution that is provably robust and compatible with existing gradient-based posterior samplers. They characterize posterior approximation error and prove stability of diffusion-based solvers, then address robustness issues.

Result: Empirical results on scientific inverse problems and natural image tasks validate the effectiveness and robustness of the proposed method, showing consistent performance improvements under challenging likelihood misspecifications.

Conclusion: The paper bridges the gap between likelihood and recovery quality in diffusion-based Bayesian inverse problem solvers, provides theoretical stability analysis, and offers a robust solution that maintains performance even with likelihood mismatches.

Abstract: Diffusion models have recently emerged as powerful learned priors for Bayesian inverse problems (BIPs). Diffusion-based solvers rely on a presumed likelihood for the observations in BIPs to guide the generation process. However, the link between likelihood and recovery quality for BIPs is unclear in previous works. We bridge this gap by characterizing the posterior approximation error and proving the \emph{stability} of the diffusion-based solvers. Meanwhile, an immediate result of our findings on stability demonstrates the lack of robustness in diffusion-based solvers, which remains unexplored. This can degrade performance when the presumed likelihood mismatches the unknown true data generation processes. To address this issue, we propose a simple yet effective solution, \emph{robust diffusion posterior sampling}, which is provably \emph{robust} and compatible with existing gradient-based posterior samplers. Empirical results on scientific inverse problems and natural image tasks validate the effectiveness and robustness of our method, showing consistent performance improvements under challenging likelihood misspecifications.

[1301] Learning to Route and Schedule LLMs from User Retrials via Contextual Queueing Bandits

Seoungbin Bae, Junyoung Son, Dabeen Lee

Main category: cs.LG

TL;DR: A joint routing and scheduling algorithm for LLM services that uses implicit feedback from user retrial behaviors to optimize query-LLM matching and prioritization while maintaining queue stability.

Details

Motivation: Current LLM service routing algorithms overlook two key challenges: (1) unsatisfied users retrying queries increases server backlog, and (2) explicit feedback requests degrade user experience. The paper aims to develop a more efficient system using implicit feedback from retrial behaviors.

Method: Proposes the Contextual Queueing Bandits with Multinomial Logit Feedback (CQB-MNL) framework to model query retrials and context-based learning of user preferences over LLMs. Develops Anytime CQB (ACQB) algorithm combining Thompson sampling with forced exploration at a decaying rate for efficient learning while maintaining queue stability.

Result: ACQB achieves cumulative regret of Õ(√t) for routing and queue length regret of Õ(t^{-1/4}) for any large t. Experiments on SPROUT, EmbedLLM, and RouterBench datasets show consistent outperformance over baselines, with query embeddings refined via contrastive learning and disjoint parameter model for LLM-specific learning.

Conclusion: The proposed framework effectively addresses LLM service challenges by leveraging implicit feedback from retrial behaviors, achieving both efficient learning and queue stability while improving user experience by avoiding explicit feedback requests.

Abstract: Explosive demands for LLMs often cause user queries to accumulate in server queues, requiring efficient routing (query-LLM matching) and scheduling (query prioritization) mechanisms. Several online algorithms are being deployed, but they overlook the following two key challenges inherent to conversational LLM services: (1) unsatisfied users may retry queries, increasing the server backlog, and (2) requests for explicit" feedback, such as ratings, degrade user experiences. In this paper, we develop a joint routing and scheduling algorithm that leverages implicit" feedback inferred from user retrial behaviors. The key idea is to propose and study the framework of contextual queueing bandits with multinomial logit feedback (CQB-MNL). CQB-MNL models query retrials, as well as context-based learning for user preferences over LLMs. Our algorithm, anytime CQB (ACQB), achieves efficient learning while maintaining queue stability by combining Thompson sampling with forced exploration at a decaying rate. We show that ACQB simultaneously achieves a cumulative regret of $\widetilde{\mathcal{O}}(\sqrt{t})$ for routing and a queue length regret of $\widetilde{\mathcal{O}}(t^{-1/4})$ for any large $t$. For experiments, we refine query embeddings via contrastive learning while adopting a disjoint parameter model to learn LLM-specific parameters. Experiments on SPROUT, EmbedLLM, and RouterBench datasets confirm that both algorithms consistently outperform baselines.

[1302] Two-Stage Grid Optimization for Group-wise Quantization of LLMs

Junhan Kim, Gukryeol Lee, Seungwoo Son, Jeewook Kim, Yongkweon Jeon

Main category: cs.LG

TL;DR: A two-stage optimization framework for group-wise quantization in LLMs that minimizes layer-wise reconstruction loss by incorporating input statistics and inter-group correlations, improving accuracy with minimal overhead.

Details

Motivation: Existing group-wise quantization methods like GPTQ neglect input statistics and inter-group correlations when determining group scales, leading to suboptimal accuracy due to mismatch with the goal of minimizing layer-wise reconstruction loss.

Method: Two-stage optimization: 1) Initialize each group scale to minimize group-wise reconstruction loss before GPTQ (incorporates input statistics), 2) Freeze integer weights from GPTQ and refine group scales using coordinate descent with closed-form update rule to minimize layer-wise reconstruction loss, preventing error accumulation from preceding layers.

Result: Experimental results show the method consistently enhances group-wise quantization, achieving higher accuracy with negligible computational overhead.

Conclusion: The proposed two-stage optimization framework effectively addresses limitations of existing group-wise quantization methods by explicitly minimizing layer-wise reconstruction loss through input-aware initialization and efficient refinement.

Abstract: Group-wise quantization is an effective strategy for mitigating accuracy degradation in low-bit quantization of large language models (LLMs). Among existing methods, GPTQ has been widely adopted due to its efficiency; however, it neglects input statistics and inter-group correlations when determining group scales, leading to a mismatch with its goal of minimizing layer-wise reconstruction loss. In this work, we propose a two-stage optimization framework for group scales that explicitly minimizes the layer-wise reconstruction loss. In the first stage, performed prior to GPTQ, we initialize each group scale to minimize the group-wise reconstruction loss, thereby incorporating input statistics. In the second stage, we freeze the integer weights obtained via GPTQ and refine the group scales to minimize the layer-wise reconstruction loss. To this end, we employ the coordinate descent algorithm and derive a closed-form update rule, which enables efficient refinement without costly numerical optimization. Notably, our derivation incorporates the quantization errors from preceding layers to prevent error accumulation. Experimental results demonstrate that our method consistently enhances group-wise quantization, achieving higher accuracy with negligible overhead.

[1303] BAPS: A Fine-Grained Low-Precision Scheme for Softmax in Attention via Block-Aware Precision reScaling

Zisheng Ye, Xiaoyu He, Maoyuan Song, Guoliang Qiu, Chao Liao, Chen Wu, Yonggang Sun, Zhichun Li, Xiaoru Xie, Yuanyong Luo, Hu Liu, Pinyan Lu, Heng Liao

Main category: cs.LG

TL;DR: A novel low-precision softmax workflow using HiF8 format and block-aware precision rescaling to address Transformer inference bottlenecks, reducing data movement and EXP2 unit area while maintaining accuracy.

Details

Motivation: Softmax has become the critical bottleneck in Transformer inference as matrix multiplication acceleration plateaus, due to limited data bandwidth between compute cores and high area cost of high-precision exponentiation units.

Method: Introduces a low-precision workflow using HiF8 (8-bit floating-point format) with block-aware precision rescaling for softmax, enabling matrix multiplication outputs constrained to 8-bit and computing exponentiations in low (8-bit) precision.

Result: Extensive evaluation on language models and multi-modal models confirms the method’s validity, halving required data movement bandwidth and substantially reducing EXP2 unit area without significant accuracy loss.

Conclusion: The work paves the way for doubling end-to-end inference throughput without increasing chip area and offers a concrete co-design path for future low-precision hardware and software by alleviating the vector computation bottleneck.

Abstract: As the performance gains from accelerating quantized matrix multiplication plateau, the softmax operation becomes the critical bottleneck in Transformer inference. This bottleneck stems from two hardware limitations: (1) limited data bandwidth between matrix and vector compute cores, and (2) the significant area cost of high-precision (FP32/16) exponentiation units (EXP2). To address these issues, we introduce a novel low-precision workflow that employs a specific 8-bit floating-point format (HiF8) and block-aware precision rescaling for softmax. Crucially, our algorithmic innovations make low-precision softmax feasible without the significant model accuracy loss that hampers direct low-precision approaches. Specifically, our design (i) halves the required data movement bandwidth by enabling matrix multiplication outputs constrained to 8-bit, and (ii) substantially reduces the EXP2 unit area by computing exponentiations in low (8-bit) precision. Extensive evaluation on language models and multi-modal models confirms the validity of our method. By alleviating the vector computation bottleneck, our work paves the way for doubling end-to-end inference throughput without increasing chip area, and offers a concrete co-design path for future low-precision hardware and software.

[1304] Scalable Spatio-Temporal SE(3) Diffusion for Long-Horizon Protein Dynamics

Nima Shoghi, Yuxuan Liu, Yuning Shen, Rob Brekelmans, Pan Li, Quanquan Gu

Main category: cs.LG

TL;DR: STAR-MD is a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales using joint spatio-temporal attention to overcome limitations of existing methods in long-horizon generation.

Details

Motivation: Molecular dynamics simulations are computationally expensive for biologically relevant timescales, and existing generative models struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics.

Method: STAR-MD uses a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding memory bottlenecks. It’s a scalable SE(3)-equivariant diffusion model designed for protein trajectory generation.

Result: On the ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics, substantially improving conformational coverage, structural validity, and dynamic fidelity. It successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically.

Conclusion: STAR-MD’s joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, addressing severe limitations in current models for long-horizon generation and paving the way for accelerated exploration of protein function.

Abstract: Molecular dynamics (MD) simulations remain the gold standard for studying protein dynamics, but their computational cost limits access to biologically relevant timescales. Recent generative models have shown promise in accelerating simulations, yet they struggle with long-horizon generation due to architectural constraints, error accumulation, and inadequate modeling of spatio-temporal dynamics. We present STAR-MD (Spatio-Temporal Autoregressive Rollout for Molecular Dynamics), a scalable SE(3)-equivariant diffusion model that generates physically plausible protein trajectories over microsecond timescales. Our key innovation is a causal diffusion transformer with joint spatio-temporal attention that efficiently captures complex space-time dependencies while avoiding the memory bottlenecks of existing methods. On the standard ATLAS benchmark, STAR-MD achieves state-of-the-art performance across all metrics–substantially improving conformational coverage, structural validity, and dynamic fidelity compared to previous methods. STAR-MD successfully extrapolates to generate stable microsecond-scale trajectories where baseline methods fail catastrophically, maintaining high structural quality throughout the extended rollout. Our comprehensive evaluation reveals severe limitations in current models for long-horizon generation, while demonstrating that STAR-MD’s joint spatio-temporal modeling enables robust dynamics simulation at biologically relevant timescales, paving the way for accelerated exploration of protein function.

[1305] Calibrating Adaptive Smoothing Methods for Freeway Traffic Reconstruction

Junyi Ji, Derek Gloudemans, Gergely Zachár, Matthew Nice, William Barbour, Daniel B. Work

Main category: cs.LG

TL;DR: Python implementation of Adaptive Smoothing Method (ASM) for traffic state reconstruction with end-to-end calibration using real-world data and PyTorch integration.

Details

Motivation: To provide a reproducible Python implementation of ASM for traffic state reconstruction with calibration using real-world ground truth data, addressing reproducibility challenges in traffic model calibration.

Method: Developed ASM implementation in PyTorch with end-to-end calibration formulated as parameterized kernel optimization problem, using data from sparse radar sensor network on full-state observation testbed.

Result: Evaluated results in terms of speed distribution, spatio-temporal error distribution, and spatial error to provide benchmark metrics; demonstrated usability across multiple freeways.

Conclusion: Provides reproducible benchmark for traffic reconstruction problems, discusses calibration challenges and ASM limitations, enabling integration with deep learning methods.

Abstract: The adaptive smoothing method (ASM) is a widely used approach for traffic state reconstruction. This article presents a Python implementation of ASM, featuring end-to-end calibration using real-world ground truth data. The calibration is formulated as a parameterized kernel optimization problem. The model is calibrated using data from a full-state observation testbed, with input from a sparse radar sensor network. The implementation is developed in PyTorch, enabling integration with various deep learning methods. We evaluate the results in terms of speed distribution, spatio-temporal error distribution, and spatial error to provide benchmark metrics for the traffic reconstruction problem. We further demonstrate the usability of the calibrated method across multiple freeways. Finally, we discuss the challenges of reproducibility in general traffic model calibration and the limitations of ASM. This article is reproducible and can serve as a benchmark for various freeway operation tasks.

[1306] DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen

Main category: cs.LG

TL;DR: DCoPilot: A hybrid framework using LLMs and hypernetworks for generative control policies in dynamic data centers, enabling automatic adaptation to changing workloads and SLAs.

Details

Motivation: Data centers with AI workloads have rapidly changing dynamics and service-level agreements, making manual design of reinforcement learning agents insufficient. There's a need for automatic policy generation that can adapt to evolving specifications without the lag of manual redesign.

Method: DCoPilot combines two generative paradigms: 1) LLMs for symbolic generation of structured reward functions, and 2) hypernetworks for parametric generation of policy weights. It operates through three phases: simulation scale-up for stress testing, meta policy distillation for training the hypernetwork, and online adaptation for zero-shot policy generation.

Result: Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies confirm the effectiveness of LLM-based reward generation for stable hypernetwork convergence.

Conclusion: DCoPilot successfully bridges the specification-to-policy gap in dynamic data center operations by combining symbolic and parametric generation, enabling timely adaptation to changing workloads and SLAs without manual intervention.

Abstract: Modern data centers (DCs) hosting artificial intelligence (AI)-dedicated devices operate at high power densities with rapidly varying workloads, making minute-level adaptation essential for safe and energy-efficient operation. However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC. This specification-to-policy lag causes a lack of timely, effective control policies, which may lead to service outages. To bridge the gap, we present DCoPilot, a hybrid framework for generative control policies in dynamic DC operation. DCoPilot synergizes two distinct generative paradigms, i.e., a large language model (LLM) that performs symbolic generation of structured reward forms, and a hypernetwork that conducts parametric generation of policy weights. DCoPilot operates through three coordinated phases: (i) simulation scale-up, which stress-tests reward candidates across diverse simulation-ready (SimReady) scenes; (ii) meta policy distillation, where a hypernetwork is trained to output policy weights conditioned on SLA and scene embeddings; and (iii) online adaptation, enabling zero-shot policy generation in response to updated specifications. Evaluated across five control task families spanning diverse DC components, DCoPilot achieves near-zero constraint violations and outperforms all baselines across specification variations. Ablation studies validate the effectiveness of LLM-based unified reward generation in enabling stable hypernetwork convergence.

[1307] AICD Bench: A Challenging Benchmark for AI-Generated Code Detection

Daniil Orel, Dilshod Azizov, Indraneil Paul, Yuxia Wang, Iryna Gurevych, Preslav Nakov

Main category: cs.LG

TL;DR: AICD Bench is a comprehensive benchmark for AI-generated code detection with 2M examples, 77 models across 11 families, and 9 programming languages, introducing realistic detection tasks beyond binary classification.

Details

Motivation: Existing AI-generated code detection datasets are narrow, typically limited to binary human-machine classification under in-distribution settings, creating a gap for realistic detection scenarios needed for authorship, accountability, and security concerns.

Method: Created AICD Bench with 2M examples spanning 77 models across 11 families and 9 programming languages. Introduced three realistic detection tasks: robust binary classification under distribution shifts, model family attribution, and fine-grained human-machine classification across human, machine, hybrid, and adversarial code.

Result: Extensive evaluation shows detection performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code, highlighting the challenge of current approaches.

Conclusion: AICD Bench serves as a unified, challenging evaluation suite to drive development of robust approaches for AI-generated code detection, addressing critical gaps in existing benchmarks.

Abstract: Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce $\emph{AICD Bench}$, the most comprehensive benchmark for AI-generated code detection. It spans $\emph{2M examples}$, $\emph{77 models}$ across $\emph{11 families}$, and $\emph{9 programming languages}$, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: ($\emph{i}$)~~$\emph{Robust Binary Classification}$ under distribution shifts in language and domain, ($\emph{ii}$)~~$\emph{Model Family Attribution}$, grouping generators by architectural lineage, and ($\emph{iii}$)~$\emph{Fine-Grained Human-Machine Classification}$ across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a $\emph{unified, challenging evaluation suite}$ to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench}.

[1308] Learning Half-Spaces from Perturbed Contrastive Examples

Aryan Alavi Razavi Ravari, Farnam Mansouri, Yuxin Chen, Valentio Iverson, Adish Singla, Sandra Zilles

Main category: cs.LG

TL;DR: The paper studies contrastive learning with noisy contrastive examples, where the quality of contrastive examples depends on distance to decision boundary via a noise function f.

Details

Motivation: Previous work assumed idealized contrastive examples at minimum distance, but real-world scenarios involve noisy or imperfect contrastive examples. The paper aims to understand how noise in contrastive examples affects learning efficiency.

Method: Introduces a noise model parameterized by function f, where contrastive example quality depends on distance to decision boundary. Analyzes active and passive learning in two settings: fixed maximum perturbation and stochastic perturbation. Focuses on one-dimensional thresholds and half-spaces under uniform distribution.

Result: Characterizes sample complexity dependence on noise function f. Shows that under certain conditions on f, contrastive examples can speed up learning in terms of asymptotic query complexity and expected query complexity.

Conclusion: Noisy contrastive examples can still provide learning benefits when properly modeled, with quality depending on distance to decision boundary. The analysis provides theoretical understanding of contrastive learning with imperfect examples.

Abstract: We study learning under a two-step contrastive example oracle, as introduced by Mansouri et. al. (2025), where each queried (or sampled) labeled example is paired with an additional contrastive example of opposite label. While Mansouri et al. assume an idealized setting, where the contrastive example is at minimum distance of the originally queried/sampled point, we introduce and analyze a mechanism, parameterized by a non-decreasing noise function $f$, under which this ideal contrastive example is perturbed. The amount of perturbation is controlled by $f(d)$, where $d$ is the distance of the queried/sampled point to the decision boundary. Intuitively, this results in higher-quality contrastive examples for points closer to the decision boundary. We study this model in two settings: (i) when the maximum perturbation magnitude is fixed, and (ii) when it is stochastic. For one-dimensional thresholds and for half-spaces under the uniform distribution on a bounded domain, we characterize active and passive contrastive sample complexity in dependence on the function $f$. We show that, under certain conditions on $f$, the presence of contrastive examples speeds up learning in terms of asymptotic query complexity and asymptotic expected query complexity.

Sunho Kim, Susik Yoon

Main category: cs.LG

TL;DR: BTTF framework improves long-term time series forecasting by using look-ahead augmentation and self-corrective refinement through ensemble of second-stage models with initial predictions.

Details

Motivation: Address the trade-off in long-term time series forecasting between parallel efficiency (direct multi-step methods) and temporal consistency (iterative multi-step methods), seeking to bridge this gap without complex architectures.

Method: Proposes Back to the Future (BTTF) framework that refines base models by ensembling second-stage models augmented with their initial predictions, using look-ahead augmentation and self-corrective refinement.

Result: Achieves accuracy gains up to 58%, improves long-horizon accuracy, mitigates instability of linear forecasting models, and shows stable improvements even with suboptimal first-stage models.

Conclusion: Leveraging model-generated forecasts as augmentation is a simple yet powerful way to enhance long-term prediction without requiring complex architectures.

Abstract: Long-term time series forecasting (LTSF) remains challenging due to the trade-off between parallel efficiency and sequential modeling of temporal coherence. Direct multi-step forecasting (DMS) methods enable fast, parallel prediction of all future horizons but often lose temporal consistency across steps, while iterative multi-step forecasting (IMS) preserves temporal dependencies at the cost of error accumulation and slow inference. To bridge this gap, we propose Back to the Future (BTTF), a simple yet effective framework that enhances forecasting stability through look-ahead augmentation and self-corrective refinement. Rather than relying on complex model architectures, BTTF revisits the fundamental forecasting process and refines a base model by ensembling the second-stage models augmented with their initial predictions. Despite its simplicity, our approach consistently improves long-horizon accuracy and mitigates the instability of linear forecasting models, achieving accuracy gains of up to 58% and demonstrating stable improvements even when the first-stage model is trained under suboptimal conditions. These results suggest that leveraging model-generated forecasts as augmentation can be a simple yet powerful way to enhance long-term prediction, even without complex architectures.

[1310] Active learning from positive and unlabeled examples

Farnam Mansouri, Sandra Zilles, Shai Ben-David

Main category: cs.LG

TL;DR: First theoretical analysis of label complexity in active PU learning where only positive labels are revealed probabilistically

Details

Motivation: Motivated by applications like advertising and anomaly detection where only positive examples can be reliably identified, and even then only probabilistically due to factors like user engagement or detection limitations

Method: Theoretical analysis of active PU learning with adaptive querying where labels are revealed only when instances are positive AND an independent coin flip succeeds, otherwise no information is provided

Result: Provides first theoretical bounds on label complexity for this active PU learning setting, establishing fundamental limits on learning efficiency

Conclusion: Establishes theoretical foundations for active PU learning with probabilistic positive label revelation, providing insights for practical applications where only positive feedback is available

Abstract: Learning from positive and unlabeled data (PU learning) is a weakly supervised variant of binary classification in which the learner receives labels only for (some) positively labeled instances, while all other examples remain unlabeled. Motivated by applications such as advertising and anomaly detection, we study an active PU learning setting where the learner can adaptively query instances from an unlabeled pool, but a queried label is revealed only when the instance is positive and an independent coin flip succeeds; otherwise the learner receives no information. In this paper, we provide the first theoretical analysis of the label complexity of active PU learning.

[1311] ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

Chu Zhao, Enneng Yang, Yuting Liu, Jianzhe Zhao, Guibing Guo

Main category: cs.LG

TL;DR: ECHO improves test-time RL for reasoning tasks by addressing rollout collapse and early pseudo-label bias through adaptive branching control and robust policy updates.

Details

Motivation: Existing test-time RL with tree-structured rollouts suffers from rollout collapse (where branching concentrates on high-entropy trajectories) and early pseudo-label bias causing premature policy sharpening and suppressed exploration.

Method: Proposes Entropy Confidence Hybrid Group Relative Policy Optimization (ECHO) with: 1) adaptive branching control using local entropy and group-level confidence, 2) online confidence-based pruning to terminate low-confidence branches, 3) confidence adaptive clipping, and 4) entropy-confidence hybrid advantage shaping for robust policy updates.

Result: ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks and generalizes more effectively under limited rollout budgets compared to prior methods.

Conclusion: ECHO effectively addresses rollout collapse and early pseudo-label bias in test-time RL, improving sampling efficiency and exploration for reasoning tasks.

Abstract: Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels constructed by majority voting. To reduce overhead and improve exploration, prior work introduces tree structured rollouts, which share reasoning prefixes and branch at key nodes to improve sampling efficiency. However, this paradigm still faces two challenges: (1) high entropy branching can trigger rollout collapse, where the branching budget concentrates on a few trajectories with consecutive high-entropy segments, rapidly reducing the number of effective branches; (2) early pseudo-labels are noisy and biased, which can induce self-reinforcing overfitting, causing the policy to sharpen prematurely and suppress exploration. To address these issues, we propose Entropy Confidence Hybrid Group Relative Policy Optimization (ECHO). During rollout, ECHO jointly leverages local entropy and group level confidence to adaptively control branch width, and further introduces online confidence-based pruning to terminate persistently low confidence branches, avoiding high entropy traps and mitigating collapse. During policy updates, ECHO employs confidence adaptive clipping and an entropy confidence hybrid advantage shaping approach to enhance training robustness and mitigate early stage bias. Experiments demonstrate that ECHO achieves consistent gains on multiple mathematical and visual reasoning benchmarks, and generalizes more effectively under a limited rollout budget.

[1312] Efficient Swap Regret Minimization in Combinatorial Bandits

Andreas Kontogiannis, Vasilis Pollatos, Panayotis Mertikopoulos, Ioannis Panageas

Main category: cs.LG

TL;DR: First efficient no-swap regret algorithm for combinatorial bandits with polylogarithmic dependence on exponentially large action space N

Details

Motivation: Existing combinatorial bandit algorithms focus on external regret minimization, but achieving no-swap regret with polylogarithmic dependence on N (exponentially large action space) has remained an open challenge

Method: Introduces a novel no-swap-regret learning algorithm that achieves sublinear swap regret with polylogarithmic dependence on N, with efficient per-iteration implementation across various applications

Result: Proposed algorithm achieves regret that scales polylogarithmically in N and is tight for combinatorial bandits class, with efficient implementation demonstrated across well-studied applications

Conclusion: Resolves the long-standing challenge of achieving no-swap regret with polylogarithmic dependence on exponentially large action spaces in combinatorial bandits

Abstract: This paper addresses the problem of designing efficient no-swap regret algorithms for combinatorial bandits, where the number of actions $N$ is exponentially large in the dimensionality of the problem. In this setting, designing efficient no-swap regret translates to sublinear – in horizon $T$ – swap regret with polylogarithmic dependence on $N$. In contrast to the weaker notion of external regret minimization - a problem which is fairly well understood in the literature - achieving no-swap regret with a polylogarithmic dependence on $N$ has remained elusive in combinatorial bandits. Our paper resolves this challenge, by introducing a no-swap-regret learning algorithm with regret that scales polylogarithmically in $N$ and is tight for the class of combinatorial bandits. To ground our results, we also demonstrate how to implement the proposed algorithm efficiently – that is, with a per-iteration complexity that also scales polylogarithmically in $N$ – across a wide range of well-studied applications.

[1313] SurvKAN: A Fully Parametric Survival Model Based on Kolmogorov-Arnold Networks

Marina Mastroleo, Alberto Archetti, Federico Mastroleo, Matteo Matteucci

Main category: cs.LG

TL;DR: SurvKAN: A fully parametric, time-continuous survival model using Kolmogorov-Arnold Networks (KANs) that eliminates proportional hazards constraints while maintaining interpretability for clinical applications.

Details

Motivation: Classical survival models like Cox have restrictive assumptions (linear relationships, proportional hazards) that fail to capture real-world clinical dynamics. Deep learning approaches improve expressivity but sacrifice interpretability, limiting clinical adoption where trust and transparency are crucial.

Method: Introduces SurvKAN, a fully parametric survival model based on KAN architectures that treats time as an explicit input to predict the log-hazard function directly. Uses learnable univariate functions for interpretability and enables end-to-end training on the full survival likelihood.

Result: Extensive experiments on standard survival benchmarks show SurvKAN achieves competitive or superior performance compared to classical and state-of-the-art baselines across concordance and calibration metrics. Interpretability analyses reveal clinically meaningful patterns aligned with medical domain knowledge.

Conclusion: SurvKAN addresses the trade-off between expressivity and interpretability in survival analysis by eliminating proportional hazards constraints while maintaining clinical interpretability through KAN architectures.

Abstract: Accurate prediction of time-to-event outcomes is critical for clinical decision-making, treatment planning, and resource allocation in modern healthcare. While classical survival models such as Cox remain widely adopted in standard practice, they rely on restrictive assumptions, including linear covariate relationships and proportional hazards over time, that often fail to capture real-world clinical dynamics. Recent deep learning approaches like DeepSurv and DeepHit offer improved expressivity but sacrifice interpretability, limiting clinical adoption where trust and transparency are paramount. Hybrid models incorporating Kolmogorov-Arnold Networks (KANs), such as CoxKAN, have begun to address this trade-off but remain constrained by the semi-parametric Cox framework. In this work we introduce SurvKAN, a fully parametric, time-continuous survival model based on KAN architectures that eliminates the proportional hazards constraint. SurvKAN treats time as an explicit input to a KAN that directly predicts the log-hazard function, enabling end-to-end training on the full survival likelihood. Our architecture preserves interpretability through learnable univariate functions that indicate how individual features influence risk over time. Extensive experiments on standard survival benchmarks demonstrate that SurvKAN achieves competitive or superior performance compared to classical and state-of-the-art baselines across concordance and calibration metrics. Additionally, interpretability analyses reveal clinically meaningful patterns that align with medical domain knowledge.

[1314] The Maximum von Neumann Entropy Principle: Theory and Applications in Machine Learning

Youqi Wu, Farzan Farnia

Main category: cs.LG

TL;DR: The paper extends the maximum entropy principle to von Neumann entropy (VNE) for kernel matrices, providing game-theoretic justification and applying it to kernel selection and completion problems.

Details

Motivation: While VNE has been adopted in machine learning as a spectral diversity measure for kernel matrices, there's no principled analogue of the classical maximum entropy framework with decision/game theoretic interpretation for VNE in data-driven contexts.

Method: Extends the minimax formulation of maximum entropy principle to von Neumann entropy, providing game-theoretic justification for VNE maximization over density matrices and trace-normalized PSD operators.

Result: Develops a robust interpretation of maximum VNE solutions under partial information and demonstrates applications to kernel selection from multiple normalized embeddings and kernel matrix completion.

Conclusion: The proposed framework offers a unifying information-theoretic foundation for VNE-based methods in kernel learning, connecting quantum information theory with machine learning applications.

Abstract: Von Neumann entropy (VNE) is a fundamental quantity in quantum information theory and has recently been adopted in machine learning as a spectral measure of diversity for kernel matrices and kernel covariance operators. While maximizing VNE under constraints is well known in quantum settings, a principled analogue of the classical maximum entropy framework, particularly its decision theoretic and game theoretic interpretation, has not been explicitly developed for VNE in data driven contexts. In this paper, we extend the minimax formulation of the maximum entropy principle due to Grünwald and Dawid to the setting of von Neumann entropy, providing a game-theoretic justification for VNE maximization over density matrices and trace-normalized positive semidefinite operators. This perspective yields a robust interpretation of maximum VNE solutions under partial information and clarifies their role as least committed inferences in spectral domains. We then illustrate how the resulting Maximum VNE principle applies to modern machine learning problems by considering two representative applications, selecting a kernel representation from multiple normalized embeddings via kernel-based VNE maximization, and completing kernel matrices from partially observed entries. These examples demonstrate how the proposed framework offers a unifying information-theoretic foundation for VNE-based methods in kernel learning.

[1315] Efficient Neural Controlled Differential Equations via Attentive Kernel Smoothing

Egor Serov, Ilya Kuleshov, Alexey Zaytsev

Main category: cs.LG

TL;DR: Proposes smoothed path construction for Neural CDEs using kernel/GP smoothing instead of splines, with multi-view attention to recover lost details, achieving better accuracy with fewer function evaluations.

Details

Motivation: Neural CDEs are powerful for sequence modeling but suffer from inefficiency due to rough driving paths from standard splines, which force adaptive solvers to take small steps and increase computational cost.

Method: Replace exact interpolation with kernel and Gaussian Process smoothing for explicit regularity control. Add attention-based Multi-View CDE (MV-CDE) and convolutional extension (MVC-CDE) with learnable queries to recover details lost during smoothing, distributing representational capacity across multiple trajectories.

Result: MVC-CDE with GP achieves state-of-the-art accuracy while significantly reducing Number of Function Evaluations (NFE) and total inference time compared to spline-based baselines.

Conclusion: The proposed smoothed path construction with multi-view attention improves Neural CDE efficiency and accuracy by controlling trajectory regularity and recovering lost details through distributed representation.

Abstract: Neural Controlled Differential Equations (Neural CDEs) provide a powerful continuous-time framework for sequence modeling, yet the roughness of the driving control path often restricts their efficiency. Standard splines introduce high-frequency variations that force adaptive solvers to take excessively small steps, driving up the Number of Function Evaluations (NFE). We propose a novel approach to Neural CDE path construction that replaces exact interpolation with Kernel and Gaussian Process (GP) smoothing, enabling explicit control over trajectory regularity. To recover details lost during smoothing, we propose an attention-based Multi-View CDE (MV-CDE) and its convolutional extension (MVC-CDE), which employ learnable queries to inform path reconstruction. This framework allows the model to distribute representational capacity across multiple trajectories, each capturing distinct temporal patterns. Empirical results demonstrate that our method, MVC-CDE with GP, achieves state-of-the-art accuracy while significantly reducing NFEs and total inference time compared to spline-based baselines.

[1316] State Rank Dynamics in Linear Attention LLMs

Ao Sun, Hongtao Zhang, Heng Zhou, Yixuan Ma, Yiran Qin, Tongrui Su, Yan Liu, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Main category: cs.LG

TL;DR: Analysis of internal state dynamics in Linear Attention LLMs reveals State Rank Stratification - a bifurcation where some heads maintain near-zero rank while others grow to an upper bound, with low-rank heads being crucial for reasoning and high-rank heads being redundant.

Details

Motivation: Linear Attention LLMs compress context into fixed-size state matrices for constant-time inference, but the internal dynamics of these compressed states remain poorly understood. The paper aims to study the runtime state dynamics of state-of-the-art Linear Attention models to uncover fundamental patterns and functional implications.

Method: Comprehensive study of runtime state dynamics in Linear Attention models, analyzing spectral properties of attention heads across diverse inference contexts. The research identifies State Rank Stratification phenomenon and uses diagnostic probes to understand functional divergence between low-rank and high-rank heads.

Result: Discovery of State Rank Stratification: linear attention heads bifurcate into two groups - one maintaining near-zero effective rank, the other growing rapidly to an upper bound. This pattern remains consistent across contexts, indicating it’s an intrinsic structural property from pre-training. Low-rank heads are essential for reasoning while high-rank heads show significant redundancy.

Conclusion: The State Rank Stratification phenomenon reveals fundamental architectural properties of Linear Attention LLMs. The functional divergence between head types enables practical applications like Joint Rank-Norm Pruning, which reduces KV-cache overhead by 38.9% while maintaining accuracy, demonstrating the utility of understanding internal state dynamics.

Abstract: Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study on the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon termed State Rank Stratification, characterized by a distinct spectral bifurcation among linear attention heads: while one group maintains an effective rank oscillating near zero, the other exhibits rapid growth that converges to an upper bound. Extensive experiments across diverse inference contexts reveal that these dynamics remain strikingly consistent, indicating that the identity of a head,whether low-rank or high-rank,is an intrinsic structural property acquired during pre-training, rather than a transient state dependent on the input data. Furthermore, our diagnostic probes reveal a surprising functional divergence: low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. Leveraging this insight, we propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.

[1317] Generating Causal Temporal Interaction Graphs for Counterfactual Validation of Temporal Link Prediction

Aniq Ur Rahman, Justin P. Coon

Main category: cs.LG

TL;DR: A framework for counterfactual validation of temporal link prediction models using causal temporal interaction graphs with known ground-truth causal structure.

Details

Motivation: Current temporal link prediction models are evaluated based on predictive accuracy, but this doesn't assess whether they capture the underlying causal mechanisms governing temporal interactions. There's a need for causality-aware benchmarking.

Method: Proposes a framework for counterfactual validation using causal temporal interaction graphs (CTIGs). Introduces a structural equation model for continuous-time event sequences with excitatory/inhibitory effects, extends to temporal interaction graphs, proposes a cross-model predictive error distance metric, and instantiates evaluation under controlled causal shifts and timestamp shuffling.

Result: Validates the hypothesis that predictors trained on one causal model degrade when evaluated on sufficiently distant models. Provides a foundation for causality-aware benchmarking of temporal link prediction models.

Conclusion: The framework enables counterfactual validation of temporal link prediction models, moving beyond predictive accuracy to assess causal understanding, which is crucial for reliable deployment in real-world applications.

Abstract: Temporal link prediction (TLP) models are commonly evaluated based on predictive accuracy, yet such evaluations do not assess whether these models capture the causal mechanisms that govern temporal interactions. In this work, we propose a framework for counterfactual validation of TLP models by generating causal temporal interaction graphs (CTIGs) with known ground-truth causal structure. We first introduce a structural equation model for continuous-time event sequences that supports both excitatory and inhibitory effects, and then extend this mechanism to temporal interaction graphs. To compare causal models, we propose a distance metric based on cross-model predictive error, and empirically validate the hypothesis that predictors trained on one causal model degrade when evaluated on sufficiently distant models. Finally, we instantiate counterfactual evaluation under (i) controlled causal shifts between generating models and (ii) timestamp shuffling as a stochastic distortion with measurable causal distance. Our framework provides a foundation for causality-aware benchmarking.

[1318] Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models

Xindian Ma, Yidi Lu, Peng Zhang, Jing Zhang

Main category: cs.LG

TL;DR: HAE is a KV cache eviction framework for multimodal LLMs that optimizes text-visual token interaction through dual-attention pruning and dynamic decoding strategies, reducing memory by 41% with minimal accuracy loss.

Details

Motivation: Existing KV cache eviction strategies fail to address heterogeneous attention distributions between visual and text tokens in multimodal LLMs, leading to suboptimal efficiency or degraded performance despite the quadratic memory/computational costs of Transformer architectures.

Method: Hierarchical Adaptive Eviction (HAE) implements Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding, with index broadcasting to reduce computational overhead.

Result: HAE reduces KV-Cache memory by 41% with only 0.3% accuracy drop in image understanding tasks, and accelerates story generation inference by 1.5x while maintaining output quality on Phi3.5-Vision-Instruct model.

Conclusion: HAE effectively addresses the efficiency bottleneck in multimodal LLMs by optimizing text-visual token interactions, providing a practical solution for KV cache management that balances performance and computational efficiency.

Abstract: The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV-Cache memory by 41% with minimal accuracy loss (0.3% drop) in image understanding tasks and accelerates story generation inference by 1.5x while maintaining output quality on Phi3.5-Vision-Instruct model.

[1319] Interpretable Tabular Foundation Models via In-Context Kernel Regression

Ratmir Miftachov, Bruno Charron, Simon Valentin

Main category: cs.LG

TL;DR: KernelICL enhances tabular foundation models with interpretability by replacing final prediction layers with explicit kernel functions, enabling transparent weighted-average predictions while maintaining performance.

Details

Motivation: Tabular foundation models achieve state-of-the-art performance but remain fundamentally opaque. The authors aim to enhance these models with quantifiable sample-based interpretability while maintaining performance.

Method: Replace the final prediction layer of tabular foundation models with explicit kernel functions (Gaussian, dot-product, kNN) so predictions become transparent weighted averages of training labels. Introduce a two-dimensional taxonomy unifying kernel methods, neighbor-based approaches, and attention mechanisms under a single framework.

Result: On 55 TALENT benchmark datasets, KernelICL achieves performance on par with existing tabular foundation models while providing inspectable predictions through quantifiable weight distributions over training samples.

Conclusion: Explicit kernel constraints on the final layer enable inspectable predictions in tabular foundation models without sacrificing performance, providing a framework for interpretable in-context learning.

Abstract: Tabular foundation models like TabPFN and TabICL achieve state-of-the-art performance through in-context learning, yet their architectures remain fundamentally opaque. We introduce KernelICL, a framework to enhance tabular foundation models with quantifiable sample-based interpretability. Building on the insight that in-context learning is akin to kernel regression, we make this mechanism explicit by replacing the final prediction layer with kernel functions (Gaussian, dot-product, kNN) so that every prediction is a transparent weighted average of training labels. We introduce a two-dimensional taxonomy that formally unifies standard kernel methods, modern neighbor-based approaches, and attention mechanisms under a single framework, and quantify inspectability via the perplexity of the weight distribution over training samples. On 55 TALENT benchmark datasets, KernelICL achieves performance on par with existing tabular foundation models, demonstrating that explicit kernel constraints on the final layer enable inspectable predictions without sacrificing performance.

[1320] Cardinality-Preserving Structured Sparse Graph Transformers for Molecular Property Prediction

Abhijit Gupta

Main category: cs.LG

TL;DR: CardinalGraphFormer is a graph transformer for molecular property prediction that incorporates structural biases and cardinality-preserving aggregation, achieving state-of-the-art results across 11 molecular benchmarks through self-supervised pretraining.

Details

Motivation: Drug discovery requires efficient molecular property prediction with limited labeled data due to the vast chemical space (~10^60 molecules) vs. only thousands of approved drugs. Self-supervised pretraining on large unlabeled molecular corpora is essential for data-efficient molecular representation learning.

Method: Proposes CardinalGraphFormer, a graph transformer with Graphormer-inspired structural biases (shortest-path distance, centrality, direct-bond edge bias) within structured sparse attention limited to shortest-path distance ≤ 3. Adds cardinality-preserving unnormalized aggregation channel over same support set. Uses contrastive graph-level alignment with masked attribute reconstruction for pretraining.

Result: Improves mean performance across all 11 evaluated tasks and achieves statistically significant gains on 10 of 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET tasks compared to strong reproduced baselines under fully matched evaluation protocol.

Conclusion: CardinalGraphFormer demonstrates effective incorporation of structural biases and cardinality-preserving aggregation for molecular representation learning, achieving superior performance in molecular property prediction tasks through self-supervised pretraining.

Abstract: Drug discovery motivates efficient molecular property prediction under limited labeled data. Chemical space is vast, often estimated at approximately 10^60 drug-like molecules, while only thousands of drugs have been approved. As a result, self-supervised pretraining on large unlabeled molecular corpora has become essential for data-efficient molecular representation learning. We introduce CardinalGraphFormer, a graph transformer that incorporates Graphormer-inspired structural biases, including shortest-path distance and centrality, as well as direct-bond edge bias, within a structured sparse attention regime limited to shortest-path distance <= 3. The model further augments this design with a cardinality-preserving unnormalized aggregation channel over the same support set. Pretraining combines contrastive graph-level alignment with masked attribute reconstruction. Under a fully matched evaluation protocol, CardinalGraphFormer improves mean performance across all 11 evaluated tasks and achieves statistically significant gains on 10 of 11 public benchmarks spanning MoleculeNet, OGB, and TDC ADMET tasks when compared to strong reproduced baselines.

[1321] Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents

Pengfei He, Ash Fox, Lesly Miculicich, Stefan Friedli, Daniel Fabian, Burak Gokturk, Jiliang Tang, Chen-Yu Lee, Tomas Pfister, Long T. Le

Main category: cs.LG

TL;DR: Co-RedTeam is a multi-agent framework for automated vulnerability discovery and exploitation that integrates security knowledge, code analysis, execution feedback, and memory to improve cybersecurity red-teaming.

Details

Motivation: Current LLM approaches for cybersecurity tasks struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and lack of experience reuse, necessitating a more robust framework.

Method: A security-aware multi-agent framework that decomposes vulnerability analysis into coordinated discovery and exploitation stages, with agents that plan, execute, validate, and refine actions based on real execution feedback while learning from prior trajectories through long-term memory.

Result: Co-RedTeam consistently outperforms strong baselines across diverse backbone models, achieving over 60% success rate in vulnerability exploitation and over 10% absolute improvement in vulnerability detection on challenging security benchmarks.

Conclusion: The framework demonstrates the critical role of execution feedback, structured interaction, and memory for building robust and generalizable cybersecurity agents, advancing automated red-teaming capabilities.

Abstract: Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation due to limited interaction, weak execution grounding, and a lack of experience reuse. We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming workflows by integrating security-domain knowledge, code-aware analysis, execution-grounded iterative reasoning, and long-term memory. Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions based on real execution feedback while learning from prior trajectories. Extensive evaluations on challenging security benchmarks demonstrate that Co-RedTeam consistently outperforms strong baselines across diverse backbone models, achieving over 60% success rate in vulnerability exploitation and over 10% absolute improvement in vulnerability detection. Ablation and iteration studies further confirm the critical role of execution feedback, structured interaction, and memory for building robust and generalizable cybersecurity agents.

[1322] Generating Physically Sound Designs from Text and a Set of Physical Constraints

Gregory Barber, Todd C. Henry, Mulugeta A. Haile

Main category: cs.LG

TL;DR: TIDES is a text-informed design approach that generates physically sound designs by jointly optimizing structural topology and visual properties using text prompts and differentiable physics simulation.

Details

Motivation: The paper aims to bridge the gap between textual design descriptions and physically sound engineering designs by creating a system that can generate designs that both satisfy engineering requirements and align with visual/textual specifications.

Method: TIDES uses a pre-trained text-image model to measure visual alignment with text prompts and a differentiable physics simulator to measure physical performance. It jointly optimizes structural (topology) and visual properties through this dual evaluation framework.

Result: The system successfully generates designs that satisfy engineering requirements (compliance and density) while incorporating features specified by text prompts. It was evaluated on structural optimization problems under different conditions and validated experimentally with 3D printed beams.

Conclusion: TIDES demonstrates the feasibility of text-informed design generation that balances both physical performance and visual/textual alignment, opening possibilities for AI-assisted design systems that understand both engineering constraints and human design intent.

Abstract: We present TIDES, a text informed design approach for generating physically sound designs based on a textual description and a set of physical constraints. TIDES jointly optimizes structural (topology) and visual properties. A pre-trained text-image model is used to measure the design’s visual alignment with a text prompt and a differentiable physics simulator is used to measure its physical performance. We evaluate TIDES on a series of structural optimization problems operating under different load and support conditions, at different resolutions, and experimentally in the lab by performing the 3-point bending test on 2D beam designs that are extruded and 3D printed. We find that it can jointly optimize the two objectives and return designs that satisfy engineering design requirements (compliance and density) while utilizing features specified by text.

[1323] Generalized Optimal Classification Trees: A Mixed-Integer Programming Approach

Jiancheng Tu, Wenqi Fan, Zhibin Wu

Main category: cs.LG

TL;DR: A mixed-integer programming framework for learning optimal classification trees that can optimize nonlinear performance metrics like F1-score, addressing class imbalance with acceleration techniques for scalability.

Details

Motivation: Decision trees are important for interpretable machine learning, but global optimization has been challenging. Recent advances in discrete optimization enable practical algorithms for optimal classification trees. The paper aims to address class imbalance by optimizing nonlinear metrics like F1-score, which traditional methods struggle with.

Method: Proposes a mixed-integer programming (MIP) framework for learning optimal classification trees under nonlinear performance metrics. Develops problem-specific acceleration techniques including: 1) tailored branch-and-cut algorithm, 2) instance-reduction scheme, and 3) warm-start strategies to improve scalability.

Result: Evaluated on 50 benchmark datasets. The framework efficiently optimizes nonlinear metrics while achieving strong predictive performance and reduced solution times compared with existing methods.

Conclusion: The MIP-based framework successfully addresses the challenge of optimizing nonlinear performance metrics for decision trees, offering improved handling of class imbalance while maintaining interpretability and achieving computational efficiency through specialized acceleration techniques.

Abstract: Global optimization of decision trees is a long-standing challenge in combinatorial optimization, yet such models play an important role in interpretable machine learning. Although the problem has been investigated for several decades, only recent advances in discrete optimization have enabled practical algorithms for solving optimal classification tree problems on real-world datasets. Mixed-integer programming (MIP) offers a high degree of modeling flexibility, and we therefore propose a MIP-based framework for learning optimal classification trees under nonlinear performance metrics, such as the F1-score, that explicitly addresses class imbalance. To improve scalability, we develop problem-specific acceleration techniques, including a tailored branch-and-cut algorithm, an instance-reduction scheme, and warm-start strategies. We evaluate the proposed approach on 50 benchmark datasets. The computational results show that the framework can efficiently optimize nonlinear metrics while achieving strong predictive performance and reduced solution times compared with existing methods.

[1324] Spectral Superposition: A Theory of Feature Geometry

Georgi Ivanov, Narmeen Oozeer, Shivam Raval, Tasana Pejovic, Shriyash Upadhyay, Amir Abdullah

Main category: cs.LG

TL;DR: Theoretical framework using spectral analysis of weight matrices to study geometric structure of features in neural networks, particularly in superposition regimes.

Details

Motivation: Current methods for analyzing neural network features discard geometric structure when decomposing activations into sparse linear features. There's a need to understand how features interact globally in representational space, especially in superposition where features share dimensions.

Method: Develops spectral theory using the frame operator F = WW⊤ to analyze weight matrices. Studies eigenvalues, eigenspaces, and spectral measures to understand how features allocate norm across eigenspaces. Applies operator theory to interpretability.

Result: In toy models of superposition, proves that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes (classifying geometries like simplices, polygons, antiprisms).

Conclusion: Spectral methods capture global geometry of feature interactions beyond pairwise analysis. The framework enables diagnosis of feature localization in arbitrary weight matrices and points toward applying operator theory to neural network interpretability.

Abstract: Neural networks represent more features than they have dimensions via superposition, forcing features to share representational space. Current methods decompose activations into sparse linear features but discard geometric structure. We develop a theory for studying the geometric structre of features by analyzing the spectra (eigenvalues, eigenspaces, etc.) of weight derived matrices. In particular, we introduce the frame operator $F = WW^\top$, which gives us a spectral measure that describes how each feature allocates norm across eigenspaces. While previous tools could describe the pairwise interactions between features, spectral methods capture the global geometry (``how do all features interact?’’). In toy models of superposition, we use this theory to prove that capacity saturation forces spectral localization: features collapse onto single eigenspaces, organize into tight frames, and admit discrete classification via association schemes, classifying all geometries from prior work (simplices, polygons, antiprisms). The spectral measure formalism applies to arbitrary weight matrices, enabling diagnosis of feature localization beyond toy settings. These results point toward a broader program: applying operator theory to interpretability.

[1325] STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs

Weikang Meng, Liangyu Huo, Yadan Luo, Jiawen Guan, Jingyi Zhang, Yingjian Li, Zheng Zhang

Main category: cs.LG

TL;DR: STILL is an intra-layer hybrid linearization framework for efficiently linearizing pretrained LLMs, combining sparse softmax attention for salient tokens with linear attention for remaining context, while preserving pretrained representations through norm-preserved feature maps.

Details

Motivation: Existing linearization methods have limitations: 1) token routing based on sliding-window partitions leads to position-based selection rather than token-specific global importance, 2) linear attention suffers from distribution shift due to learnable feature maps that distort pretrained feature magnitudes.

Method: STILL introduces Self-Saliency Score with local-global consistency for accurate token selection, retains salient tokens for sparse softmax attention while summarizing remaining context via linear attention, uses Norm-Preserved Feature Map (NP-Map) to decouple feature direction from magnitude and reinject pretrained norms, and employs unified training-inference architecture with chunk-wise parallelization and delayed selection.

Result: STILL matches or surpasses original pretrained models on commonsense and general reasoning tasks, and achieves up to 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.

Conclusion: STILL effectively addresses limitations of existing linearization methods by combining accurate token selection with representation preservation, achieving both computational efficiency and performance maintenance.

Abstract: Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on sliding-window partitions, resulting in position-based selection and fails to capture token-specific global importance. Meanwhile, linear attention further suffers from distribution shift caused by learnable feature maps that distort pretrained feature magnitudes. Motivated by these limitations, we propose STILL, an intra-layer hybrid linearization framework for efficiently linearizing LLMs. STILL introduces a Self-Saliency Score with strong local-global consistency, enabling accurate token selection using sliding-window computation, and retains salient tokens for sparse softmax attention while summarizing the remaining context via linear attention. To preserve pretrained representations, we design a Norm-Preserved Feature Map (NP-Map) that decouples feature direction from magnitude and reinjects pretrained norms. We further adopt a unified training-inference architecture with chunk-wise parallelization and delayed selection to improve hardware efficiency. Experiments show that STILL matches or surpasses the original pretrained model on commonsense and general reasoning tasks, and achieves up to a 86.2% relative improvement over prior linearized attention methods on long-context benchmarks.

[1326] SEDformer: Event-Synchronous Spiking Transformers for Irregular Telemetry Time Series Forecasting

Ziyu Zhou, Yuchen Fang, Weilin Ruan, Shiyu Wang, James Kwok, Yuxuan Liang

Main category: cs.LG

TL;DR: SEDformer: A spiking transformer model for irregular multivariate time series forecasting that leverages the Sparsity-Event Duality property using event-driven spiking neural networks for efficient and accurate telemetry forecasting.

Details

Motivation: Existing Graph- and Transformer-based forecasters fail to properly handle the Sparsity-Event Duality (SED) property of irregular multivariate time series (IMTS) from telemetry streams, where long sparse periods are punctuated by dense event bursts. Current methods violate sparsity through padding and disrupt event semantics through relational recasting.

Method: SEDformer uses Spiking Neural Networks that naturally align with SED through sparse binary spikes and event-driven updates. It includes: (1) SED-based Spike Encoder with Event-Aligned LIF neurons, (2) Event-Preserving Temporal Downsampling to compress gaps while keeping salient events, and (3) SED-based Spike Transformer blocks with membrane-based linear attention.

Result: Experiments on public telemetry IMTS datasets show SEDformer achieves state-of-the-art forecasting accuracy while significantly reducing energy and memory usage compared to existing methods.

Conclusion: SEDformer provides a natural and efficient modeling paradigm for IMTS that faithfully respects the Sparsity-Event Duality property, offering both accuracy and computational efficiency advantages for telemetry forecasting.

Abstract: Telemetry streams from large-scale Internet-connected systems (e.g., IoT deployments and online platforms) naturally form an irregular multivariate time series (IMTS) whose accurate forecasting is operationally vital. A closer examination reveals a defining Sparsity-Event Duality (SED) property of IMTS, i.e., long stretches with sparse or no observations are punctuated by short, dense bursts where most semantic events (observations) occur. However, existing Graph- and Transformer-based forecasters ignore SED: pre-alignment to uniform grids with heavy padding violates sparsity by inflating sequences and forcing computation at non-informative steps, while relational recasting weakens event semantics by disrupting local temporal continuity. These limitations motivate a more faithful and natural modeling paradigm for IMTS that aligns with its SED property. We find that Spiking Neural Networks meet this requirement, as they communicate via sparse binary spikes and update in an event-driven manner, aligning naturally with the SED nature of IMTS. Therefore, we present SEDformer, an SED-enhanced Spiking Transformer for telemetry IMTS forecasting that couples: (1) a SED-based Spike Encoder converts raw observations into event synchronous spikes using an Event-Aligned LIF neuron, (2) an Event-Preserving Temporal Downsampling module compresses long gaps while retaining salient firings and (3) a stack of SED-based Spike Transformer blocks enable intra-series dependency modeling with a membrane-based linear attention driven by EA-LIF spiking features. Experiments on public telemetry IMTS datasets show that SEDformer attains state-of-the-art forecasting accuracy while reducing energy and memory usage, providing a natural and efficient path for modeling IMTS.

[1327] ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning

Jie Xiao, Meng Chen, Qingnan Ren, Song Jingwei, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Lynn Ai, Eric Yang, Bill Shi

Main category: cs.LG

TL;DR: ECHO-2 is a distributed reinforcement learning framework for post-training LLMs that enables efficient wide-area coordination between rollout generation, reward evaluation, and centralized learning while managing policy staleness.

Details

Motivation: Current RL post-training for LLMs involves repeated interactions between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution across cost-efficient inference resources introduces challenges in wide-area coordination and policy dissemination latency.

Method: ECHO-2 combines centralized learning with distributed rollouts, treating bounded policy staleness as a user-controlled parameter to overlap rollout generation, dissemination, and training. It uses an overlap-based capacity model for provisioning, peer-assisted pipelined broadcast for dissemination bottlenecks, and cost-aware activation of heterogeneous workers.

Result: Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

Conclusion: ECHO-2 provides an effective distributed RL framework for post-training LLMs that addresses wide-area coordination challenges, improves cost efficiency, and maintains training quality through controlled policy staleness and efficient resource utilization.

Abstract: Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

[1328] Geometry- and Relation-Aware Diffusion for EEG Super-Resolution

Laura Yao, Gengwei Zhang, Moajjem Chowdhury, Yunmei Liu, Tianlong Chen

Main category: cs.LG

TL;DR: TopoDiff: A geometry- and relation-aware diffusion model for EEG spatial super-resolution that incorporates topology-aware image embeddings and dynamic channel-relation graphs to improve spatial generation performance.

Details

Motivation: Current EEG spatial super-resolution methods lack awareness of physiological spatial structure, constraining their spatial generation performance. The paper aims to address this by incorporating how human experts interpret spatial EEG patterns.

Method: TopoDiff uses topology-aware image embeddings derived from EEG topographic representations to provide global geometric context, combined with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics.

Result: The method achieves substantial gains in generation fidelity across multiple EEG datasets (SEED/SEED-IV for emotion recognition, PhysioNet motor imagery, TUSZ for seizure detection) and leads to notable improvements in downstream EEG task performance.

Conclusion: TopoDiff provides a spatially grounded EEG spatial super-resolution framework with consistent performance improvements by incorporating physiological spatial structure awareness through topology embeddings and dynamic relation graphs.

Abstract: Recent electroencephalography (EEG) spatial super-resolution (SR) methods, while showing improved quality by either directly predicting missing signals from visible channels or adapting latent diffusion-based generative modeling to temporal data, often lack awareness of physiological spatial structure, thereby constraining spatial generation performance. To address this issue, we introduce TopoDiff, a geometry- and relation-aware diffusion model for EEG spatial super-resolution. Inspired by how human experts interpret spatial EEG patterns, TopoDiff incorporates topology-aware image embeddings derived from EEG topographic representations to provide global geometric context for spatial generation, together with a dynamic channel-relation graph that encodes inter-electrode relationships and evolves with temporal dynamics. This design yields a spatially grounded EEG spatial super-resolution framework with consistent performance improvements. Across multiple EEG datasets spanning diverse applications, including SEED/SEED-IV for emotion recognition, PhysioNet motor imagery (MI/MM), and TUSZ for seizure detection, our method achieves substantial gains in generation fidelity and leads to notable improvements in downstream EEG task performance.

[1329] Fat-Cat: Document-Driven Metacognitive Multi-Agent System for Complex Reasoning

Tong Yang, Yemin Wang, Chaoning Zhang, Aming Wu

Main category: cs.LG

TL;DR: Fat-Cat is a document-driven agent architecture that improves LLM-based agent performance by using Markdown documents for state representation instead of rigid JSON, reducing syntactic overhead and enhancing semantic reasoning.

Details

Motivation: Existing LLM-based agent frameworks use rigid, syntax-heavy state representations like nested JSON, which forces models to devote substantial attention to syntactic processing rather than semantic reasoning, limiting agent effectiveness despite model capacity.

Method: Three key components: (1) Semantic File System representing agent state as Markdown documents aligned with pre-training corpora, (2) Textual Strategy Evolution accumulating task-solving knowledge without parameter updates, and (3) Closed-Loop Watcher monitoring reasoning trajectories to reduce hallucinations.

Result: Fat-Cat consistently improves agent performance on reasoning, retrieval, and coding benchmarks. It enables the Kimi-k2 model to outperform proprietary GPT-4o baseline on HotPotQA. Replacing document-based state with JSON leads to performance drop, validating document-driven state modeling.

Conclusion: Document-driven state representation (Markdown) is superior to rigid syntax (JSON) for LLM-based agents, improving signal-to-noise ratio in state management and enabling better semantic reasoning and performance.

Abstract: The effectiveness of LLM-based agents is often limited not by model capacity alone, but by how efficiently contextual information is utilized at runtime. Existing agent frameworks rely on rigid, syntax-heavy state representations such as nested JSON, which require models to devote a substantial portion of their limited attention to syntactic processing rather than semantic reasoning. In this paper, we propose Fat-Cat, a document-driven agent architecture that improves the signal-to-noise ratio of state management. By integrating three key components: (1) a Semantic File System that represents agent state as Markdown documents aligned with common pre-training corpora, (2) a Textual Strategy Evolution module that accumulates task-solving knowledge without parameter updates, and (3) a Closed-Loop Watcher that monitors reasoning trajectories to reduce hallucinations. Extensive reasoning, retrieval, and coding benchmarks, Fat-Cat consistently improves agent performance. It enables the Kimi-k2 model to outperform the proprietary GPT-4o baseline on HotPotQA. Replacing the document-based state with JSON leads to performance drop, while empirically validating the critical necessity of document-driven state modeling over rigid syntax. The code is available at https://github.com/answeryt/Fat-Cat.

[1330] Unsupervised Physics-Informed Operator Learning through Multi-Stage Curriculum Training

Paolo Marcandelli, Natansh Mathur, Stefano Markidis, Martina Siena, Stefano Mariani

Main category: cs.LG

TL;DR: PhIS-FNO: A multi-stage physics-informed training strategy with spline Fourier neural operator that achieves supervised-level accuracy using only boundary data through progressive optimization stages.

Details

Motivation: Current neural operators require supervised data, while physics-informed neural networks suffer from unstable convergence and limited generalization. Need a method that combines the advantages of both approaches.

Method: Multi-stage physics-informed training with progressive enforcement of boundary conditions then interior residuals, optimizer re-initialization at each stage, and Physics-Informed Spline Fourier Neural Operator (PhIS-FNO) combining Fourier layers with Hermite spline kernels.

Result: PhIS-FNO achieves accuracy comparable to supervised learning across canonical benchmarks using only boundary information, establishing staged spline-based optimization as robust paradigm.

Conclusion: The proposed multi-stage physics-informed training strategy with PhIS-FNO successfully addresses convergence stability issues and enables accurate operator learning with minimal supervision.

Abstract: Solving partial differential equations remains a central challenge in scientific machine learning. Neural operators offer a promising route by learning mappings between function spaces and enabling resolution-independent inference, yet they typically require supervised data. Physics-informed neural networks address this limitation through unsupervised training with physical constraints but often suffer from unstable convergence and limited generalization capability. To overcome these issues, we introduce a multi-stage physics-informed training strategy that achieves convergence by progressively enforcing boundary conditions in the loss landscape and subsequently incorporating interior residuals. At each stage the optimizer is re-initialized, acting as a continuation mechanism that restores stability and prevents gradient stagnation. We further propose the Physics-Informed Spline Fourier Neural Operator (PhIS-FNO), combining Fourier layers with Hermite spline kernels for smooth residual evaluation. Across canonical benchmarks, PhIS-FNO attains a level of accuracy comparable to that of supervised learning, using labeled information only along a narrow boundary region, establishing staged, spline-based optimization as a robust paradigm for physics-informed operator learning.

[1331] Scientific Theory of a Black-Box: A Life Cycle-Scale XAI Framework Based on Constructive Empiricism

Sebastian Müller, Vanessa Toborek, Eike Stadtländer, Tamás Horváth, Brendan Balcerak Jackson, Christian Bauckhage

Main category: cs.LG

TL;DR: Introduces Scientific Theory of a Black Box (SToBB) - a persistent, auditable framework for consolidating explanatory information about black-box models throughout their lifecycle, with empirical adequacy, adaptability, and auditability requirements.

Details

Motivation: Current XAI algorithms provide isolated explanations but lack a principled way to consolidate explanatory information into a persistent, auditable artifact that accompanies black-box models throughout their lifecycle.

Method: Proposes SToBB framework grounded in Constructive Empiricism with three obligations: empirical adequacy, adaptability via update commitments, and auditability. Introduces Constructive Box Theoriser (CoBoT) algorithm for online construction/maintenance of rule-based surrogates as observations accumulate.

Result: Provides a proof-of-concept instantiation of SToBB for a neural-network classifier on tabular data, demonstrating the framework’s feasibility for creating life cycle-scale, inspectable reference points for consistent analysis.

Conclusion: SToBBs position themselves as life cycle-scale, inspectable reference points that support consistent, reusable analyses and systematic external scrutiny of black-box models.

Abstract: Explainable AI (XAI) offers a growing number of algorithms that aim to answer specific questions about black-box models. What is missing is a principled way to consolidate explanatory information about a fixed black-box model into a persistent, auditable artefact, that accompanies the black-box throughout its life cycle. We address this gap by introducing the notion of a scientific theory of a black (SToBB). Grounded in Constructive Empiricism, a SToBB fulfils three obligations: (i) empirical adequacy with respect to all available observations of black-box behaviour, (ii) adaptability via explicit update commitments that restore adequacy when new observations arrive, and (iii) auditability through transparent documentation of assumptions, construction choices, and update behaviour. We operationalise these obligations as a general framework that specifies an extensible observation base, a traceable hypothesis class, algorithmic components for construction and revision, and documentation sufficient for third-party assessment. Explanations for concrete stakeholder needs are then obtained by querying the maintained record through interfaces, rather than by producing isolated method outputs. As a proof of concept, we instantiate a complete SToBB for a neural-network classifier on a tabular task and introduce the Constructive Box Theoriser (CoBoT) algorithm, an online procedure that constructs and maintains an empirically adequate rule-based surrogate as observations accumulate. Together, these contributions position SToBBs as a life cycle-scale, inspectable point of reference that supports consistent, reusable analyses and systematic external scrutiny.

[1332] Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts

Guangyi Zhang, Yunlong Cai, Guanding Yu, Osvaldo Simeone

Main category: cs.LG

TL;DR: PPRM is a semi-supervised risk monitoring method that uses prediction-powered inference to detect harmful model performance shifts in dynamic environments with limited labeled data.

Details

Motivation: Monitoring model performance in dynamic environments is challenging due to limited labeled data availability, requiring methods that can detect harmful performance shifts with statistical guarantees while minimizing labeling costs.

Method: PPRM combines synthetic labels from model predictions with a small set of true labels to construct anytime-valid lower bounds on running risk. It detects harmful shifts by comparing these bounds with an upper bound on nominal risk, using threshold-based detection with finite-sample guarantees.

Result: The method demonstrates effectiveness across image classification, large language model monitoring, and telecommunications tasks, showing it can reliably detect performance shifts with assumption-free guarantees on false alarm probability.

Conclusion: PPRM provides a practical solution for model risk monitoring in dynamic environments with limited labeled data, offering statistical guarantees while reducing labeling requirements through semi-supervised prediction-powered inference.

Abstract: We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees in the probability of false alarm. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.

[1333] Backpropagation as Physical Relaxation: Exact Gradients in Finite Time

Antonino Emanuele Scurria

Main category: cs.LG

TL;DR: Backpropagation emerges exactly as finite-time relaxation of a physical dynamical system, with exact gradient computation in analog substrates.

Details

Motivation: To establish backpropagation as a physical dynamical process rather than just symbolic computation, enabling exact gradient computation in analog/neuromorphic hardware where continuous dynamics are native.

Method: Formulate feedforward inference as continuous-time process, apply Lagrangian theory of non-conservative systems, derive global energy functional on doubled state space (activations + sensitivities), use saddle-point dynamics, and prove Euler discretization recovers standard backprop exactly in 2L steps.

Result: Proves that unit-step Euler discretization recovers standard backpropagation exactly in 2L steps for L-layer network, with no approximations, unlike prior energy-based methods requiring symmetric weights or asymptotic convergence.

Conclusion: Backpropagation is the digitally optimized shadow of continuous physical relaxation, providing rigorous foundation for exact gradient computation in analog/neuromorphic substrates.

Abstract: Backpropagation, the foundational algorithm for training neural networks, is typically understood as a symbolic computation that recursively applies the chain rule. We show it emerges exactly as the finite-time relaxation of a physical dynamical system. By formulating feedforward inference as a continuous-time process and applying Lagrangian theory of non-conservative systems to handle asymmetric interactions, we derive a global energy functional on a doubled state space encoding both activations and sensitivities. The saddle-point dynamics of this energy perform inference and credit assignment simultaneously through local interactions. We term this framework ‘‘Dyadic Backpropagation’’. Crucially, we prove that unit-step Euler discretization, the natural timescale of layer transitions, recovers standard backpropagation exactly in precisely 2L steps for an L-layer network, with no approximations. Unlike prior energy-based methods requiring symmetric weights, asymptotic convergence, or vanishing perturbations, our framework guarantees exact gradients in finite time. This establishes backpropagation as the digitally optimized shadow of a continuous physical relaxation, providing a rigorous foundation for exact gradient computation in analog and neuromorphic substrates where continuous dynamics are native.

[1334] Interpretability in Deep Time Series Models Demands Semantic Alignment

Giovanni De Felice, Riccardo D’Elia, Alberto Termine, Pietro Barbiero, Giuseppe Marra, Silvia Santini

Main category: cs.LG

TL;DR: The paper proposes a new interpretability framework for deep time series models focused on semantic alignment - making predictions in terms of meaningful variables to users with temporal consistency constraints.

Details

Motivation: Current interpretability approaches for deep time series models focus on explaining internal computations without ensuring alignment with human reasoning about the studied phenomenon. The authors argue interpretability should pursue semantic alignment with user-understandable variables and constraints.

Method: The paper formalizes semantic alignment requirements for time series models, introduces the novel constraint that alignment must be preserved under temporal evolution, outlines a blueprint for semantically aligned models, identifies trust-supporting properties, and discusses design implications.

Result: The paper provides a theoretical framework and design principles for semantically aligned deep time series models, introducing the key requirement of temporal consistency in interpretability that has no analog in static settings.

Conclusion: Interpretability in deep time series models should focus on semantic alignment with human reasoning, requiring predictions expressed in meaningful variables with temporal consistency, leading to more trustworthy and usable models.

Abstract: Deep time series models continue to improve predictive performance, yet their deployment remains limited by their black-box nature. In response, existing interpretability approaches in the field keep focusing on explaining the internal model computations, without addressing whether they align or not with how a human would reason about the studied phenomenon. Instead, we state interpretability in deep time series models should pursue semantic alignment: predictions should be expressed in terms of variables that are meaningful to the end user, mediated by spatial and temporal mechanisms that admit user-dependent constraints. In this paper, we formalize this requirement and require that, once established, semantic alignment must be preserved under temporal evolution: a constraint with no analog in static settings. Provided with this definition, we outline a blueprint for semantically aligned deep time series models, identify properties that support trust, and discuss implications for model design.

[1335] An Optimization Method for Autoregressive Time Series Forecasting

Zheng Li, Jerry Cheng, Huanying Gu

Main category: cs.LG

TL;DR: A novel training method for time-series forecasting that enforces autoregressive prediction error monotonicity and enables flexible long-term forecasting through short-term predictions concatenation.

Details

Motivation: Current transformer-based time-series forecasting models achieve long-term forecasting mainly by scaling up model size rather than genuine autoregressive rollout, and traditional training ignores temporal causality principles.

Method: Proposes a training method that enforces two key properties: 1) AR prediction errors should increase with forecasting horizon (violations penalized), and 2) enables models to concatenate short-term AR predictions for flexible long-term forecasts.

Result: Establishes new state-of-the-art across multiple benchmarks with >10% MSE reduction compared to iTransformer and other baselines, enables short-horizon models to perform reliable long-term predictions at horizons over 7.5 times longer.

Conclusion: The proposed training method effectively addresses temporal causality issues in time-series forecasting and enables more efficient long-term predictions without excessive model scaling.

Abstract: Current time-series forecasting models are primarily based on transformer-style neural networks. These models achieve long-term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, the traditional training process for time-series forecasting models ignores temporal causality. In this paper, we propose a novel training method for time-series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon. Any violation of this principle is considered random guessing and is explicitly penalized in the loss function, and (2) the method enables models to concatenate short-term AR predictions for forming flexible long-term forecasts. Empirical results demonstrate that our method establishes a new state-of-the-art across multiple benchmarks, achieving an MSE reduction of more than 10% compared to iTransformer and other recent strong baselines. Furthermore, it enables short-horizon forecasting models to perform reliable long-term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt

[1336] Variational Entropic Optimal Transport

Roman Dyachenko, Nikita Gushchin, Kirill Sokolov, Petr Mokrov, Evgeny Burnaev, Alexander Korotin

Main category: cs.LG

TL;DR: VarEOT introduces a variational reformulation of entropic optimal transport that avoids MCMC simulations during training by using an auxiliary normalizer, enabling efficient differentiable optimization for domain translation tasks.

Details

Motivation: Existing EOT methods for domain translation face computational inefficiency due to intractable log-partition terms, requiring either restrictive transport families (Gaussian-mixtures) or simulation-based training procedures.

Method: Proposes Variational Entropic Optimal Transport (VarEOT) with an exact variational reformulation of log-partition as tractable minimization over auxiliary positive normalizer, enabling differentiable learning with stochastic gradients without MCMC simulations.

Result: Experiments on synthetic data and unpaired image-to-image translation show competitive or improved translation quality compared to existing methods using the same weak dual EOT objective.

Conclusion: VarEOT provides an efficient optimization principle for EOT-based domain translation with theoretical guarantees and practical benefits over existing approaches.

Abstract: Entropic optimal transport (EOT) in continuous spaces with quadratic cost is a classical tool for solving the domain translation problem. In practice, recent approaches optimize a weak dual EOT objective depending on a single potential, but doing so is computationally not efficient due to the intractable log-partition term. Existing methods typically resolve this obstacle in one of two ways: by significantly restricting the transport family to obtain closed-form normalization (via Gaussian-mixture parameterizations), or by using general neural parameterizations that require simulation-based training procedures. We propose Variational Entropic Optimal Transport (VarEOT), based on an exact variational reformulation of the log-partition $\log \mathbb{E}[\exp(\cdot)]$ as a tractable minimization over an auxiliary positive normalizer. This yields a differentiable learning objective optimized with stochastic gradients and avoids the necessity of MCMC simulations during the training. We provide theoretical guarantees, including finite-sample generalization bounds and approximation results under universal function approximation. Experiments on synthetic data and unpaired image-to-image translation demonstrate competitive or improved translation quality, while comparisons within the solvers that use the same weak dual EOT objective support the benefit of the proposed optimization principle.

[1337] Decoupling Generalizability and Membership Privacy Risks in Neural Networks

Xingli Fang, Jung-Eun Kim

Main category: cs.LG

TL;DR: PPTP identifies separate regions for generalization and privacy risks in neural networks, enabling targeted privacy protection with minimal utility loss.

Details

Motivation: There's a trade-off between privacy preservation and model utility in deep learning, with current approaches sacrificing too much generalization for privacy. The authors aim to decouple these two aspects to maximize privacy gain while minimizing utility loss.

Method: The authors identify that generalization and privacy risks exist in different regions of neural network architectures. They propose Privacy-Preserving Training Principle (PPTP) which protects vulnerable model components from privacy risks while minimizing impact on generalizability.

Result: Extensive evaluations show PPTP significantly better maintains model generalizability while enhancing privacy preservation compared to existing approaches.

Conclusion: By identifying separate regions for generalization and privacy risks, PPTP enables more effective privacy protection with minimal utility degradation, addressing a key trade-off in privacy-preserving deep learning.

Abstract: A deep learning model usually has to sacrifice some utilities when it acquires some other abilities or characteristics. Privacy preservation has such trade-off relationships with utilities. The loss disparity between various defense approaches implies the potential to decouple generalizability and privacy risks to maximize privacy gain. In this paper, we identify that the model’s generalization and privacy risks exist in different regions in deep neural network architectures. Based on the observations that we investigate, we propose Privacy-Preserving Training Principle (PPTP) to protect model components from privacy risks while minimizing the loss in generalizability. Through extensive evaluations, our approach shows significantly better maintenance in model generalizability while enhancing privacy preservation.

[1338] Alignment-Aware Model Adaptation via Feedback-Guided Optimization

Gaurav Bhatt, Aditya Chinchure, Jiawei Zhou, Leonid Sigal

Main category: cs.LG

TL;DR: Alignment-aware fine-tuning framework that integrates external alignment feedback through policy-gradient regularization with adaptive gating to balance supervised and alignment-driven gradients, enabling alignment-preserving model adaptation.

Details

Motivation: Standard fine-tuning approaches optimize task objectives in isolation without accounting for critical alignment objectives like safety and hallucination avoidance, which can degrade alignment and fail to correct pre-existing misaligned behavior.

Method: Proposes an alignment-aware fine-tuning framework that integrates feedback from external alignment signals through policy-gradient-based regularization. Features an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients per sample, prioritizing uncertain/misaligned cases while allowing well-aligned examples to follow standard supervised updates. Also learns abstention behavior for fully misaligned inputs.

Result: Experiments on general and domain-specific instruction-tuning benchmarks show consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses demonstrate robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations.

Conclusion: Adaptively gated alignment optimization is an effective approach for alignment-preserving and alignment-recovering model adaptation, addressing the critical gap in standard fine-tuning methods that neglect secondary alignment objectives.

Abstract: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.

[1339] Learning Markov Decision Processes under Fully Bandit Feedback

Zhengjia Zhuo, Anupam Gupta, Viswanath Nagarajan

Main category: cs.LG

TL;DR: First efficient bandit learning algorithm for episodic MDPs with fully bandit feedback (only aggregate reward observed) achieving near-optimal regret bounds.

Details

Motivation: Standard RL assumes full state-action observation with per-step rewards, but this is unrealistic. Recent work explores restricted feedback like trajectory feedback. This paper addresses an even more restrictive "fully bandit" setting where only aggregate reward is observed.

Method: Develops efficient bandit learning algorithm for episodic MDPs with fully bandit feedback. The approach handles the challenge of not observing visited state-action pairs by designing novel techniques to work with only aggregate reward information.

Result: Achieves $\widetilde{O}(\sqrt{T})$ regret with exponential dependence on horizon length (shown to be necessary). Also obtains improved nearly-tight regret bounds for “ordered” MDPs. Empirical evaluation shows comparable performance to state-of-art learning algorithm (UCB-VI) with full state-action feedback.

Conclusion: First efficient algorithm for episodic MDPs with fully bandit feedback, demonstrating feasibility of learning with highly restricted feedback. The exponential horizon dependence is unavoidable, and the approach shows practical promise despite theoretical limitations.

Abstract: A standard assumption in Reinforcement Learning is that the agent observes every visited state-action pair in the associated Markov Decision Process (MDP), along with the per-step rewards. Strong theoretical results are known in this setting, achieving nearly-tight $Θ(\sqrt{T})$-regret bounds. However, such detailed feedback can be unrealistic, and recent research has investigated more restricted settings such as trajectory feedback, where the agent observes all the visited state-action pairs, but only a single \emph{aggregate} reward. In this paper, we consider a far more restrictive fully bandit'' feedback model for episodic MDPs, where the agent does not even observe the visited state-action pairs -- it only learns the aggregate reward. We provide the first efficient bandit learning algorithm for episodic MDPs with $\widetilde{O}(\sqrt{T})$ regret. Our regret has an exponential dependence on the horizon length $\H$, which we show is necessary. We also obtain improved nearly-tight regret bounds for ordered’’ MDPs; these can be used to model classical stochastic optimization problems such as $k$-item prophet inequality and sequential posted pricing. Finally, we evaluate the empirical performance of our algorithm for the setting of $k$-item prophet inequalities; despite the highly restricted feedback, our algorithm’s performance is comparable to that of a state-of-art learning algorithm (UCB-VI) with detailed state-action feedback.

[1340] Unlocking the Duality between Flow and Field Matching

Daniil Shlenskii, Alexander Varlamov, Nazar Buzun, Alexander Korotin

Main category: cs.LG

TL;DR: CFM and IFM are shown to be equivalent for forward-only IFM via a bijection, but general IFM is more expressive, enabling cross-framework benefits.

Details

Motivation: To understand the relationship between two generative modeling frameworks: Conditional Flow Matching (CFM) and Interaction Field Matching (IFM), and determine if they are fundamentally different or equivalent descriptions of the same dynamics.

Method: Theoretical analysis establishing a bijection between CFM and forward-only IFM, showing their equivalence for this subclass. Demonstrates that general IFM is strictly more expressive through examples like EFM that cannot be realized in standard CFM.

Result: CFM and forward-only IFM are equivalent via a constructed bijection. General IFM is more expressive than CFM, including frameworks like EFM that CFM cannot represent. The duality enables cross-framework benefits.

Conclusion: The paper establishes a duality between CFM and IFM frameworks, showing equivalence for forward-only IFM but greater expressiveness for general IFM, enabling mutual benefits between the two approaches.

Abstract: Conditional Flow Matching (CFM) unifies conventional generative paradigms such as diffusion models and flow matching. Interaction Field Matching (IFM) is a newer framework that generalizes Electrostatic Field Matching (EFM) rooted in Poisson Flow Generative Models (PFGM). While both frameworks define generative dynamics, they start from different objects: CFM specifies a conditional probability path in data space, whereas IFM specifies a physics-inspired interaction field in an augmented data space. This raises a basic question: are CFM and IFM genuinely different, or are they two descriptions of the same underlying dynamics? We show that they coincide for a natural subclass of IFM that we call forward-only IFM. Specifically, we construct a bijection between CFM and forward-only IFM. We further show that general IFM is strictly more expressive: it includes EFM and other interaction fields that cannot be realized within the standard CFM formulation. Finally, we highlight how this duality can benefit both frameworks: it provides a probabilistic interpretation of forward-only IFM and yields novel, IFM-driven techniques for CFM.

[1341] HopFormer: Sparse Graph Transformers with Explicit Receptive Field Control

Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Sungheon Jeong, Mohsen Imani

Main category: cs.LG

TL;DR: HopFormer is a graph Transformer that uses head-specific n-hop masked sparse attention instead of positional encodings or dense global attention, achieving competitive performance with linear computational scaling.

Details

Motivation: Current graph Transformers rely on explicit positional/structural encodings and dense global attention to incorporate graph topology, which may be unnecessary and computationally expensive.

Method: Introduces HopFormer with head-specific n-hop masked sparse attention that injects structure through attention masks without positional encodings or architectural modifications, enabling explicit control over receptive fields and linear computational scaling with mask sparsity.

Result: Achieves competitive or superior performance on node-level and graph-level benchmarks, showing dense global attention is often unnecessary - localized attention works better on small-world graphs while global attention offers diminishing returns on weaker small-world graphs.

Conclusion: Challenges prevailing assumptions in graph Transformer design and highlights sparsity-controlled attention as a principled and efficient alternative to traditional approaches.

Abstract: Graph Transformers typically rely on explicit positional or structural encodings and dense global attention to incorporate graph topology. In this work, we show that neither is essential. We introduce HopFormer, a graph Transformer that injects structure exclusively through head-specific n-hop masked sparse attention, without the use of positional encodings or architectural modifications. This design provides explicit and interpretable control over receptive fields while enabling genuinely sparse attention whose computational cost scales linearly with mask sparsity. Through extensive experiments on both node-level and graph-level benchmarks, we demonstrate that our approach achieves competitive or superior performance across diverse graph structures. Our results further reveal that dense global attention is often unnecessary: on graphs with strong small-world properties, localized attention yields more stable and consistently high performance, while on graphs with weaker small-world effects, global attention offers diminishing returns. Together, these findings challenge prevailing assumptions in graph Transformer design and highlight sparsity-controlled attention as a principled and efficient alternative.

[1342] MoLF: Mixture-of-Latent-Flow for Pan-Cancer Spatial Gene Expression Prediction from Histology

Susu Hu, Stefanie Speidel

Main category: cs.LG

TL;DR: MoLF is a generative model for pan-cancer histogenomic prediction that uses conditional Flow Matching with Mixture-of-Experts architecture to handle diverse tissue patterns across cancer types.

Details

Motivation: Current spatial transcriptomics inference methods are limited to single-tissue models, failing to leverage shared biological principles across cancer types and struggling with data-scarce scenarios. Pan-cancer training introduces heterogeneity that challenges monolithic architectures.

Method: MoLF uses conditional Flow Matching to map noise to gene latent manifold, parameterized by a Mixture-of-Experts velocity field. This dynamically routes inputs to specialized sub-networks to decouple optimization of diverse tissue patterns.

Result: MoLF establishes new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. It also exhibits zero-shot generalization to cross-species data.

Conclusion: MoLF effectively handles pan-cancer heterogeneity through its MoE architecture and captures fundamental, conserved histo-molecular mechanisms that generalize across species.

Abstract: Inferring spatial transcriptomics (ST) from histology enables scalable histogenomic profiling, yet current methods are largely restricted to single-tissue models. This fragmentation fails to leverage biological principles shared across cancer types and hinders application to data-scarce scenarios. While pan-cancer training offers a solution, the resulting heterogeneity challenges monolithic architectures. To bridge this gap, we introduce MoLF (Mixture-of-Latent-Flow), a generative model for pan-cancer histogenomic prediction. MoLF leverages a conditional Flow Matching objective to map noise to the gene latent manifold, parameterized by a Mixture-of-Experts (MoE) velocity field. By dynamically routing inputs to specialized sub-networks, this architecture effectively decouples the optimization of diverse tissue patterns. Our experiments demonstrate that MoLF establishes a new state-of-the-art, consistently outperforming both specialized and foundation model baselines on pan-cancer benchmarks. Furthermore, MoLF exhibits zero-shot generalization to cross-species data, suggesting it captures fundamental, conserved histo-molecular mechanisms.

[1343] Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

Owen Shen, Patrick Jaillet

Main category: cs.LG

TL;DR: RL for revenue management with delayed feedback using choice models to impute delayed components, with convergence guarantees and empirical evaluation showing benefits under parameter shifts but degradation under model misspecification.

Details

Motivation: Revenue management systems face delayed feedback challenges where customer cancellations and modifications occur days after booking, making real-time learning difficult. Traditional RL struggles with this delayed reward structure.

Method: Proposes choice-model-assisted RL: uses a calibrated discrete choice model as a fixed partial world model to impute the delayed component of learning targets at decision time. Combines tabular Q-learning with model-imputed targets.

Result: Theoretical convergence to O(ε/(1-γ)) neighborhood of optimal Q-function with O(t^{-1/2}) sampling term. Empirical results from hotel booking simulator (61,619 bookings, 1,088 runs) show: no difference from baseline in stationary settings; positive effects under parameter shifts (up to 12.4% gains); degradation under model misspecification (1.4-2.6% lower revenue).

Conclusion: Partial behavioral models can improve robustness under parameter shifts but introduce harmful bias under structural misspecification. The approach characterizes when such models are beneficial versus detrimental in delayed feedback RL settings.

Abstract: We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose \emph{choice-model-assisted RL}: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-γ))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61{,}619 hotel bookings (1{,}088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm–Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4–2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.

[1344] EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models

Shaima Ahmad Freja, Ferhat Ozgur Catak, Betul Yurdem, Chunming Rong

Main category: cs.LG

TL;DR: EvalQReason is a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without human annotation, using two algorithms (CSD and SFC) to measure local coherence and global alignment in reasoning processes.

Details

Motivation: LLMs are increasingly deployed in critical applications requiring reliable reasoning, but existing evaluation methods focus only on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. There's a need for systematic evaluation of internal reasoning processes.

Method: Introduces EvalQReason framework with two complementary algorithms: Consecutive Step Divergence (CSD) measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC) assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics from probability distributions.

Result: Experiments with 7B-parameter models on mathematical and medical datasets show CSD-based features achieve strong predictive performance for correctness classification (F1=0.78, ROC-AUC=0.82 with classical ML; F1=0.88, ROC-AUC=0.97 with sequential neural models). CSD consistently outperforms SFC, and reasoning dynamics are domain-specific: mathematical reasoning shows clear discrimination patterns, while medical reasoning shows minimal discriminative signals.

Conclusion: EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment. The framework reveals fundamental differences in how LLMs process different reasoning types.

Abstract: Large Language Models (LLMs) are increasingly deployed in critical applications requiring reliable reasoning, yet their internal reasoning processes remain difficult to evaluate systematically. Existing methods focus on final-answer correctness, providing limited insight into how reasoning unfolds across intermediate steps. We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis without requiring human annotation. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers. Each algorithm employs five statistical metrics to capture reasoning dynamics. Experiments across mathematical and medical datasets with open-source 7B-parameter models demonstrate that CSD-based features achieve strong predictive performance for correctness classification, with classical machine learning models reaching F1=0.78 and ROC-AUC=0.82, and sequential neural models substantially improving performance (F1=0.88, ROC-AUC=0.97). CSD consistently outperforms SFC, and sequential architectures outperform classical machine learning approaches. Critically, reasoning dynamics prove domain-specific: mathematical reasoning exhibits clear divergence-based discrimination patterns between correct and incorrect solutions, while medical reasoning shows minimal discriminative signals, revealing fundamental differences in how LLMs process different reasoning types. EvalQReason enables scalable, process-aware evaluation of reasoning reliability, establishing probability-based divergence analysis as a principled approach for trustworthy AI deployment.

[1345] ReasonCACHE: Teaching LLMs To Reason Without Weight Updates

Sharut Gupta, Phillip Isola, Stefanie Jegelka, David Lopez-Paz, Kartik Ahuja, Mark Ibrahim, Mohammad Pezeshki

Main category: cs.LG

TL;DR: ReasonCACHE enables LLMs to learn reasoning skills through prefix tuning without weight updates, outperforming standard in-context learning and matching in-weight learning approaches while being more efficient.

Details

Motivation: Standard in-context learning (ICL) for reasoning tasks has limitations: attention costs grow quadratically with demonstrations, performance degrades with longer contexts, and it remains shallow learning. In-weight learning (IWL) requires parameter updates. There's a need for a middle ground that enables reasoning without context window overload or weight updates.

Method: ReasonCACHE uses prefix tuning to distill demonstrations into a fixed key-value cache. This approach injects reasoning capabilities directly into the attention mechanism without modifying model weights, bypassing the input rank constraints of low-rank weight updates.

Result: Across challenging reasoning benchmarks including GPQA-Diamond, ReasonCACHE outperforms standard ICL and matches or surpasses IWL approaches. It achieves this while being more efficient in data usage, inference cost, and trainable parameters.

Conclusion: ReasonCACHE provides a scalable middle path between in-context and in-weight learning, enabling LLMs to learn reasoning skills beyond context window limitations without parameter modifications.

Abstract: Can Large language models (LLMs) learn to reason without any weight update and only through in-context learning (ICL)? ICL is strikingly sample-efficient, often learning from only a handful of demonstrations, but complex reasoning tasks typically demand many training examples to learn from. However, naively scaling ICL by adding more demonstrations breaks down at this scale: attention costs grow quadratically, performance saturates or degrades with longer contexts, and the approach remains a shallow form of learning. Due to these limitations, practitioners predominantly rely on in-weight learning (IWL) to induce reasoning. In this work, we show that by using Prefix Tuning, LLMs can learn to reason without overloading the context window and without any weight updates. We introduce $\textbf{ReasonCACHE}$, an instantiation of this mechanism that distills demonstrations into a fixed key-value cache. Empirically, across challenging reasoning benchmarks, including GPQA-Diamond, ReasonCACHE outperforms standard ICL and matches or surpasses IWL approaches. Further, it achieves this all while being more efficient across three key axes: data, inference cost, and trainable parameters. We also theoretically prove that ReasonCACHE can be strictly more expressive than low-rank weight update since the latter ties expressivity to input rank, whereas ReasonCACHE bypasses this constraint by directly injecting key-values into the attention mechanism. Together, our findings identify ReasonCACHE as a middle path between in-context and in-weight learning, providing a scalable algorithm for learning reasoning skills beyond the context window without modifying parameters. Our project page: https://reasoncache.github.io/

[1346] Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning

Ethan Mendes, Jungsoo Park, Alan Ritter

Main category: cs.LG

TL;DR: DAIL improves LLM reasoning by transforming expert human solutions into detailed reasoning traces and using contrastive learning, achieving significant gains with minimal expert data.

Details

Motivation: Current methods for improving LLM reasoning rely on either model-generated solutions (which may be incorrect) or stronger models (which may not exist for difficult problems). Expert human solutions exist but are out-of-distribution for LLMs due to implicit reasoning gaps designed for humans.

Method: Two-step approach: 1) Transform expert solutions into detailed, in-distribution reasoning traces that bridge implicit gaps, 2) Apply contrastive objective to focus learning on expert insights and methodologies rather than just imitating surface patterns.

Result: Achieves 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models with fewer than 1000 expert solutions, improves reasoning efficiency by 2x to 4x, and enables out-of-domain generalization.

Conclusion: DAIL provides a sample-efficient method to leverage high-quality expert solutions for improving LLM reasoning capabilities, addressing distributional gaps and enabling effective learning from limited expert data.

Abstract: Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model’s ability to sample a correct solution to be reinforced or on the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage high-quality expert human solutions, yet naive imitation of this data fails because it is fundamentally out of distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2x to 4x, and enable out-of-domain generalization.

[1347] Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

Addison Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan

Main category: cs.LG

TL;DR: Theoretical analysis of mirror descent algorithms for softmax attention mechanisms, showing convergence to generalized hard-margin SVM solutions with ℓ_p-norm objectives.

Details

Motivation: While gradient descent dynamics in attention models are well-studied, less is known about more general optimization algorithms like mirror descent. This paper aims to understand the convergence properties and implicit biases of mirror descent algorithms specifically tailored for softmax attention mechanisms.

Method: The authors investigate a family of mirror descent algorithms with potential functions chosen as the p-th power of the ℓ_p-norm, applied to softmax attention models. They analyze convergence properties theoretically and establish conditions for joint optimization of key-query matrices and decoders.

Result: Theoretical results show that mirror descent algorithms converge in direction to generalized hard-margin SVM with ℓ_p-norm objectives, with convergence rates comparable to gradient descent despite the nonlinear, nonconvex nature of the problem. Numerical experiments on real data demonstrate improved generalization over standard GD and better token selection.

Conclusion: Mirror descent algorithms offer theoretical advantages for optimizing attention mechanisms, converging to interpretable SVM-like solutions while improving practical performance in generalization and token selection tasks.

Abstract: Attention mechanisms have revolutionized several domains of artificial intelligence, such as natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. While recent work has characterized the optimization dynamics of gradient descent (GD) in attention-based models and the structural properties of its preferred solutions, less is known about more general optimization algorithms such as mirror descent (MD). In this paper, we investigate the convergence properties and implicit biases of a family of MD algorithms tailored for softmax attention mechanisms, with the potential function chosen as the $p$-th power of the $\ell_p$-norm. Specifically, we show that these algorithms converge in direction to a generalized hard-margin SVM with an $\ell_p$-norm objective when applied to a classification problem using a softmax attention model. Notably, our theoretical results reveal that the convergence rate is comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions. Lastly, our numerical experiments on real data demonstrate that MD algorithms improve generalization over standard GD and excel in optimal token selection.

[1348] C-kNN-LSH: A Nearest-Neighbor Algorithm for Sequential Counterfactual Inference

Jing Wang, Jie Shen, Qiaomin Xie, Jeremy C Weiss

Main category: cs.LG

TL;DR: C-kNN-LSH: A nearest-neighbor framework using locality-sensitive hashing for sequential causal inference in high-dimensional, confounded longitudinal data, applied to Long COVID recovery analysis.

Details

Motivation: Estimating causal effects from longitudinal trajectories is crucial for understanding complex disease progression like comorbidities and Long COVID recovery, but existing methods struggle with high-dimensional, confounded situations and irregular sampling in clinical data.

Method: Uses locality-sensitive hashing to efficiently identify “clinical twins” with similar covariate histories, enabling local estimation of conditional treatment effects. Integrates neighborhood estimator with doubly-robust correction to handle irregular sampling and shifting recovery profiles.

Result: Theoretical analysis shows estimator is consistent and second-order robust to nuisance error. Evaluation on real-world Long COVID cohort with 13,511 participants demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to baselines.

Conclusion: C-kNN-LSH provides an effective framework for sequential causal inference in high-dimensional clinical settings, particularly valuable for understanding complex disease progression like Long COVID recovery.

Abstract: Estimating causal effects from longitudinal trajectories is central to understanding the progression of complex conditions and optimizing clinical decision-making, such as comorbidities and long COVID recovery. We introduce \emph{C-kNN–LSH}, a nearest-neighbor framework for sequential causal inference designed to handle such high-dimensional, confounded situations. By utilizing locality-sensitive hashing, we efficiently identify ``clinical twins’’ with similar covariate histories, enabling local estimation of conditional treatment effects across evolving disease states. To mitigate bias from irregular sampling and shifting patient recovery profiles, we integrate neighborhood estimator with a doubly-robust correction. Theoretical analysis guarantees our estimator is consistent and second-order robust to nuisance error. Evaluated on a real-world Long COVID cohort with 13,511 participants, \emph{C-kNN-LSH} demonstrates superior performance in capturing recovery heterogeneity and estimating policy values compared to existing baselines.

[1349] Poly-attention: a general scheme for higher-order self-attention

Sayak Chakrabarti, Toniann Pitassi, Josh Alman

Main category: cs.LG

TL;DR: The paper introduces poly-attention mechanisms as generalizations of self-attention that can incorporate higher-order tensor computations and arbitrary token relationships, with systematic analysis of computational complexity and representational strength.

Details

Motivation: Standard self-attention in Transformers fails at tasks requiring detection of correlated token triples or compositional operations where multiple input tokens need to be referenced. Existing higher-dimensional alternatives have superquadratic running times, creating a need for more efficient yet expressive attention mechanisms.

Method: Defines poly-attention mechanisms as a broad class generalizing self-attention to include arbitrary higher-order tensor computations and token relationship structures. Systematically studies computational complexity (exact and approximate computation) and representational strength through new algorithms and complexity-theoretic lower bounds.

Result: Develops a new attention mechanism computable exactly in quadratic time that can perform function composition for any fixed number of functions, while proving lower bounds showing faster algorithms for prior mechanisms are impossible. Establishes trade-offs between expressiveness and computational efficiency.

Conclusion: Poly-attention mechanisms provide a theoretical framework for understanding the limitations and capabilities of attention variants, with practical implications for designing more efficient yet expressive attention mechanisms for complex reasoning tasks.

Abstract: The self-attention mechanism, at the heart of the Transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention and Strassen attention, which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting trade-offs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.

[1350] SVIP: Towards Verifiable Inference of Open-source Large Language Models

Yifan Sun, Yuhang Li, Yue Zhang, Yuchen Jin, Huan Zhang

Main category: cs.LG

TL;DR: SVIP is a secret-based verifiable protocol for decentralized LLM inference that detects when providers substitute requested models with smaller ones by analyzing hidden representations.

Details

Motivation: As decentralized computing becomes popular for cost-effective LLM deployment, providers may stealthily substitute requested LLMs with smaller, cheaper models without user consent, compromising service quality while benefiting from cost savings.

Method: SVIP requires providers to return both generated text and processed hidden representations. A proxy task is trained on these representations to create a unique model identifier. A secret mechanism is integrated for enhanced security, enabling users to verify provider honesty through computationally efficient verification.

Result: Extensive experiments show SVIP achieves false negative rates below 5% and false positive rates below 3%, with verification requiring less than 0.01 seconds per prompt query. The protocol is accurate, generalizable, computationally efficient, and resistant to various attacks.

Conclusion: SVIP provides an effective, practical solution for verifiable LLM inference in decentralized settings, addressing the trust gap between users and computing providers without relying on strong cryptographic assumptions or being computationally expensive.

Abstract: The ever-increasing size of open-source Large Language Models (LLMs) renders local deployment impractical for individual users. Decentralized computing has emerged as a cost-effective solution, allowing individuals and small companies to perform LLM inference for users using surplus computational power. However, a computing provider may stealthily substitute the requested LLM with a smaller, less capable model without consent from users, thereby benefiting from cost savings. We introduce SVIP, a secret-based verifiable LLM inference protocol. Unlike existing solutions based on cryptographic or game-theoretic techniques, our method is computationally effective and does not rest on strong assumptions. Our protocol requires the computing provider to return both the generated text and processed hidden representations from LLMs. We then train a proxy task on these representations, effectively transforming them into a unique model identifier. With our protocol, users can reliably verify whether the computing provider is acting honestly. A carefully integrated secret mechanism further strengthens its security. We thoroughly analyze our protocol under multiple strong and adaptive adversarial scenarios. Our extensive experiments demonstrate that SVIP is accurate, generalizable, computationally efficient, and resistant to various attacks. Notably, SVIP achieves false negative rates below 5% and false positive rates below 3%, while requiring less than 0.01 seconds per prompt query for verification.

[1351] Self-Supervised Learning from Structural Invariance

Yipeng Zhang, Hafez Ghaemi, Jungyoon Lee, Shahab Bakhtiari, Eilif B. Muller, Laurent Charlin

Main category: cs.LG

TL;DR: AdaSSL introduces a latent variable approach to handle one-to-many mappings in self-supervised learning, addressing uncertainty when data pairs have multiple valid targets.

Details

Motivation: Existing SSL methods struggle with one-to-many mapping problems where each datum may map to multiple valid targets, particularly in scenarios like video frames where generative processes create natural variations.

Method: Introduces a latent variable to account for conditional uncertainty and derives a variational lower bound on mutual information between paired embeddings, resulting in a simple regularization term for standard SSL objectives.

Result: AdaSSL shows versatility across contrastive and distillation-based SSL objectives, demonstrating effectiveness in causal representation learning, fine-grained image understanding, and world modeling on videos.

Conclusion: The latent variable approach effectively addresses one-to-many mapping uncertainty in SSL, providing a flexible framework applicable to various SSL objectives and domains.

Abstract: Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.

[1352] Active Causal Experimentalist (ACE): Learning Intervention Strategies via Direct Preference Optimization

Patrick Cooper, Alvaro Velasquez

Main category: cs.LG

TL;DR: ACE learns adaptive experimental design policies for causal discovery using preference-based reinforcement learning rather than traditional isolated decision approaches.

Details

Motivation: Traditional causal discovery methods treat experimental decisions in isolation and cannot learn adaptive strategies from experience, while sequential decision-making in experimental design requires policies that can learn from accumulated knowledge.

Method: ACE (Active Causal Experimentalist) learns experimental design as a sequential policy using Direct Preference Optimization (DPO). Instead of learning from non-stationary reward magnitudes, it learns from pairwise intervention comparisons which remain meaningful throughout the learning process.

Result: ACE achieves 70-71% improvement over baselines at equal intervention budgets across synthetic benchmarks, physics simulations, and economic data. The learned policy autonomously discovers theoretically-grounded strategies like concentrated interventions on parent variables for collider mechanisms.

Conclusion: Preference-based learning can recover principled experimental strategies for causal discovery, complementing theoretical approaches with learned domain adaptation through sequential policy learning.

Abstract: Discovering causal relationships requires controlled experiments, but experimentalists face a sequential decision problem: each intervention reveals information that should inform what to try next. Traditional approaches such as random sampling, greedy information maximization, and round-robin coverage treat each decision in isolation, unable to learn adaptive strategies from experience. We propose Active Causal Experimentalist (ACE), which learns experimental design as a sequential policy. Our key insight is that while absolute information gains diminish as knowledge accumulates (making value-based RL unstable), relative comparisons between candidate interventions remain meaningful throughout. ACE exploits this via Direct Preference Optimization, learning from pairwise intervention comparisons rather than non-stationary reward magnitudes. Across synthetic benchmarks, physics simulations, and economic data, ACE achieves 70-71% improvement over baselines at equal intervention budgets (p < 0.001, Cohen’s d ~ 2). Notably, the learned policy autonomously discovers that collider mechanisms require concentrated interventions on parent variables, a theoretically-grounded strategy that emerges purely from experience. This suggests preference-based learning can recover principled experimental strategies, complementing theory with learned domain adaptation.

[1353] RaZeR: Pushing the Limits of NVFP4 Quantization with Redundant Zero Remapping

Yuzong Chen, Xilai Dai, Jake Hyun, Chi-Chih Chang, Wonsuk Jang, Yuheng Wu, Thierry Tambe, Jae-sun Seo, Mohamed S. Abdelfattah

Main category: cs.LG

TL;DR: RaZeR improves NVFP4 quantization by eliminating redundant bits in FP4 encoding and FP8 scaling factors, using them to create additional quantization values for better LLM accuracy at same memory footprint.

Details

Motivation: The paper identifies two types of redundancy in NVFP4 format: (1) unused quantization value in FP4's sign-magnitude representation (positive/negative zeros), and (2) unused sign bit in FP8 block scaling factor which is always positive. Additionally, LLM weights are found to be tolerant to lower-precision scaling factors.

Method: Proposes Redundant Zero Remapping (RaZeR) which leverages redundant bits from block scaling factor to adaptively remap the redundant FP4 zero to additional quantization values. Also designs efficient GPU kernels for RaZeR-quantized LLM inference and proposes novel hardware for native support.

Result: Extensive experiments show RaZeR reduces average perplexity loss by 34.6% under weight-only quantization and 31.2% under weight-activation quantization relative to native NVFP4. Demonstrates superior performance for 4-bit LLM quantization.

Conclusion: RaZeR pushes the limits of NVFP4 format for more accurate LLM quantization under the same memory footprint by eliminating redundancies and repurposing unused bits for additional quantization precision.

Abstract: The recently introduced NVFP4 format demonstrates remarkable performance and memory benefits for quantized large language model (LLM) inference. However, we observe two types of redundancy in NVFP4 encoding: (1) The FP4 element format naturally exposes an unused quantization value due to its sign-magnitude representation that contains both positive and negative zeros. (2) The FP8 block scaling factor has an unused sign bit because it is always positive. Additionally, we find that LLM weights are more tolerant to a lower-precision block scaling factor. Based on these observations, we propose Redundant Zero Remapping (RaZeR), an enhanced numerical format that pushes the limits of NVFP4 for more accurate LLM quantization under the same memory footprint. RaZeR leverages the redundant bits of the block scaling factor to adaptively remap the redundant FP4 zero to additional quantization values with improved accuracy. To demonstrate the practicality of RaZeR, we design efficient GPU kernels for RaZeR-quantized LLM inference and propose novel hardware to natively support this. Extensive experiments validate RaZeR’s superior performance for 4-bit LLM quantization. For example, relative to native NVFP4, RaZeR reduces the average perplexity loss by 34.6% and 31.2% under weight-only and weight-activation quantization, respectively. Code is available at: https://github.com/yc2367/NVFP4-RaZeR.

[1354] SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

Maksim Afanasyev, Illarion Iov

Main category: cs.LG

TL;DR: SLIME is a new reference-free alignment method that prevents unlearning and formatting collapse in LLM preference optimization by decoupling preference learning from generation quality with anchoring, stabilizing penalties, and dual-margin constraints.

Details

Motivation: Current direct preference optimization methods for LLM alignment suffer from objective mismatch - optimizing relative margins between chosen/rejected responses doesn't preserve absolute likelihood of chosen responses, leading to "unlearning" (degrading high-quality outputs) and "formatting collapse" (over-penalizing rejected sequences).

Method: SLIME uses a three-pronged objective: 1) anchoring term to maximize likelihood of preferred responses, 2) stabilizing penalty preventing rejected token probabilities from collapsing to zero, 3) dual-margin mechanism combining hard and soft constraints for precise boundary shaping.

Result: SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

Conclusion: SLIME provides an effective reference-free alignment approach that addresses key limitations of existing direct preference optimization methods by decoupling preference learning from generation quality.

Abstract: Direct preference optimization methods have emerged as a computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) for aligning Large Language Models (LLMs). Latest approaches have streamlined the alignment process by deriving implicit reward functions, yet they often suffer from a critical objective mismatch: optimizing the relative margin between chosen and rejected responses does not guarantee the preservation of the chosen response’s absolute likelihood. This can lead to unlearning'', where the model degrades the probability of high-quality outputs to satisfy margin constraints, and formatting collapse’’ caused by the over-penalization of rejected sequences. In this work, we introduce SLIME (Stabilized Likelihood Implicit Margin Enforcement), a reference-free alignment objective designed to decouple preference learning from generation quality. SLIME incorporates a three-pronged objective: (1) an anchoring term to maximize the likelihood of preferred responses; (2) a stabilizing penalty that prevents the probabilities of rejected tokens from collapsing to zero; and (3) a dual-margin mechanism that combines hard and soft constraints for precise boundary shaping. Our results demonstrate that SLIME achieves superior performance compared to state-of-the-art baselines while maintaining higher generation stability.

[1355] Transformers learn factored representations

Adam Shai, Loren Amdahl-Culleton, Casper L. Christensen, Henry R. Bigelow, Fernando E. Rosas, Alexander B. Boyd, Eric A. Alt, Kyle J. Ray, Paul M. Riechers

Main category: cs.LG

TL;DR: Transformers learn to represent the world as factored orthogonal subspaces rather than product spaces, preferring dimensional efficiency over predictive accuracy when factors are conditionally independent.

Details

Motivation: To understand how transformers internally represent complex world structures, specifically whether they use high-dimensional product spaces or low-dimensional factored representations in orthogonal subspaces, and to explain why transformers decompose the world into interpretable parts.

Method: Formalized two representational hypotheses (product space vs. factored orthogonal subspaces), derived geometric predictions for each, and tested on transformers trained on synthetic processes with known latent structure under varying conditional independence conditions.

Result: Transformers learn factored representations when factors are conditionally independent, and maintain this preference even when noise or hidden dependencies undermine conditional independence, showing an inductive bias toward factoring at the cost of predictive fidelity.

Conclusion: Transformers have an inductive bias to decompose the world into orthogonal factored representations, explaining why they learn interpretable low-dimensional structure even when trained on complex data.

Abstract: Transformers pretrained via next token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when factors are conditionally independent, but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on transformers trained on synthetic processes with known latent structure. Models learn factored representations when factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low dimensional structure may persist even in models trained on complex data.

[1356] Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

Main category: cs.LG

TL;DR: SAEs extract interpretable features from LLMs for safety tasks, achieving strong performance with cross-model transfer and generalization to multimodal tasks.

Details

Motivation: To develop transparent and controllable AI systems by systematically analyzing Sparse Autoencoders for interpretable feature extraction from LLMs in safety-critical classification tasks.

Method: Systematic analysis framework evaluating: (1) model-layer selection and scaling properties, (2) SAE architectural configurations (width and pooling strategies), (3) effect of binarizing continuous SAE activations.

Result: SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines, demonstrate cross-model transfer (Gemma 2 2B to 9B-IT), and generalize to cross-lingual toxicity detection and visual classification tasks.

Conclusion: SAEs establish new best practices for interpretability, enable scalable transparent LLM deployment, with pooling strategies and binarization offering efficient alternatives to traditional feature selection while maintaining/improving performance.

Abstract: Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.

[1357] An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence

Qizhen Zhang, Ankush Garg, Jakob Foerster, Niladri Chatterji, Kshitiz Malik, Mike Lewis

Main category: cs.LG

TL;DR: Systematic study shows noisy data causes LLM pretraining divergence, with probability depending on noise type, amount, and model scale, distinct from high learning rate failures.

Details

Motivation: Web-scale pretraining datasets contain inevitable noise, which practitioners speculate causes instabilities and loss divergence in LLM pretraining, but this phenomenon remains poorly understood and requires systematic investigation.

Method: Inject controlled synthetic uniformly random noise into otherwise clean datasets, analyze training dynamics across model sizes (480M to 5.2B parameters), compare noise-induced divergences with high learning rate failures.

Result: Noisy data indeed induces training loss divergence; divergence probability strongly depends on noise type, amount, and model scale; noise-induced divergences show distinct activation patterns from high learning rate failures; diagnostics differentiate these failure modes.

Conclusion: Provides first large-scale controlled characterization of how noisy data affects loss divergence in LLM pretraining, offering insights into training stability and failure modes.

Abstract: Large-scale pretraining datasets drive the success of large language models (LLMs). However, these web-scale corpora inevitably contain large amounts of noisy data due to unregulated web content or randomness inherent in data. Although LLM pretrainers often speculate that such noise contributes to instabilities in large-scale LLM pretraining and, in the worst cases, loss divergence, this phenomenon remains poorly understood.In this work, we present a systematic empirical study of whether noisy data causes LLM pretraining divergences and how it does so. By injecting controlled synthetic uniformly random noise into otherwise clean datasets, we analyze training dynamics across model sizes ranging from 480M to 5.2B parameters. We show that noisy data indeed induces training loss divergence, and that the probability of divergence depends strongly on the noise type, amount of noise, and model scale. We further find that noise-induced divergences exhibit activation patterns distinct from those caused by high learning rates, and we provide diagnostics that differentiate these two failure modes. Together, these results provide a large-scale, controlled characterization of how noisy data affects loss divergence in LLM pretraining.

[1358] Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, Jieyu Zhao

Main category: cs.LG

TL;DR: AdaRFT introduces adaptive curriculum learning to reinforcement finetuning for LLMs, dynamically adjusting problem difficulty based on reward signals to improve training efficiency and mathematical reasoning performance.

Details

Motivation: Reinforcement finetuning (RFT) enhances LLM mathematical reasoning but is sample- and compute-inefficient, requiring extensive training. The authors aim to improve both efficiency and final accuracy of RFT through adaptive curriculum learning.

Method: AdaRFT dynamically adjusts training problem difficulty based on the model’s recent reward signals, ensuring tasks are challenging but solvable. This adaptive sampling strategy maintains optimal difficulty range and only requires lightweight extension to standard RFT algorithms like PPO, without modifying reward functions or model architecture.

Result: Experiments on competition-level math datasets show AdaRFT significantly improves both training efficiency and reasoning performance. It reduces training time by up to 2x and improves accuracy by considerable margins across multiple data distributions and model sizes.

Conclusion: AdaRFT offers a more scalable and effective RFT framework that enhances mathematical reasoning capabilities of LLMs while dramatically improving training efficiency through adaptive curriculum learning.

Abstract: Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model’s recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces training time by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.

[1359] Active Transfer Bagging: A New Approach for Accelerated Active Learning Acquisition of Data by Combined Transfer Learning and Bagging Based Models

Vivienne Pelletier, Daniel J. Rivera, Obinna Nwokonkwo, Steven A. Wilson, Christopher L. Muhich

Main category: cs.LG

TL;DR: ATBagging is a method for selecting initial seed data for active learning by leveraging related datasets, using Bayesian bagged ensembles to estimate informativeness and determinantal point processes for diversity.

Details

Motivation: Active learning reduces labeling costs but early performance depends heavily on the initial random seed set. Many applications have related datasets available that could be used to construct better seed sets.

Method: ATBagging uses Bayesian interpretation of bagged ensemble models to estimate informativeness by comparing in-bag and out-of-bag predictive distributions. It imposes diversity via determinantal point processes with Random Fourier Features and quality-diversity factorization incorporating informativeness scores.

Result: ATBagging improves or ties early active learning performance across seed sizes (10-100) on four real-world datasets (QM9, ERA5, Forbes 2000, Beijing PM2.5), increasing area under learning curves in most cases, with strongest benefits in low-data regimes.

Conclusion: ATBagging provides a low-cost, high-reward method for initiating active learning-based data collection by intelligently selecting seed data from related datasets.

Abstract: Modern machine learning has achieved remarkable success on many problems, but this success often depends on the existence of large, labeled datasets. While active learning can dramatically reduce labeling cost when annotations are expensive, early performance is frequently dominated by the initial seed set, typically chosen at random. In many applications, however, related or approximate datasets are readily available and can be leveraged to construct a better seed set. We introduce a new method for selecting the seed data set for active learning, Active-Transfer Bagging (ATBagging). ATBagging estimates the informativeness of candidate data point from a Bayesian interpretation of bagged ensemble models by comparing in-bag and out-of-bag predictive distributions from the labeled dataset, yielding an information-gain proxy. To avoid redundant selections, we impose feature-space diversity by sampling a determinantal point process (DPP) whose kernel uses Random Fourier Features and a quality-diversity factorization that incorporates the informativeness scores. This same blended method is used for selection of new data points to collect during the active learning phase. We evaluate ATBagging on four real-world datasets covering both target-transfer and feature-shift scenarios (QM9, ERA5, Forbes 2000, and Beijing PM2.5). Across seed sizes nseed = 10-100, ATBagging improves or ties early active learning and increases area under the learning-curve relative to alternative seed subset selection methodologies in almost all cases, with strongest benefits in low-data regimes. Thus, ATBagging provides a low-cost, high reward means to initiating active learning-based data collection.

[1360] Trust Region Continual Learning as an Implicit Meta-Learner

Zekun Wang, Anant Gupta, Christopher J. MacLellan

Main category: cs.LG

TL;DR: Trust region continual learning combines generative replay with Fisher-metric constraints to enable rapid re-convergence to prior task optima without explicit meta-learning optimization.

Details

Motivation: Standard continual learning methods face tradeoffs: regularization-based approaches overconstrain updates when task optima weakly overlap, while replay-based methods suffer from drift due to imperfect replay. The paper seeks a hybrid approach that combines the benefits of both.

Method: Proposes trust region continual learning that combines generative replay with a Fisher-metric trust region constraint. This yields a MAML-style interpretation with single implicit inner step: replay provides old-task gradient signals while Fisher-weighted penalty offers efficient offline curvature shaping.

Result: On task-incremental diffusion image generation and continual diffusion-policy control, trust region continual learning achieves best final performance and retention, and consistently recovers early-task performance faster than EWC, replay, and continual meta-learning baselines.

Conclusion: The approach demonstrates emergent meta-learning properties in continual learning, where the model becomes an initialization that rapidly re-converges to prior task optima after task transitions without explicit bilevel optimization.

Abstract: Continual learning aims to acquire tasks sequentially without catastrophic forgetting, yet standard strategies face a core tradeoff: regularization-based methods (e.g., EWC) can overconstrain updates when task optima are weakly overlapping, while replay-based methods can retain performance but drift due to imperfect replay. We study a hybrid perspective: \emph{trust region continual learning} that combines generative replay with a Fisher-metric trust region constraint. We show that, under local approximations, the resulting update admits a MAML-style interpretation with a single implicit inner step: replay supplies an old-task gradient signal (query-like), while the Fisher-weighted penalty provides an efficient offline curvature shaping (support-like). This yields an emergent meta-learning property in continual learning: the model becomes an initialization that rapidly \emph{re-converges} to prior task optima after each task transition, without explicitly optimizing a bilevel objective. Empirically, on task-incremental diffusion image generation and continual diffusion-policy control, trust region continual learning achieves the best final performance and retention, and consistently recovers early-task performance faster than EWC, replay, and continual meta-learning baselines.

[1361] Repurposing Protein Language Models for Latent Flow-Based Fitness Optimization

Amaru Caceres Arroyo, Lea Bogensperger, Ahmed Allam, Michael Krauthammer, Konrad Schindler, Dominik Narnhofer

Main category: cs.LG

TL;DR: CHASE is a protein fitness optimization framework that uses pretrained protein language models to generate high-fitness variants through compressed embeddings and conditional flow-matching with classifier-free guidance.

Details

Motivation: Protein fitness optimization faces challenges with vast combinatorial landscapes where high-fitness variants are extremely sparse, and current methods either underperform or require computationally expensive gradient-based sampling.

Method: CHASE repurposes evolutionary knowledge from pretrained protein language models by compressing their embeddings into a compact latent space, then trains a conditional flow-matching model with classifier-free guidance to enable direct generation of high-fitness variants without predictor-based guidance during ODE sampling steps.

Result: CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks, and bootstrapping with synthetic data further enhances performance in data-constrained settings.

Conclusion: CHASE provides an effective framework for protein fitness optimization that leverages pretrained language models and flow-matching techniques to generate high-fitness variants efficiently without expensive gradient-based sampling.

Abstract: Protein fitness optimization is challenged by a vast combinatorial landscape where high-fitness variants are extremely sparse. Many current methods either underperform or require computationally expensive gradient-based sampling. We present CHASE, a framework that repurposes the evolutionary knowledge of pretrained protein language models by compressing their embeddings into a compact latent space. By training a conditional flow-matching model with classifier-free guidance, we enable the direct generation of high-fitness variants without predictor-based guidance during the ODE sampling steps. CHASE achieves state-of-the-art performance on AAV and GFP protein design benchmarks. Finally, we show that bootstrapping with synthetic data can further enhance performance in data-constrained settings.

[1362] Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

Main category: cs.LG

TL;DR: Theoretical analysis of performance gap between RLHF and DPO under representation gap, showing RLHF can outperform DPO in sparse reward settings with statistical advantages.

Details

Motivation: To provide a fine-grained theoretical understanding of the performance differences between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), particularly focusing on how representation gaps affect their relative effectiveness.

Method: Theoretical decomposition of performance gap into explicit representation gap (under exact optimization) and implicit representation gap (under finite samples). Analysis includes characterization of how relative capacities of reward and policy model classes influence final policy quality, and construction of concrete examples with sparse ground-truth rewards.

Result: Shows that RLHF, DPO, or online DPO can outperform each other depending on model mis-specifications. Online DPO outperforms both RLHF and standard DPO when reward and policy model classes are isomorphic and both mis-specified. In sparse reward settings, RLHF requires significantly fewer samples than DPO to recover effective reward models.

Conclusion: Provides comprehensive understanding of RLHF vs DPO performance gaps under various settings, offering practical insights into when each method is preferred based on representation gaps and statistical efficiency considerations.

Abstract: We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap. Our study decomposes this gap into two sources: an explicit representation gap under exact optimization and an implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy qualities. We show that RLHF, DPO, or online DPO can outperform one another depending on type of model mis-specifications. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is implicitly sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.

[1363] Embedding Perturbation may Better Reflect the Uncertainty in LLM Reasoning

Qihao Wen, Jiahao Wang, Yang Nan, Pengfei He, Ravi Tandon, Han Xu

Main category: cs.LG

TL;DR: Perturbation-based uncertainty quantification method for LLMs that identifies incorrect reasoning steps by measuring token sensitivity to embedding perturbations

Details

Motivation: LLMs can produce unreliable outputs, requiring uncertainty quantification. For reasoning tasks, uncertainty should be estimated not just for final answers but also for intermediate steps to enable targeted interventions.

Method: Proposes perturbation-based uncertainty metrics that measure how sensitive tokens are to perturbations on preceding token embeddings. Incorrect reasoning steps tend to contain highly sensitive tokens, which can be identified using this sensitivity score.

Result: Perturbation-based metrics achieve stronger uncertainty quantification performance compared to baseline methods like token probability and token entropy. They also offer better simplicity and efficiency than approaches requiring multiple sampling.

Conclusion: Token sensitivity to embedding perturbations serves as an effective indicator of uncertainty in LLM reasoning steps, enabling more fine-grained uncertainty quantification for intermediate reasoning processes.

Abstract: Large language Models (LLMs) have achieved significant breakthroughs across diverse domains; however, they can still produce unreliable or misleading outputs. For responsible LLM application, Uncertainty Quantification (UQ) techniques are used to estimate a model’s uncertainty about its outputs, indicating the likelihood that those outputs may be problematic. For LLM reasoning tasks, it is essential to estimate the uncertainty not only for the final answer, but also for the intermediate steps of the reasoning, as this can enable more fine-grained and targeted interventions. In this study, we explore what UQ metrics better reflect the LLM’s ``intermediate uncertainty’‘during reasoning. Our study reveals that an LLMs’ incorrect reasoning steps tend to contain tokens which are highly sensitive to the perturbations on the preceding token embeddings. In this way, incorrect (uncertain) intermediate steps can be readily identified using this sensitivity score as guidance in practice. In our experiments, we show such perturbation-based metric achieves stronger uncertainty quantification performance compared with baseline methods such as token (generation) probability and token entropy. Besides, different from approaches that rely on multiple sampling, the perturbation-based metrics offer better simplicity and efficiency.

[1364] Maximizing Reliability with Bayesian Optimization

Jack M. Buckingham, Ivo Couckuyt, Juergen Branke

Main category: cs.LG

TL;DR: Bayesian optimization methods for reliability maximization with extremely rare failure probabilities using importance sampling

Details

Motivation: Optimizing reliability in manufacturing where failures are extremely rare (10^-6 to 10^-8) requires efficient methods for expensive black-box optimization problems

Method: Two Bayesian optimization methods based on Thompson sampling and knowledge gradient, incorporating importance sampling to target extremely small failure probabilities

Result: Proposed methods outperform existing methods in both extreme and non-extreme failure probability regimes

Conclusion: The importance sampling-enhanced BO methods provide effective solutions for reliability optimization with rare failures

Abstract: Bayesian optimization (BO) is a popular, sample-efficient technique for expensive, black-box optimization. One such problem arising in manufacturing is that of maximizing the reliability, or equivalently minimizing the probability of a failure, of a design which is subject to random perturbations - a problem that can involve extremely rare failures ($P_\mathrm{fail} = 10^{-6}-10^{-8}$). In this work, we propose two BO methods based on Thompson sampling and knowledge gradient, the latter approximating the one-step Bayes-optimal policy for minimizing the logarithm of the failure probability. Both methods incorporate importance sampling to target extremely small failure probabilities. Empirical results show the proposed methods outperform existing methods in both extreme and non-extreme regimes.

[1365] Certain Head, Uncertain Tail: Expert-Sample for Test-Time Scaling in Fine-Grained MoE

Yuanteng Chen, Peisong Wang, Nanxin Zeng, Yuantian Shao, Gang Li, Jing Liu, Jian Cheng

Main category: cs.LG

TL;DR: Expert-Sample: A training-free method for fine-grained MoE models that preserves high-confidence expert selections while injecting controlled stochasticity into uncertain tail experts to improve reasoning diversity and pass@n performance without destabilizing outputs.

Details

Motivation: Test-time scaling via multiple candidate solutions improves LLM performance, but token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE models with hundreds of experts per layer offer an unexplored alternative through their rich routing space, where router scores show a pattern of high-confidence experts followed by uncertain tail candidates.

Method: Expert-Sample analyzes fine-grained MoE routing patterns and finds that while single-run greedy accuracy remains stable with fewer experts, multi-sample pass@n degrades significantly. The method preserves high-confidence expert selections while injecting controlled stochasticity into the uncertain tail of low-confidence experts, enabling diverse generation without destabilizing outputs.

Result: Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct with GPQA-Diamond and 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.

Conclusion: Expert-Sample effectively leverages the routing patterns in fine-grained MoE models to improve reasoning diversity and performance without additional training, demonstrating that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity.

Abstract: Test-time scaling improves LLM performance by generating multiple candidate solutions, yet token-level sampling requires temperature tuning that trades off diversity against stability. Fine-grained MoE, featuring hundreds of well-trained experts per layer and multi-expert activation per token, offers an unexplored alternative through its rich routing space. We empirically characterize fine-grained MoE routing and uncover an informative pattern: router scores exhibit a certain head of high-confidence experts followed by an uncertain tail of low-confidence candidates. While single-run greedy accuracy remains stable when fewer experts are activated, multi-sample pass@n degrades significantly-suggesting that the certain head governs core reasoning capability while the uncertain tail correlates with reasoning diversity. Motivated by these findings, we propose Expert-Sample, a training-free method that preserves high-confidence selections while injecting controlled stochasticity into the uncertain tail, enabling diverse generation without destabilizing outputs. Evaluated on multiple fine-grained MoE models across math, knowledge reasoning, and code tasks, Expert-Sample consistently improves pass@n and verification-based accuracy. On Qwen3-30B-A3B-Instruct evaluated on GPQA-Diamond with 32 parallel samples, pass@32 rises from 85.4% to 91.9%, and accuracy improves from 59.1% to 62.6% with Best-of-N verification.

[1366] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Main category: cs.LG

TL;DR: R-Stitch is a training-free hybrid decoding framework that uses token-level entropy to route computation between small and large language models, reducing inference cost while maintaining accuracy.

Details

Motivation: Chain-of-thought reasoning improves LLM problem-solving but incurs high inference costs due to long autoregressive trajectories. Existing acceleration methods have limitations: speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, failing to leverage that smaller models can sometimes produce more concise reasoning traces.

Method: R-Stitch uses token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and a large language model (LLM). High-entropy tokens (more uncertain) are delegated to the LLM, while low-entropy tokens are handled by the SLM. R-Stitch⁺ extends this with an adaptive routing policy that dynamically adjusts token budgets beyond fixed thresholds.

Result: Achieves substantial acceleration with peak speedups of 3.00× on DeepSeek-R1-Distill-Qwen-7B, 3.85× on 14B, and 4.10× on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Enables adaptive efficiency-accuracy trade-offs without retraining.

Conclusion: R-Stitch effectively reduces both per-token decoding complexity and the number of generated tokens through entropy-guided routing, achieving significant inference acceleration with minimal accuracy loss for chain-of-thought reasoning tasks.

Abstract: Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency–accuracy trade-offs that can be tailored to diverse computational budgets without retraining.

[1367] Finite-Sample Wasserstein Error Bounds and Concentration Inequalities for Nonlinear Stochastic Approximation

Seo Taek Kong, R. Srikant

Main category: cs.LG

TL;DR: Non-asymptotic error bounds for nonlinear stochastic approximation algorithms in Wasserstein-p distance, with explicit finite-sample guarantees for last iterate and Polyak-Ruppert average convergence rates.

Details

Motivation: To provide explicit finite-sample guarantees for stochastic approximation algorithms, bridging the gap between finite-sample analyses and asymptotic theory, and improving upon moment bounds and Markov's inequality for high-probability concentration.

Method: Develops a coupling argument comparing discrete-time process to limiting Ornstein-Uhlenbeck process, applies to algorithms with general noise conditions (martingale differences, functions of ergodic Markov chains), and analyzes Polyak-Ruppert average convergence directly.

Result: Shows normalized last iterates converge to Gaussian distribution in p-Wasserstein distance at rate O(γ_n^{1/6}) where γ_n is step size; Polyak-Ruppert average converges at rate O(n^{-1/6}); demonstrates applications to linear stochastic approximation and stochastic gradient descent.

Conclusion: Provides rigorous non-asymptotic distributional guarantees for stochastic approximation algorithms, enabling explicit quantification of transition from heavy-tailed to Gaussian behavior and establishing convergence rates to central limit theorem.

Abstract: This paper derives non-asymptotic error bounds for nonlinear stochastic approximation algorithms in the Wasserstein-$p$ distance. To obtain explicit finite-sample guarantees for the last iterate, we develop a coupling argument that compares the discrete-time process to a limiting Ornstein-Uhlenbeck process. Our analysis applies to algorithms driven by general noise conditions, including martingale differences and functions of ergodic Markov chains. Complementing this result, we handle the convergence rate of the Polyak-Ruppert average through a direct analysis that applies under the same general setting. Assuming the driving noise satisfies a non-asymptotic central limit theorem, we show that the normalized last iterates converge to a Gaussian distribution in the $p$-Wasserstein distance at a rate of order $γ_n^{1/6}$, where $γ_n$ is the step size. Similarly, the Polyak-Ruppert average is shown to converge in the Wasserstein distance at a rate of order $n^{-1/6}$. These distributional guarantees imply high-probability concentration inequalities that improve upon those derived from moment bounds and Markov’s inequality. We demonstrate the utility of this approach by considering two applications: (1) linear stochastic approximation, where we explicitly quantify the transition from heavy-tailed to Gaussian behavior of the iterates, thereby bridging the gap between recent finite-sample analyses and asymptotic theory and (2) stochastic gradient descent, where we establish rate of convergence to the central limit theorem.

[1368] MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

Main category: cs.LG

TL;DR: MaPPO is a preference optimization method that incorporates prior reward knowledge into LLM alignment, generalizing DPO variants while maintaining computational efficiency.

Details

Motivation: Current preference optimization methods like DPO treat preference learning as maximum likelihood estimation, which oversimplifies response classification and doesn't incorporate prior knowledge about rewards.

Method: MaPPO extends DPO by formulating preference learning as Maximum a Posteriori estimation, explicitly integrating prior reward estimates into the optimization objective without adding hyperparameters.

Result: Extensive evaluations on MT-Bench, AlpacaEval 2.0, and Arena-Hard show consistent improvements in alignment performance across different model sizes without sacrificing efficiency.

Conclusion: MaPPO provides a principled framework for incorporating prior knowledge into preference optimization, generalizing existing DPO variants and improving alignment performance.

Abstract: As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm employed by Direct Preference Optimization (DPO) and its variants of treating preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. Additionally, MaPPO introduces no additional hyperparameters, and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin for DPO variants, including widely used SimPO, IPO and CPO, and produce consistent improvements. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.

[1369] Conflict-Aware Client Selection for Multi-Server Federated Learning

Mingwei Hong, Zheng Lin, Zehang Lin, Lin Li, Miao Yang, Xia Du, Zihan Fang, Zhaolu Kang, Dianxin Luan, Shunzhi Zhu

Main category: cs.LG

TL;DR: A decentralized reinforcement learning framework with conflict risk prediction (RL-CRP) for optimizing client selection in multi-server federated learning systems to reduce bandwidth conflicts and improve training efficiency.

Details

Motivation: Traditional single-server FL suffers from high communication latency, while multi-server FL faces resource contention and bandwidth conflicts due to overlapping client coverage and uncoordinated client selection, leading to training failures.

Method: Proposes RL-CRP: decentralized reinforcement learning with conflict risk prediction. Each server uses a categorical hidden Markov model to estimate client selection conflict likelihood from sparse historical data, and incorporates a fairness-aware reward mechanism to promote long-term client participation.

Result: Extensive experiments show RL-CRP effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.

Conclusion: The proposed RL-CRP framework addresses key limitations in multi-server FL systems by optimizing client selection to minimize training latency and resource contention while maintaining fairness.

Abstract: Federated learning (FL) has emerged as a promising distributed machine learning (ML) that enables collaborative model training across clients without exposing raw data, thereby preserving user privacy and reducing communication costs. Despite these benefits, traditional single-server FL suffers from high communication latency due to the aggregation of models from a large number of clients. While multi-server FL distributes workloads across edge servers, overlapping client coverage and uncoordinated selection often lead to resource contention, causing bandwidth conflicts and training failures. To address these limitations, we propose a decentralized reinforcement learning with conflict risk prediction, named RL CRP, to optimize client selection in multi-server FL systems. Specifically, each server estimates the likelihood of client selection conflicts using a categorical hidden Markov model based on its sparse historical client selection sequence. Then, a fairness-aware reward mechanism is incorporated to promote long-term client participation for minimizing training latency and resource contention. Extensive experiments demonstrate that the proposed RL-CRP framework effectively reduces inter-server conflicts and significantly improves training efficiency in terms of convergence speed and communication cost.

[1370] Expanding the Capabilities of Reinforcement Learning via Text Feedback

Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, Andrea Zanette

Main category: cs.LG

TL;DR: RL from Text Feedback (RLTF) uses textual feedback as intermediate supervision between sparse rewards and expensive demonstrations for LLM post-training, improving single-turn performance through multi-turn feedback internalization.

Details

Motivation: Current RL for LLMs relies on uninformative binary rewards, while distillation requires costly demonstrations. Text feedback offers richer supervision than rewards and is cheaper than demonstrations, representing a natural human interaction mode already abundant in real-world settings.

Method: Proposes RL from Text Feedback (RLTF) framework with two methods: Self Distillation (RLTF-SD) trains single-turn policy to match its own feedback-conditioned second-turn generations; Feedback Modeling (RLTF-FM) predicts feedback as auxiliary objective. Both leverage text feedback during training but not inference.

Result: Both RLTF methods consistently outperform strong baselines across reasoning puzzles, competition math, and creative writing tasks, demonstrating effectiveness of text feedback as rich supervision.

Conclusion: Text feedback provides valuable intermediate supervision between binary rewards and demonstrations, enabling more effective RL for LLMs with practical applications in real-world settings where textual critique is abundant.

Abstract: The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout as binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Towards leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), where text feedback is available during training but not at inference. Therefore, models must learn to internalize the feedback in order to improve their test-time single-turn performance. To do this, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analysis on both methods, and empirically evaluate on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.

[1371] MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training

Dulhan Jayalath, Oiwi Parker Jones

Main category: cs.LG

TL;DR: MEG-XL is a brain-to-text interface model pre-trained with 2.5 minutes of MEG context per sample (5-300x longer than prior work) that achieves state-of-the-art word decoding from brain data with dramatically less training data.

Details

Motivation: Clinical brain-to-text interfaces need to work for paralyzed patients who cannot provide extensive training recordings. Current methods pre-train with only a few seconds of context, but natural speech unfolds over minutes, so longer neural context is needed for better statistical priors and data-efficient generalization.

Method: Proposed MEG-XL model pre-trained with 2.5 minutes of MEG context per sample (equivalent to 191k tokens), which is 5-300x longer than prior work. The model is then fine-tuned for word decoding from brain data.

Result: MEG-XL matches supervised performance with a fraction of the data (e.g., 1 hour vs 50 hours) and outperforms brain foundation models. Models pre-trained with longer contexts learn representations that transfer better to word decoding.

Conclusion: Long-context pre-training helps exploit extended neural context that other methods unnecessarily discard, enabling more data-efficient brain-to-text interfaces for clinical applications.

Abstract: Clinical brain-to-text interfaces are designed for paralysed patients who cannot provide extensive training recordings. Pre-training improves data-efficient generalisation by learning statistical priors across subjects, but these priors critically depend on context. While natural speech might unfold gradually over minutes, most methods pre-train with only a few seconds of context. Thus, we propose MEG-XL, a model pre-trained with 2.5 minutes of MEG context per sample, 5-300x longer than prior work, and equivalent to 191k tokens, capturing extended neural context. Fine-tuning on the task of word decoding from brain data, MEG-XL matches supervised performance with a fraction of the data (e.g. 1hr vs 50hrs) and outperforms brain foundation models. We find that models pre-trained with longer contexts learn representations that transfer better to word decoding. Our results indicate that long-context pre-training helps exploit extended neural context that other methods unnecessarily discard. Code, model weights, and instructions are available at https://github.com/neural-processing-lab/MEG-XL .

[1372] SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML

Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Main category: cs.LG

TL;DR: SNAP-UQ: A lightweight, single-pass uncertainty estimation method for neural networks that predicts next activations to compute uncertainty scores without temporal buffers or auxiliary exits.

Details

Motivation: Existing uncertainty estimation methods for neural networks often require multiple forward passes (like MC dropout or deep ensembles) or auxiliary exits, which increase computational cost, memory footprint, and latency - making them impractical for resource-constrained TinyML applications.

Method: SNAP-UQ uses depth-wise next-activation prediction: it taps a small set of backbone layers and uses tiny int8 heads to predict the mean and scale of the next activation from a low-rank projection of the previous activation. The standardized prediction error forms a depth-wise surprisal signal that is aggregated and mapped through a lightweight monotone calibrator into an uncertainty score.

Result: SNAP-UQ reduces flash memory usage by ~40-60% and latency by ~25-35% compared to early-exit and deep-ensemble baselines. It improves accuracy-drop event detection on corrupted streams by multiple AUPRC points and maintains strong failure detection (AUROC ≈ 0.9) in a single forward pass, while adding only tens of kilobytes to deployment footprint.

Conclusion: SNAP-UQ offers a novel, resource-efficient approach to uncertainty estimation that grounds uncertainty in layer-to-layer dynamics rather than solely in output confidence, making it suitable for robust TinyML monitoring with minimal computational overhead.

Abstract: This paper proposes a novel and practical method, SNAP-UQ, for single-pass, label-free uncertainty estimation based on depth-wise next-activation prediction. SNAP-UQ taps a small set of backbone layers and uses tiny int8 heads to predict the mean and scale of the next activation from a low-rank projection of the previous one; the resulting standardized prediction error forms a depth-wise surprisal signal that is aggregated and mapped through a lightweight monotone calibrator into an actionable uncertainty score. The design introduces no temporal buffers or auxiliary exits and preserves state-free inference, while increasing deployment footprint by only a few tens of kilobytes. Across vision and audio backbones, SNAP-UQ reduces flash and latency relative to early-exit and deep-ensemble baselines (typically $\sim$40–60% smaller and $\sim$25–35% faster), with several competing methods at similar accuracy often exceeding MCU memory limits. On corrupted streams, it improves accuracy-drop event detection by multiple AUPRC points and maintains strong failure detection (AUROC $\approx 0.9$) in a single forward pass. By grounding uncertainty in layer-to-layer dynamics rather than solely in output confidence, SNAP-UQ offers a novel, resource-efficient basis for robust TinyML monitoring.

[1373] Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu

Main category: cs.LG

TL;DR: Attention outputs in transformers are surprisingly low-dimensional (~60% effective rank), while MLPs and residual streams remain high-dimensional (~90%). This low-rank structure causes dead features in sparse dictionary learning, which can be fixed by subspace-constrained training.

Details

Motivation: The paper aims to understand the geometric properties of transformer attention mechanisms and address the prevalent dead feature problem in sparse dictionary learning for large language models.

Method: Analyzed effective dimensionality of attention outputs vs. MLP outputs across diverse models and datasets. Identified attention output projection matrix as key factor. Proposed subspace-constrained training for sparse autoencoders that initializes features into the active subspace of activations.

Result: Consistent finding that attention outputs have ~60% effective dimensionality while MLP outputs have ~90%. Subspace-constrained training reduced dead features from 87% to below 1% in Attention Output SAEs with 1M features.

Conclusion: Attention mechanisms operate in surprisingly low-dimensional subspaces, which explains dead feature problems in sparse dictionary learning. The proposed subspace-constrained training provides practical improvements for sparse dictionary learning in LLMs.

Abstract: Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full-rank, exhibiting effective ranks around $90%$. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we find this low-rank structure as a key factor of the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from 87% to below 1% in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.

[1374] Uncertainty-Aware Knowledge Tracing Models

Joshua Mitton, Prarthana Bhattacharyya, Ralph Abboud, Simon Woodhead

Main category: cs.LG

TL;DR: Knowledge Tracing models can be enhanced with uncertainty estimation to detect when they make incorrect predictions, particularly when students choose distractors, which could improve educational platforms in resource-limited settings.

Details

Motivation: Current Knowledge Tracing (KT) models focus on predictive accuracy but often make incorrect predictions when students choose distractors, causing student errors to go undetected. There's a need to enhance KT models with additional capabilities beyond just accuracy.

Method: The authors present an approach to add uncertainty estimation capabilities to KT models. They demonstrate that larger predictive uncertainty aligns with model incorrect predictions, suggesting uncertainty can be used to detect when models are likely wrong.

Result: The research shows that uncertainty in KT models is informative and that this signal would be pedagogically useful for educational learning platforms, especially in limited resource settings where understanding student ability is crucial.

Conclusion: Adding uncertainty estimation to KT models provides valuable pedagogical signals beyond just predictive accuracy, enabling better detection of student errors and more effective educational interventions in resource-constrained environments.

Abstract: The main focus of research on Knowledge Tracing (KT) models is on model developments with the aim of improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, leading to student errors going undetected. We present an approach to add new capabilities to KT models by capturing predictive uncertainty and demonstrate that a larger predictive uncertainty aligns with model incorrect predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful for application in an educational learning platform that can be used in a limited resource setting where understanding student ability is necessary.

[1375] Anchored Supervised Fine-Tuning

He Zhu, Junyou Su, Peng Lai, Ren Ma, Wenjia Zhang, Linyi Yang, Guanhua Chen

Main category: cs.LG

TL;DR: ASFT (Anchored Supervised Fine-Tuning) improves post-training of LLMs by adding KL regularization to DFT’s reweighting scheme, achieving better stability and performance than SFT and DFT across reasoning tasks.

Details

Motivation: Address the trade-off between SFT (efficient but memorizes) and RL (generalizes better but computationally expensive), and fix DFT's instability issues while maintaining its theoretical advantages.

Method: Analyze DFT through reward-weighted regression framework, identify its lack of distributional anchoring causing drift, then propose ASFT which adds lightweight KL regularization to DFT’s reweighting scheme to preserve tight RL bounds while ensuring stability.

Result: ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation tasks, achieving substantial improvements with minimal computational overhead.

Conclusion: Principled theoretical analysis through RWR framework leads to both stronger guarantees and practical gains in post-training methods, with ASFT providing a stable and effective middle ground between SFT and RL.

Abstract: Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generalization at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabilities and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide a analysis of DFT through the reward-weighted regression (RWR) framework, revealing that it corresponds to a specific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this construction lacks distributional anchoring, leading to progressive drift that undermines training stability. To address this, we propose Anchored Supervised Fine-Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regularization to preserve tightness while ensuring stability. Empirically, ASFT consistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a systematic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains. The code is available at https://github.com/zhuchichi56/ASFT.

[1376] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa, Alan Zhu, Abby O’Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

Main category: cs.LG

TL;DR: Advisor Models train small open-weight models to generate natural language advice that improves black-box frontier language models’ performance on specific tasks.

Details

Motivation: Frontier language models are deployed as black-box services where model weights cannot be modified, limiting customization to prompting. There's a need for methods to enhance these black-box models' capabilities without access to their internal parameters.

Method: Train small open-weight Advisor Models to generate dynamic, per-instance natural language advice that can be provided as prompts to black-box frontier models. These advisors are trained to optimize performance on specific tasks and can transfer improvements across different models.

Result: Advisor Models improve GPT-5’s performance on RuleArena (Taxes) by 71%, reduce Gemini 3 Pro’s steps in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). Advisors are transferable and robust with no degradation on other benchmarks.

Conclusion: Advisor Models provide a practical and cost-effective method for parametric optimization of black-box frontier models through natural language advice generation, enabling customization and performance enhancement without model weight access.

Abstract: Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5’s performance on RuleArena (Taxes) by 71%, reduce Gemini 3 Pro’s steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on other benchmarks than the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

[1377] Simple Policy Gradients for Reasoning with Diffusion Language Models

Anthony Zhan

Main category: cs.LG

TL;DR: Proposes AGRPO, a policy gradient algorithm for post-training diffusion LLMs that optimizes individual denoising steps rather than entire sequences, achieving significant gains on math/reasoning tasks and enabling faster sampling.

Details

Motivation: Diffusion LLMs lack tractable sequence-level likelihoods, preventing them from benefiting from modern LLM post-training techniques like reinforcement learning, which limits their real-world applicability. Existing dLLM post-training methods rely on heuristic approximations or lower bounds.

Method: Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation. Instead of optimizing entire sequences, AGRPO optimizes individual denoising steps, making RL training tractable for diffusion models.

Result: Achieved +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over base LLaDA model. Outperformed comparable dLLM RL methods like diffu-GRPO. Models trained with AGRPO can sample 4x faster with minimal performance sacrifices.

Conclusion: AGRPO enables effective reinforcement learning for diffusion LLMs by exploiting their Markovian structure, significantly improving performance on reasoning tasks while maintaining sampling efficiency.

Abstract: Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO’s effectiveness on different math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.

[1378] Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

Main category: cs.LG

TL;DR: UDS is an efficient online batch selection framework for supervised fine-tuning of LLMs that balances data utility and diversity without external resources, reducing training time while maintaining performance.

Details

Motivation: Full-dataset SFT is computationally expensive and can lead to overfitting/bias. Existing online batch selection methods often focus only on utility, require external resources, and add training overhead.

Method: UDS uses nuclear norm of logits matrix to capture utility and intra-sample diversity, estimates inter-sample diversity via low-dimensional embedding comparisons with a lightweight memory buffer, eliminating need for external resources or extra backpropagation.

Result: UDS outperforms state-of-the-art online batch selection methods across multiple benchmarks under varying data budgets and significantly reduces training time compared to full-dataset fine-tuning.

Conclusion: UDS provides an efficient, resource-light framework for data curation in SFT that balances utility and diversity, enabling faster training without performance degradation.

Abstract: Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.

[1379] Geometric-disentangelment Unlearning

Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Chengxiang Zhai, Heng Ji, Huan Zhang

Main category: cs.LG

TL;DR: Geometric-disentanglement Unlearning (GU) is a theoretically grounded projection method that reduces the trade-off between forgetting unwanted content and preserving retaining knowledge in LLM unlearning by ensuring update directions are orthogonal to retain gradients.

Details

Motivation: Existing LLM unlearning methods often cause collateral degradation of retaining knowledge when removing forget sets, creating a persistent forget-retain trade-off. Current approaches are heuristic or rely on offline feature constructions that don't capture update-time interactions between forget and retain knowledge.

Method: The authors formalize “no side effects” as local retain invariance under small parameter updates and prove an equivalence: retain loss is locally invariant if and only if update direction is orthogonal to the subspace spanned by retain gradients. Based on this insight, they propose Geometric-disentanglement Unlearning (GU), a lightweight projection method that can be plug-and-play to existing gradient-based unlearning methods to mitigate forget-retain side effects.

Result: Experiments on TOFU, MUSE, and WMDP-cyber benchmarks show GU strengthens forgetting while reducing retain drift. When added to SimNPO, it achieves up to 62% improved forgetting Extraction Strength (ES) and 31% higher retain ES.

Conclusion: GU provides a theoretically grounded solution to reduce the forget-retain trade-off in LLM unlearning through geometric disentanglement of update directions, offering practical improvements when combined with existing gradient-based methods.

Abstract: Large language models (LLMs) can internalize private or harmful content, motivating unlearning that removes a forget set while preserving retaining knowledge. However, forgetting updates often cause collateral degradation on retaining knowledge, creating a persistent trade-off. Existing LLM unlearning methods are often heuristic, and other theoretical approaches rely on offline feature constructions that do not capture update-time forget-retain interaction in LLMs. To address this limitation, we aim to develop an LLM unlearning method that reduces the forget-retain trade-off with theoretical guarantees. We take a first-principles view by formalizing “no side effects” as local retain invariance under small parameter updates, and prove an equivalence under optimizer-induced geometry: the retain loss is locally invariant if and only if the update direction is orthogonal to the subspace spanned by retain gradients. Based on the insight, we propose Geometric-disentanglement Unlearning (GU), a lightweight and theoretically grounded projection that can be plug-and-play to existing gradient-based unlearning methods to mitigate forget-retain side effects. Experiments on TOFU, MUSE, and WMDP-cyber show that GU strengthens forgetting while reducing retain drift. When added to SimNPO, it achieves up to 62% improved forgetting Extraction Strength (ES) and 31% higher retain ES. We open-sourced our code in https://github.com/Lemutisme/Geometric-Unlearning.

[1380] Language as a Wave Phenomenon: Semantic Phase Locking and Interference in Neural Networks

Alper Yıldırım, İbrahim Yücedağ

Main category: cs.LG

TL;DR: PRISM introduces a complex-valued Transformer architecture that uses phase information for semantic processing, replacing attention with gated harmonic convolutions and enforcing unit-norm constraints to enable subtractive interference in the frequency domain.

Details

Motivation: Standard Transformers conflate semantic importance with activation magnitude, obscuring the geometric structure of latent representations. The authors aim to disentangle these factors by isolating the computational role of phase in neural representations.

Method: PRISM uses a complex-valued architecture with strict unit-norm constraints (|z| = 1) and replaces attention mechanisms with gated harmonic convolutions. This forces the model to use subtractive interference in the frequency domain for noise suppression instead of magnitude-based gating. The approach creates a hybrid architecture that fuses phase-based routing with standard attention.

Result: The hybrid architecture achieves superior parameter efficiency and representation quality compared to unconstrained baselines. The authors identify geometric phase clustering where tokens self-organize to resolve semantic ambiguities, establishing an O(N log N) reasoning framework based on spectral interference.

Conclusion: PRISM demonstrates that subtractive logic in the frequency domain is a sufficient primitive for deep reasoning, providing an algorithmic existence proof for phase-based computational approaches that can complement traditional attention mechanisms.

Abstract: In standard Transformer architectures, semantic importance is often conflated with activation magnitude, obscuring the geometric structure of latent representations. To disentangle these factors, we introduce PRISM, a complex-valued architecture designed to isolate the computational role of phase. By enforcing a strict unit-norm constraint (|z| = 1) and replacing attention with gated harmonic convolutions, the model is compelled to utilize subtractive interference in the frequency domain to suppress noise, rather than relying on magnitude-based gating. We utilize this constrained regime to demonstrate that a hybrid architecture - fusing phase-based routing with standard attention - achieves superior parameter efficiency and representation quality compared to unconstrained baselines. Mechanistically, we identify geometric phase clustering, where tokens naturally self-organize to resolve semantic ambiguities. This establishes an O(N log N) reasoning framework based on spectral interference, providing an algorithmic existence proof that subtractive logic is a sufficient primitive for deep reasoning.

[1381] Agentic Policy Optimization via Instruction-Policy Co-Evolution

Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen

Main category: cs.LG

TL;DR: INSPO introduces an instruction-policy co-evolution framework that dynamically optimizes LLM instructions during reinforcement learning, outperforming static instruction approaches on multi-turn reasoning tasks.

Details

Motivation: Current RLVR approaches rely on static, manually designed instructions that may be suboptimal for the base model and don't adapt as the agent's policy improves through environment interaction.

Method: INSPO maintains a dynamic population of instruction candidates sampled with questions, uses RL reward signals to attribute performance to instructions, prunes low performers, and generates new instructions through on-policy reflection using an LLM-based optimizer analyzing past experience.

Result: INSPO substantially outperforms strong baselines relying on static instructions on multi-turn retrieval and reasoning tasks, discovering innovative instructions that guide agents toward more strategic reasoning paths with only marginal computational overhead increase.

Conclusion: Instruction-policy co-evolution is effective for improving LLM reasoning agents, with INSPO demonstrating that dynamic instruction optimization integrated into RL loops leads to significant performance gains over static approaches.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent’s policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.

[1382] When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-jui Hsieh

Main category: cs.LG

TL;DR: NormBT: A normalized Bradley-Terry loss for reward modeling that addresses spurious learning signals from representation distance in LLM alignment.

Details

Motivation: The standard Bradley-Terry loss for reward modeling suffers from spurious learning signals where gradient norms scale with representation distance between response pairs, causing vanishing updates for close pairs and disproportionately strong updates for distant pairs, which misaligns learning.

Method: Proposes NormBT, an adaptive pairwise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. It’s a lightweight, drop-in modification to BT loss with negligible overhead.

Result: NormBT improves reward model performance consistently across various LLM backbones and datasets, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.

Conclusion: The representation distance bias in standard BT loss significantly impacts reward modeling, and NormBT effectively addresses this issue to improve alignment learning, particularly for fine-grained distinctions.

Abstract: Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show spurious learning signals due to representation distance. In particular, BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. This leads to gradients from large-distance pairs to overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.

[1383] On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models

Ali Al Sahili, Ali Chehab, Razane Tajeddine

Main category: cs.LG

TL;DR: Systematic benchmarking of Membership Inference Attack techniques integrated into LLM training data extraction pipelines to evaluate their practical utility in real-world privacy attacks.

Details

Motivation: LLMs are prone to memorizing training data, creating serious privacy risks through training data extraction and Membership Inference Attacks (MIAs). These threats are interconnected - adversaries can extract data by generating large text volumes and then use MIAs to verify if specific data was in the training set.

Method: Integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. Compare their performance in this integrated setting against results from conventional MIA benchmarks.

Result: The study provides comparative performance analysis of different MIA techniques when integrated into real-world data extraction scenarios, showing how their effectiveness differs from conventional benchmark evaluations.

Conclusion: The research enables evaluation of MIA techniques’ practical utility in real-world extraction scenarios, helping understand which methods are most effective for privacy attacks on LLMs.

Abstract: Large Language Models (LLMs) are prone to memorizing training data, which poses serious privacy risks. Two of the most prominent concerns are training data extraction and Membership Inference Attacks (MIAs). Prior research has shown that these threats are interconnected: adversaries can extract training data from an LLM by querying the model to generate a large volume of text and subsequently applying MIAs to verify whether a particular data point was included in the training set. In this study, we integrate multiple MIA techniques into the data extraction pipeline to systematically benchmark their effectiveness. We then compare their performance in this integrated setting against results from conventional MIA benchmarks, allowing us to evaluate their practical utility in real-world extraction scenarios.

[1384] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

Main category: cs.LG

TL;DR: BuPO decomposes LLM-based RL policies into internal layer policies via Transformer’s residual stream, revealing distinct entropy patterns and proposing bottom-up optimization for improved reasoning.

Details

Motivation: Existing RL approaches treat LLMs as unified policies, overlooking internal mechanisms. The paper aims to understand and leverage the internal reasoning structure of LLMs for better policy optimization.

Method: Decompose LLM-based policy into Internal Layer Policies and Internal Modular Policies via Transformer’s residual stream. Analyze entropy patterns across layers, then propose Bottom-up Policy Optimization (BuPO) that optimizes internal layers in early stages to reconstruct reasoning foundation.

Result: Reveals distinct entropy patterns: policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers. Qwen shows progressive human-like reasoning vs Llama’s abrupt final-layer convergence. BuPO demonstrates effectiveness on complex reasoning benchmarks.

Conclusion: Internal layer analysis provides insights into LLM reasoning mechanisms. Bottom-up optimization approach effectively improves reasoning performance by reconstructing the foundation from lower layers upward.

Abstract: Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanisms. In this paper, we decompose the LLM-based policy into Internal Layer Policies and Internal Modular Policies via Transformer’s residual stream. Our entropy analysis on internal policy reveals distinct patterns: (1) universally, policies evolve from high-entropy exploration in early layers to deterministic refinement in top layers; and (2) Qwen exhibits a progressive, human-like reasoning structure, contrasting with the abrupt final-layer convergence in Llama. Furthermore, we discover that optimizing internal layers induces feature refinement, forcing lower layers to capture high-level reasoning representations early. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that reconstructs the LLM’s reasoning foundation from the bottom up by optimizing internal layers in early stages. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of BuPO. Our code is available at https://github.com/Trae1ounG/BuPO.

[1385] Learning to Reason in LLMs by Expectation Maximization

Junghyun Lee, Branislav Kveton, Anup Rao, Subhojyoti Mukherjee, Ryan A. Rossi, Sunav Choudhary, Alexa Siu

Main category: cs.LG

TL;DR: The paper proposes a reward-based filtered expectation-maximization (FEM) framework for learning to reason in LLMs, comparing three sampling schemes for generating rationales that justify correct answers.

Details

Motivation: Current LLMs solve reasoning problems by generating rationales before answers, but there's a need for better methods to learn reasoning capabilities through structured optimization approaches that connect EM algorithms with modern reward-based optimization.

Method: Formalizes reasoning as a latent variable model and derives FEM objective. Compares three sampling schemes: rejection sampling with budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS) which conditions on correct answers during rationale generation.

Result: Experiments on LLM-as-a-judge calibration and summarization from feedback tasks show PPS outperforms other sampling schemes, demonstrating that sampling scheme design significantly impacts reasoning performance.

Conclusion: The FEM framework effectively connects EM and reward-based optimization for reasoning, with PPS emerging as the most effective sampling approach by leveraging conditioning on correct answers to guide rationale generation.

Abstract: Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive a reward-based filtered expectation-maximization (FEM) objective for learning to reason. This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution of rationales that justify correct answers. We instantiate and compare three sampling schemes: rejection sampling with a budget, self-taught reasoner (STaR), and prompt posterior sampling (PPS), which only keeps the rationalization stage of STaR that conditions on the correct answer in the prompt. We experiment with LLM-as-a-judge calibration and summarization from feedback tasks, where conditioning on the correct answer provides a strong guidance for generating rationales. Our experiments show the efficacy of PPS over other sampling schemes, and that the sampling scheme can have a significant impact on performance.

[1386] Unifying Learning Dynamics and Generalization in Transformers Scaling Law

Chiwun Yang

Main category: cs.LG

TL;DR: Theoretical analysis of scaling laws for transformer-based language models using ODE approximations and kernel methods, establishing phase transitions in generalization error decay with computational resources.

Details

Motivation: While scaling laws are empirically validated in LLM development, their theoretical foundations remain poorly understood. The paper aims to provide rigorous theoretical analysis of learning dynamics in transformer models under realistic conditions.

Method: Formalizes transformer learning dynamics as ODE system, approximates to kernel behaviors, analyzes SGD training for multi-layer transformers on sequence-to-sequence data with arbitrary distributions. Uses theoretical analysis to characterize convergence and establish upper bounds on excess risk.

Result: Identifies distinct phase transition: initial optimization phase with exponential decay of excess risk relative to computational cost, followed by statistical phase with power-law decay of Θ(C^{-1/6}). Derives isolated scaling laws for model size, training time, and dataset size.

Conclusion: Provides rigorous theoretical framework for understanding scaling laws in transformers, explaining how computational resources affect generalization error through distinct phases and establishing theoretical bounds that align with empirical observations.

Abstract: The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $Θ(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.

[1387] Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin

Main category: cs.LG

TL;DR: LENS framework improves RLVR by identifying and removing interference tokens from prompts to enhance exploration efficiency and training stability.

Details

Motivation: RLVR suffers from inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. Many failures stem from interference tokens in prompts rather than task difficulty.

Method: Proposes Less Noise Sampling Framework (LENS): 1) Identifies and removes interference tokens from prompts, 2) Uses successful rollouts from purified prompts to supervise policy optimization on original noisy prompts, enabling models to learn to ignore interference.

Result: LENS significantly outperforms GRPO with 3.88% average performance gain and over 1.6× speedup in convergence, demonstrating improved rollout efficiency.

Conclusion: Pruning interference tokens is critical for improving rollout efficiency in RLVR, offering a new perspective for reinforcement learning research with verifiable rewards.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training in complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first prompts by identifying and removing interference tokens. then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in the real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over 1.6$\times$ speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.

[1388] Shaping capabilities with token-level data filtering

Neil Rathi, Alec Radford

Main category: cs.LG

TL;DR: Token filtering during pretraining effectively removes undesired medical capabilities from language models, becoming more effective with scale and robust to noisy labels.

Details

Motivation: Current post hoc approaches to reducing undesired capabilities in language models can be bypassed by adversaries, so the paper explores shaping capabilities during pretraining itself as a more robust alternative.

Method: Proposes filtering pretraining data at the token level rather than document level, using sparse autoencoders to label tokens and distilling cheap, high-quality classifiers. Tests this approach on removing medical capabilities across models spanning two orders of magnitude.

Result: Token filtering is highly effective, robust, and inexpensive at scale. For largest models, token filtering leads to 7000x compute slowdown on the forget domain while maintaining alignment capabilities. Filtering gets more effective with scale and can be robust to noisy labels with sufficient pretraining compute.

Conclusion: Filtering pretraining data at the token level is an effective, scalable approach to shaping language model capabilities during training rather than after, offering robustness against adversarial bypass.

Abstract: Current approaches to reducing undesired capabilities in language models are largely post hoc, and can thus be easily bypassed by adversaries. A natural alternative is to shape capabilities during pretraining itself. On the proxy task of removing medical capabilities, we show that the simple intervention of filtering pretraining data is highly effective, robust, and inexpensive at scale. Inspired by work on data attribution, we show that filtering tokens is more effective than filtering documents, achieving the same hit to undesired capabilities at a lower cost to benign ones. Training models spanning two orders of magnitude, we then demonstrate that filtering gets more effective with scale: for our largest models, token filtering leads to a 7000x compute slowdown on the forget domain. We also show that models trained with token filtering can still be aligned on the forget domain. Along the way, we introduce a methodology for labeling tokens with sparse autoencoders and distilling cheap, high-quality classifiers. We also demonstrate that filtering can be robust to noisy labels with sufficient pretraining compute.

[1389] Real-Time Vibration-Based Bearing Fault Diagnosis Under Time-Varying Speed Conditions

Tuomas Jalonen, Mohammad Al-Sa’d, Serkan Kiranyaz, Moncef Gabbouj

Main category: cs.LG

TL;DR: A real-time CNN for bearing fault diagnosis with noise robustness and speed variation handling, plus a Fisher-based spectral separability analysis method to explain model effectiveness.

Details

Motivation: Existing bearing fault detection techniques lack adaptability to real-world dynamic conditions with noise and varying speeds, limiting practical applications.

Method: Proposes an efficient real-time CNN for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds, plus a novel Fisher-based spectral separability analysis (SSA) method to explain CNN effectiveness.

Result: Achieves 15.8% accuracy gains over state-of-the-art, robust to noise across various SNR levels, runs 5x faster than acquisition time, and SSA provides insights into model performance.

Conclusion: The proposed CNN with SSA analysis effectively addresses real-world bearing fault diagnosis challenges with superior accuracy, noise robustness, and real-time performance.

Abstract: Detection of rolling-element bearing faults is crucial for implementing proactive maintenance strategies and for minimizing the economic and operational consequences of unexpected failures. However, many existing techniques are developed and tested under strictly controlled conditions, limiting their adaptability to the diverse and dynamic settings encountered in practical applications. This paper presents an efficient real-time convolutional neural network (CNN) for diagnosing multiple bearing faults under various noise levels and time-varying rotational speeds. Additionally, we propose a novel Fisher-based spectral separability analysis (SSA) method to elucidate the effectiveness of the designed CNN model. We conducted experiments on both healthy bearings and bearings afflicted with inner race, outer race, and roller ball faults. The experimental results show the superiority of our model over the current state-of-the-art approach in three folds: it achieves substantial accuracy gains of up to 15.8%, it is robust to noise with high performance across various signal-to-noise ratios, and it runs in real-time with processing durations five times less than acquisition. Additionally, by using the proposed SSA technique, we offer insights into the model’s performance and underscore its effectiveness in tackling real-world challenges.

[1390] Trajectory Data Management and Mining: A Survey from Deep Learning to the LLM Era

Wei Chen, Yuanshao Zhu, Yanchuan Chang, Kang Luo, Haomin Wen, Lei Li, Yanwei Yu, Qingsong Wen, Chao Chen, Kai Zheng, Yunjun Gao, Yu Zheng, Xiaofang Zhou, Yuxuan Liang

Main category: cs.LG

TL;DR: Comprehensive review of trajectory computing evolution from deep learning to large language models, covering data management, mining applications, and future directions in mobility analysis.

Details

Motivation: Traditional trajectory computing methods struggle with complex calculations, limited scalability, and real-world adaptability. The paper aims to review the field's evolution from deep learning to large language models to address these limitations and shape next-generation trajectory computing.

Method: Systematic literature review approach: defines trajectory data, overviews deep learning models, explores applications in trajectory management (pre-processing, storage, analysis, visualization) and mining (forecasting, recommendation, classification, time estimation, anomaly detection, generation), and discusses large model advancements.

Result: Provides comprehensive taxonomy of trajectory computing developments, identifies emerging research directions with large models, summarizes application scenarios, public datasets, toolkits, and outlines current challenges with future directions.

Conclusion: Large language models and foundation models promise to reshape next-generation trajectory computing by addressing traditional limitations, though challenges remain in adapting these models to trajectory-specific tasks and real-world complexities.

Abstract: Trajectory computing is a pivotal domain encompassing trajectory data management and mining, garnering widespread attention due to its crucial role in various practical applications such as location services, urban traffic, and public safety. Traditional methods, focusing on simplistic spatio-temporal features, face challenges of complex calculations, limited scalability, and inadequate adaptability to real-world complexities. In this paper, we present a comprehensive review of the development and recent advances in trajectory computing, from deep learning to the more recent large language models. We first define trajectory data and provide a brief overview of widely-used deep learning models. Systematically, we explore deep learning applications in trajectory management (pre-processing, storage, analysis, and visualization) and mining (trajectory-related forecasting, trajectory-related recommendation, trajectory classification, travel time estimation, anomaly detection, and mobility generation). Furthermore, we discuss emerging research directions and recent advancements in large models (represented by foundation models and large language models) for trajectory computing, which promise to reshape the next generation of trajectory computing. Additionally, we summarize application scenarios, public datasets, and toolkits. Finally, we outline current challenges in trajectory computing research and propose future directions. Relevant papers and open-source resources have been collated and are continuously updated at: https://github.com/yoshall/Awesome-Trajectory-Computing.

[1391] Instance Temperature Knowledge Distillation

Zhengbo Zhang, Yuxi Zhou, Jia Gong, Jun Liu, Zhigang Tu

Main category: cs.LG

TL;DR: RLKD: A reinforcement learning-based knowledge distillation method that formulates temperature adjustment as a sequential decision-making task with novel state representation and delayed reward handling.

Details

Motivation: Existing knowledge distillation methods adjust temperature dynamically but only consider immediate benefits, ignoring future returns. The authors aim to address this by treating temperature adjustment as a sequential decision-making problem.

Method: Formulate temperature adjustment as RL task, design novel state representation, handle delayed rewards via instance reward calibration, and devise efficient exploration strategy for learning temperature adjustment policies.

Result: Validated on image classification and object detection tasks, showing effectiveness as a plug-and-play technique that can be inserted into various KD methods.

Conclusion: RLKD successfully addresses the limitation of existing KD methods by considering future returns in temperature adjustment through reinforcement learning, improving knowledge transfer efficiency.

Abstract: Knowledge distillation (KD) enhances the performance of a student network by allowing it to learn the knowledge transferred from a teacher network incrementally. Existing methods dynamically adjust the temperature to enable the student network to adapt to the varying learning difficulties at different learning stages of KD. KD is a continuous process, but when adjusting the temperature, these methods consider only the immediate benefits of the operation in the current learning phase and fail to take into account its future returns. To address this issue, we formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning, termed RLKD. Importantly, we design a novel state representation to enable the agent to make more informed action (i.e. instance temperature adjustment). To handle the problem of delayed rewards in our method due to the KD setting, we explore an instance reward calibration approach. In addition,we devise an efficient exploration strategy that enables the agent to learn valuable instance temperature adjustment policy more efficiently. Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily, and we validate its effectiveness on both image classification and object detection tasks. Our project is at https://itkd123.github.io/ITKD.github.io/.

[1392] Invariant Representation Guided Multimodal Sentiment Decoding with Sequential Variation Regularization

Guoyang Xu, Zhenxi Song, Junqi Xue, Yuxin Liu, Zirui Wang, Zhiguo Zhang

Main category: cs.LG

TL;DR: A dual enhancement strategy for multimodal sentiment analysis that improves temporal and modality consistency through invariant fusion and sequential variation regularization.

Details

Motivation: Addressing the challenge of inconsistent sentiment representation across modalities due to rapid emotional fluctuations over time, which leads to instability and compromised prediction performance in multimodal sentiment analysis.

Method: Proposes a robust sentiment representation dual enhancement strategy with two key components: 1) Modality invariant fusion mechanism to capture stable cross-modal representations, and 2) Sequential variation regularization term (total variation regularization degenerated into 1D linear differences) to regulate learning trajectory during backward propagation.

Result: Extensive experiments on three standard public datasets validate the effectiveness of the proposed approach in improving multimodal sentiment analysis performance.

Conclusion: The dual enhancement strategy successfully addresses temporal and modality consistency challenges in multimodal sentiment analysis, leading to more robust and stable sentiment representations.

Abstract: Achieving consistent sentiment representation across diverse modalities remains a key challenge in multimodal sentiment analysis. However, rapid emotional fluctuations over time often introduce instability, leading to compromised prediction performance. To address this challenge, we propose a robust sentiment representation dual enhancement strategy that simultaneously enhances the temporal and modality dimensions, guided by targeted mechanisms in both forward and backward propagation. Specifically, in the modality dimension, we introduce a modality invariant fusion mechanism that fosters stable cross-modal representations, which aim to capture the common and stable representations shared across different modalities. In the temporal dimension, we impose a specialized sequential variation regularization term that regulates the model’s learning trajectory during backward propagation, which is essentially total variation regularization degenerated into one-dimensional linear differences. Extensive experiments on three standard public datasets validate the effectiveness of our proposed approach.

[1393] FPBoost: Fully Parametric Gradient Boosting for Survival Analysis

Alberto Archetti, Eugenio Lomurno, Diego Piccinotti, Matteo Matteucci

Main category: cs.LG

TL;DR: FPBoost is a survival analysis model combining weighted parametric hazard functions with gradient boosting, offering flexible event-time modeling without restrictive assumptions while maintaining interpretability.

Details

Motivation: Current machine learning methods for survival analysis often rely on restrictive assumptions (proportional hazard, time discretization, accelerated failure time) that limit their flexibility. There's a need for models that can approximate any hazard function while maintaining interpretability and working well with limited data.

Method: FPBoost combines a weighted sum of fully parametric hazard functions with gradient boosting. It uses decision trees to estimate distribution parameters by maximizing the full survival likelihood. The approach allows for universal approximation of hazard functions while leveraging well-established parametric distributions for interpretability.

Result: FPBoost demonstrates strong performance across multiple benchmark datasets, showing robustness and versatility in survival estimation. The model achieves good concordance and calibration metrics, proving effective as a flexible survival analysis tool.

Conclusion: FPBoost provides a novel approach to survival analysis that offers full event-time modeling flexibility without restrictive assumptions, while maintaining interpretability through parametric distributions. It serves as a robust and versatile tool for survival estimation across various domains.

Abstract: Survival analysis is a statistical framework for modeling time-to-event data. It plays a pivotal role in medicine, reliability engineering, and social science research, where understanding event dynamics even with few data samples is critical. Recent advancements in machine learning, particularly those employing neural networks and decision trees, have introduced sophisticated algorithms for survival modeling. However, many of these methods rely on restrictive assumptions about the underlying event-time distribution, such as proportional hazard, time discretization, or accelerated failure time. In this study, we propose FPBoost, a survival model that combines a weighted sum of fully parametric hazard functions with gradient boosting. Distribution parameters are estimated with decision trees trained by maximizing the full survival likelihood. We show how FPBoost is a universal approximator of hazard functions, offering full event-time modeling flexibility while maintaining interpretability through the use of well-established parametric distributions. We evaluate concordance and calibration of FPBoost across multiple benchmark datasets, showcasing its robustness and versatility as a new tool for survival estimation.

[1394] Exploiting Latent Linearity in LLMs Improves Explainable Molecular Representation Learning

Zhuoran Li, Xu Sun, Wanyu Lin, Jiannong Cao

Main category: cs.LG

TL;DR: MoleX framework decomposes molecular embeddings from LLMs into concept-aligned space for explainable molecular representation learning, revealing linear mappings to chemical concepts that improve downstream performance and efficiency.

Details

Motivation: To analyze LLMs' latent representations in molecular domains for better explainability and downstream performance, as LLMs have shown utility in drug discovery and materials design but lack interpretability.

Method: Proposes MoleX framework that decomposes molecular embeddings within LLM representations into a concept-aligned space, showing these high-dimensional embeddings admit linear mapping onto chemically consistent concepts.

Result: The uncovered linearity aligns with established chemical principles, indicating mechanistically explainable latent structure. MoleX outperforms existing approaches in accuracy, explainability, and efficiency, achieving CPU inference 300 times faster with 100,000 fewer parameters.

Conclusion: MoleX provides an effective framework for explainable molecular representation learning that reveals interpretable latent structures in LLMs, improving both predictive performance and chemical interpretability while being highly efficient.

Abstract: Large language models (LLMs) have demonstrated broad utility across molecular domains, spanning drug discovery and materials design. Analyzing LLMs’ latent representations is crucial for elucidating their underlying mechanisms, improving explainability, and ultimately advancing downstream performance. We propose MoleX, a simple yet effective framework that decomposes molecular embeddings within LLM representations into a concept-aligned space for explainable molecular representation learning. We further show that these high-dimensional embeddings admit a linear mapping onto chemically consistent concepts. Our analysis suggests that the uncovered linearity aligns with established chemical principles, indicating a mechanistically explainable latent structure in LLM representations for scientific applications. When applied to downstream tasks, this latent linearity improves both predictive and explanatory performance. Extensive experiments demonstrate that MoleX outperforms existing approaches in accuracy, explainability, and efficiency, achieving CPU inference on large-scale datasets 300 times faster with 100,000 fewer parameters than LLMs.

[1395] Conformal mapping based Physics-informed neural networks for designing neutral inclusions

Daehee Cho, Hyeonmin Yun, Jaeyong Lee, Mikyoung Lim

Main category: cs.LG

TL;DR: CoCo-PINNs combine conformal mapping with PINNs to solve neutral inclusion problems with imperfect boundary conditions, improving credibility, consistency, and stability over traditional PINNs.

Details

Motivation: Traditional Physics-Informed Neural Networks (PINNs) struggle with inverse neutral inclusion problems involving arbitrary shapes and imperfect boundary conditions, requiring a more robust approach.

Method: Developed Conformal Mapping Coordinates Physics-Informed Neural Networks (CoCo-PINNs) that integrate geometric function theory with PINNs to model interface functions through neural network training for neutral inclusion effects.

Result: CoCo-PINNs effectively solve forward-inverse problems and enhance PINN performance in terms of credibility, consistency, and stability for neutral inclusion problems.

Conclusion: The integration of conformal mapping with PINNs provides an effective framework for solving complex neutral inclusion problems with imperfect boundary conditions and arbitrary shapes.

Abstract: We address the neutral inclusion problem with imperfect boundary conditions, focusing on designing interface functions for inclusions of arbitrary shapes. Traditional Physics-Informed Neural Networks (PINNs) struggle with this inverse problem, leading to the development of Conformal Mapping Coordinates Physics-Informed Neural Networks (CoCo-PINNs), which integrate geometric function theory with PINNs. CoCo-PINNs effectively solve forward-inverse problems by modeling the interface function through neural network training, which yields a neutral inclusion effect. This approach enhances the performance of PINNs in terms of credibility, consistency, and stability.

[1396] PIQL: Projective Implicit Q-Learning with Support Constraint for Offline Reinforcement Learning

Xinchen Han, Hossam Afifi, Michel Marot

Main category: cs.LG

TL;DR: PIQL improves IQL by replacing fixed expectile parameter with projection-based parameter and using support constraint instead of density constraint for better offline RL performance.

Details

Motivation: IQL has limitations: fixed expectile hyperparameter and density-based policy improvement method hinder adaptability and performance in offline RL.

Method: Projective IQL (PIQL) uses projection-based parameter instead of fixed expectile, extends to multi-step value estimation, and adopts support constraint rather than density constraint for policy improvement.

Result: Achieves state-of-the-art performance on D4RL and NeoRL2 benchmarks with robust gains across diverse domains.

Conclusion: PIQL maintains IQL’s theoretical framework while improving adaptability and performance through projective parameterization and support constraints.

Abstract: Offline Reinforcement Learning (RL) faces a fundamental challenge of extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) employs expectile regression to achieve in-sample learning. Nevertheless, IQL relies on a fixed expectile hyperparameter and a density-based policy improvement method, both of which impede its adaptability and performance. In this paper, we propose Projective IQL (PIQL), a projective variant of IQL enhanced with a support constraint. In the policy evaluation stage, PIQL substitutes the fixed expectile hyperparameter with a projection-based parameter and extends the one-step value estimation to a multi-step formulation. In the policy improvement stage, PIQL adopts a support constraint instead of a density constraint, ensuring closer alignment with the policy evaluation. Theoretically, we demonstrate that PIQL maintains the expectile regression and in-sample learning framework, guarantees monotonic policy improvement, and introduces a progressively more rigorous criterion for advantageous actions. Experiments on D4RL and NeoRL2 benchmarks demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.

[1397] Decoding Generalization from Memorization in Deep Neural Networks

Simran Ketha, Venkatakrishnan Ramaswamy

Main category: cs.LG

TL;DR: Deep networks can memorize shuffled labels while losing generalization to true labels, but this paper shows they retain latent generalization ability in their representations that can be decoded.

Details

Motivation: To understand why deep networks that memorize shuffled labels lose generalization to true labels, and to determine whether this is due to irreversible representation reorganization or latent generalization ability that's not being properly read out.

Method: Empirical investigation showing that models trained on shuffled labels retain information for improved generalization in their representations. The authors develop a technique to decode this latent generalization ability from the trained model’s internal representations.

Result: Evidence supports that networks retain significant latent generalization ability despite memorizing shuffled labels. The decoding technique successfully extracts this ability, demonstrating improved generalization to true labels from the same representations.

Conclusion: Poor generalization in memorization scenarios is not due to irreversible representation damage, but rather networks “choosing” suboptimal readouts. Latent generalization ability exists and can be decoded from trained models.

Abstract: Overparameterized deep networks that generalize well have been key to the dramatic success of deep learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. When class labels in the training set are shuffled to varying degrees, it is known that deep networks can still reach perfect training accuracy at the detriment of generalization to true labels – a phenomenon that has been called memorization. It has, however, been unclear why the poor generalization to true labels that accompanies such memorization, comes about. One possibility is that during training, all layers of the network irretrievably re-organize their representations in a manner that makes generalization to true labels difficult. The other possibility is that one or more layers of the trained network retain significantly more latent ability to generalize to true labels, but the network somehow “chooses” to readout in a manner that is detrimental to generalization to true labels. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially-improved generalization to true labels. Furthermore, such abilities can be easily decoded from the internals of the trained model, and we build a technique to do so. We demonstrate results on multiple models trained with standard datasets. Our code is available at: https://github.com/simranketha/MASC_DNN.

[1398] CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter

Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang

Main category: cs.LG

TL;DR: Proposes Tree-RAG acceleration using improved Cuckoo Filter for faster entity localization in hierarchical knowledge retrieval while maintaining generation quality.

Details

Motivation: Tree-RAG improves generation quality by retrieving from hierarchical knowledge structures, but faces computational efficiency bottlenecks in entity localization during retrieval.

Method: Uses improved Cuckoo Filter as efficient data structure to optimize entity localization in Tree-RAG, supporting rapid membership queries and dynamic updates.

Result: Method is significantly faster than naive Tree-RAG (hundreds of times faster with large tree counts) while maintaining high generative quality.

Conclusion: Improved Cuckoo Filter effectively accelerates Tree-RAG retrieval, addressing computational bottlenecks in hierarchical knowledge retrieval for RAG systems.

Abstract: Although retrieval-augmented generation(RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures for Tree-RAG. This paper proposes a Tree-RAG acceleration method based on the improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. The experiment results demonstrate that our method is much faster than naive Tree-RAG while maintaining high levels of generative quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.

[1399] Achieving Time Series Reasoning Requires Rethinking Model Design, Tasks Formulation, and Evaluation

Yaxuan Kong, Yiyuan Yang, Shiyu Wang, Chenghao Liu, Yuxuan Liang, Ming Jin, Stefan Zohren, Dan Pei, Yan Liu, Qingsong Wen

Main category: cs.LG

TL;DR: Survey paper analyzing the rapid growth of multimodal LLMs for time series understanding, identifying gaps in current approaches and calling for unified frameworks for robust time series reasoning.

Details

Motivation: Time series understanding is crucial for real-world applications, and while multimodal LLMs have shown promise for enhancing time series analysis with contextual information, current methods struggle in real-world settings despite explosive growth in research publications.

Method: The authors conduct a comprehensive analysis of 20 influential works from 2025, examining model design, task formulation, and evaluation approaches. They identify critical gaps and define time series reasoning as a distinct research area.

Result: Identified three major gaps: 1) methods adapt NLP techniques without proper attention to core time series properties, 2) tasks remain limited to traditional prediction/classification, and 3) evaluations focus on benchmarks rather than robustness, interpretability, or decision relevance.

Conclusion: Achieving true time series reasoning requires rethinking model design, task formulation, and evaluation together. The paper calls for unified frameworks that address robustness, interpretability, and decision relevance for real-world applications.

Abstract: Understanding time series data is fundamental to many real-world applications. Recent work explores multimodal large language models (MLLMs) to enhance time series understanding with contextual information beyond numerical signals. This area has grown from 7 papers in 2023 to over 580 in 2025, yet existing methods struggle in real-world settings. We analyze 20 influential works from 2025 across model design, task formulation, and evaluation, and identify critical gaps: methods adapt NLP techniques with limited attention to core time series properties; tasks remain restricted to traditional prediction and classification; and evaluations emphasize benchmarks over robustness, interpretability, or decision relevance. We argue that achieving time series reasoning requires rethinking model design, task formulation, and evaluation together. We define time series reasoning, outline challenges and future directions, and call on researchers to develop unified frameworks for robust, interpretable, and decision-relevant reasoning in real-world applications. The material is available at https://github.com/Eleanorkong/Awesome-Time-Series-Reasoning.

[1400] LEAD: An EEG Foundation Model for Alzheimer’s Disease Detection

Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang

Main category: cs.LG

TL;DR: LEAD is a large-scale foundation model for EEG-based Alzheimer’s disease detection using a gated temporal-spatial Transformer with subject-regularized training and medical contrastive learning.

Details

Motivation: To address challenges in EEG-based AD detection: lack of large-scale datasets, limited generalizability across subjects, and difficulty handling heterogeneous EEG data with varying lengths, channels, and sampling rates.

Method: Proposes LEAD foundation model with: 1) gated temporal-spatial Transformer that adapts to arbitrary EEG configurations, 2) subject-regularized training for enhanced subject-level features, 3) medical contrastive pre-training on 13 datasets (4 AD + 9 non-AD neurological disorders), and fine-tuning on 5 AD datasets.

Result: Achieves best average ranking across all 20 evaluations on 5 downstream AD datasets, substantially outperforming existing approaches including SOTA EEG foundation models.

Conclusion: LEAD demonstrates effectiveness and practical potential for real-world EEG-based AD detection, leveraging the world’s largest EEG-AD corpus (2,238 subjects).

Abstract: Electroencephalography (EEG) provides a non-invasive, highly accessible, and cost-effective approach for detecting Alzheimer’s disease (AD). However, existing methods, whether based on handcrafted feature engineering or standard deep learning, face three major challenges: 1) the lack of large-scale EEG-based AD datasets for robust representation learning; 2) limited generalizability across subjects; and 3) difficulty in adapting to highly heterogeneous data. To address these challenges, we curate the world’s largest EEG-AD corpus to date, comprising 2,238 subjects. Leveraging this unique resource, we propose LEAD, the first large-scale foundation model for EEG-based AD detection. Specifically, we design a gated temporal-spatial Transformer that can adapt to EEG recordings with arbitrary lengths, channel configurations, and sampling rates. In addition, we introduce a subject-regularized training strategy to enhance subject-level feature learning. We further employ medical contrastive learning for pre-training on 13 datasets, including 4 AD datasets and 9 non-AD neurological disorder datasets, and fine-tune/test the model on the other 5 AD datasets. LEAD achieves the best average ranking across all 20 evaluations on 5 downstream datasets, substantially outperforming existing approaches, including state-of-the-art (SOTA) EEG foundation models. These results strongly demonstrate the effectiveness and practical potential of the proposed method for real-world EEG-based AD detection. Source code: https://github.com/DL4mHealth/LEAD

[1401] On the Importance of Pretraining Data Alignment for Atomic Property Prediction

Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Main category: cs.LG

TL;DR: Small, task-aligned pretraining datasets can outperform massive mixed datasets for atomic property prediction when selected using Chemical Similarity Index (CSI) metric.

Details

Motivation: Challenge the current paradigm that links progress in atomic property prediction to growing dataset sizes and computational resources, showing that quality of data alignment matters more than quantity.

Method: Introduce Chemical Similarity Index (CSI), a metric inspired by Fréchet Inception Distance, to quantify alignment between pretraining datasets and downstream tasks. Select most aligned datasets with minimal CSI distance for pretraining.

Result: Models pretrained on smaller, focused datasets achieve better downstream performance than those pretrained on massive mixed datasets like JMP, even when mixed datasets include the aligned data. Adding poorly aligned data can degrade performance.

Conclusion: Quality often outperforms quantity in pretraining for atomic property prediction; careful dataset selection using alignment metrics like CSI is more important than simply scaling dataset size.

Abstract: This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most aligned dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned with the downstream task. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data is poorly aligned with the target task. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.

[1402] Entropy-Lens: Uncovering Decision Strategies in LLMs

Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Liò

Main category: cs.LG

TL;DR: Entropy-Lens analyzes token-space dynamics in LLMs using entropy of logit-lens predictions to create entropy profiles that reveal expansion/pruning strategies in residual streams.

Details

Motivation: Most interpretability research focuses on internal latent representations while token-space dynamics remain underexplored due to high dimensionality and categorical nature of token distributions, requiring new analysis methods.

Method: Introduces Entropy-Lens that uses entropy of logit-lens predictions to create a per-layer scalar metric called entropy profiles, which distill token-space dynamics into low-dimensional signals across different model sizes and families.

Result: Entropy profiles reveal: (i) token prediction dynamics driven by expansion/pruning strategies, (ii) family-specific dynamics invariant under depth rescaling, (iii) characteristics of task type/output format, (iv) unequal impact on downstream performance with expansion strategy being more critical.

Conclusion: The method enhances understanding of residual stream dynamics, enabling granular assessment of information processing across model depth through token-space analysis.

Abstract: In large language models (LLMs), each block operates on the residual stream to map input token sequences to output token distributions. However, most of the interpretability literature focuses on internal latent representations, leaving token-space dynamics underexplored. The high dimensionality and categoricity of token distributions hinder their analysis, as standard statistical descriptors are not suitable. We show that the entropy of logit-lens predictions overcomes these issues. In doing so, it provides a per-layer scalar, permutation-invariant metric. We introduce Entropy-Lens to distill the token-space dynamics of the residual stream into a low-dimensional signal. We call this signal the entropy profile. We apply our method to a variety of model sizes and families, showing that (i) entropy profiles uncover token prediction dynamics driven by expansion and pruning strategies; (ii) these dynamics are family-specific and invariant under depth rescaling; (iii) they are characteristic of task type and output format; (iv) these strategies have unequal impact on downstream performance, with the expansion strategy usually being more critical. Ultimately, our findings further enhance our understanding of the residual stream, enabling a granular assessment of how information is processed across model depth.

[1403] Mixtera: A Data Plane for Foundation Model Training

Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic

Main category: cs.LG

TL;DR: Mixtera is a declarative data plane for foundation model training that enables precise control over data mixtures and training order, supporting dynamic adjustments and scaling to large GPU clusters.

Details

Motivation: As training datasets grow to trillions of tokens from diverse sources, manually managing data mixtures and training order becomes impractical, yet these factors significantly impact model accuracy. Current approaches lack systematic control over data composition during training.

Method: Mixtera is a centralized, read-only layer deployed on top of existing training data collections that allows declarative specification of which data samples to use, in what proportion, and in what order. It operates independently of filesystem structure, supports mixtures across arbitrary properties, and enables dynamic adjustments based on model feedback.

Result: Mixtera does not bottleneck training and scales to 256 GH200 superchips. The system successfully implements the Adaptive Data Optimization (ADO) algorithm and explores data mixtures for vision-language models, demonstrating practical utility for modern foundation model training.

Conclusion: Mixtera provides a systematic solution for declarative data management in foundation model training, enabling researchers to precisely control data mixtures and training order at scale, which is crucial for optimizing model performance.

Abstract: State-of-the-art large language and vision models are trained over trillions of tokens that are aggregated from a large variety of sources. As training data collections grow, manually managing the samples becomes time-consuming, tedious, and prone to errors. Yet recent research shows that the data mixture and the order in which samples are visited during training can significantly influence model accuracy. We build and present Mixtera, a data plane for foundation model training that enables users to declaratively express which data samples should be used in which proportion and in which order during training. Mixtera is a centralized, read-only layer that is deployed on top of existing training data collections and can be declaratively queried. It operates independently of the filesystem structure and supports mixtures across arbitrary properties (e.g., language, source dataset) as well as dynamic adjustment of the mixture based on model feedback. We experimentally evaluate Mixtera and show that our implementation does not bottleneck training and scales to 256 GH200 superchips. We demonstrate how Mixtera supports recent advancements in mixing strategies by implementing the proposed Adaptive Data Optimization (ADO) algorithm in the system and evaluating its performance impact. We also explore the role of mixtures for vision-language models.

[1404] Causally Reliable Concept Bottleneck Models

Giovanni De Felice, Arianna Casanova Flores, Francesco De Santis, Silvia Santini, Johannes Schneider, Pietro Barbiero, Alberto Termine

Main category: cs.LG

TL;DR: Causally reliable Concept Bottleneck Models (C²BMs) enhance interpretable AI by structuring concept bottlenecks according to real-world causal mechanisms, improving causal reasoning and out-of-distribution generalization.

Details

Motivation: Current concept-based models, while interpretable, fail to account for true causal mechanisms underlying data phenomena, limiting their ability to support causal reasoning, out-of-distribution generalization, and fairness constraints.

Method: Propose C²BMs that enforce reasoning through a bottleneck of concepts structured according to real-world causal mechanisms. Introduce a pipeline to automatically learn this causal structure from observational data and unstructured background knowledge (e.g., scientific literature).

Result: C²BMs are more interpretable, causally reliable, and improve responsiveness to interventions compared to standard opaque and concept-based models, while maintaining accuracy.

Conclusion: Causally structured concept bottlenecks provide a promising approach for building more reliable and interpretable AI systems that better capture real-world causal mechanisms.

Abstract: Concept-based models are an emerging paradigm in deep learning that constrains the inference process to operate through human-interpretable variables, facilitating explainability and human interaction. However, these architectures, on par with popular opaque neural models, fail to account for the true causal mechanisms underlying the target phenomena represented in the data. This hampers their ability to support causal reasoning tasks, limits out-of-distribution generalization, and hinders the implementation of fairness constraints. To overcome these issues, we propose Causally reliable Concept Bottleneck Models (C$^2$BMs), a class of concept-based architectures that enforce reasoning through a bottleneck of concepts structured according to a model of the real-world causal mechanisms. We also introduce a pipeline to automatically learn this structure from observational data and unstructured background knowledge (e.g., scientific literature). Experimental evidence suggests that C$^2$BMs are more interpretable, causally reliable, and improve responsiveness to interventions w.r.t. standard opaque and concept-based models, while maintaining their accuracy.

[1405] Large-Scale Auto-bidding with Nash Equilibrium Constraints

Zhiyu Mou, Miao Xu, Rongquan Bai, Zhuoran Yang, Chuan Yu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: NCB is a scalable auto-bidding framework that incorporates Nash equilibrium constraints to address strategic interdependence among advertisers in online advertising platforms.

Details

Motivation: Current industrial auto-bidding systems use single-agent methods that ignore strategic interdependence among advertisers' bids, leading to unstable or suboptimal outcomes. While game-theoretic approaches exist, they lack scalability or principled equilibrium-selection aligned with platform-wide objectives.

Method: Introduces Nash Equilibrium-Constrained Bidding (NCB), which recasts auto-bidding as a platform-wide optimization problem subject to Nash equilibrium constraints. Develops a penalty-based primal-dual gradient method with convergence guarantees and an efficient algorithm for industrial deployment.

Result: Extensive experiments validate the effectiveness of the approach, showing it can handle fine-grained strategic interdependencies while ensuring both agent-level stability and ecosystem-level optimality.

Conclusion: NCB bridges the gap between scalable single-agent methods and game-theoretic approaches by providing a principled, scalable framework that accounts for strategic interdependence in auto-bidding systems.

Abstract: Auto-bidding has become a cornerstone of modern online advertising platforms, enabling many advertisers to automate bidding at scale and optimize campaign performance. However, prevailing industrial systems rely on single-agent auto-bidding methods that are scalable but overlook the strategic interdependence among advertisers’ bids, leading to unstable or suboptimal outcomes. While recent works recognize the game-theoretic nature of auto-bidding, existing approaches remain either computationally intractable at scale or lack a principled equilibrium-selection that aligns with platform-wide objectives. In this paper, we bridge this gap by introducing Nash Equilibrium-Constrained Bidding (NCB), a principled and scalable auto-bidding framework that recasts auto-bidding as a platform-wide optimization problem subject to Nash equilibrium constraints. This approach accounts for fine-grained strategic interdependencies among advertisers, ensuring both agent-level stability and ecosystem-level optimality. Notably, we develop a theoretically sound penalty-based primal-dual gradient method with rigorous convergence guarantees, supported by an efficient algorithm suitable for industrial deployment. Extensive experiments validate the effectiveness of our approach.

[1406] 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities

Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzciński, Benjamin Eysenbach

Main category: cs.LG

TL;DR: Deep self-supervised RL with 1024-layer networks significantly outperforms shallow architectures in unsupervised goal-conditioned tasks, achieving 2-50x performance improvements on locomotion and manipulation tasks.

Details

Motivation: While self-supervised learning has driven breakthroughs in language and vision, comparable progress in reinforcement learning has remained elusive. The paper aims to unlock substantial improvements in RL scalability by exploring deep architectures, as most RL papers rely on shallow networks (2-5 layers).

Method: The approach uses self-supervised contrastive RL in an unsupervised goal-conditioned setting where no demonstrations or rewards are provided. The key innovation is scaling network depth up to 1024 layers, significantly beyond typical RL architectures. Agents must explore from scratch and learn to maximize the likelihood of reaching commanded goals.

Result: Increasing depth to 1024 layers boosts performance by 2-50x compared to the baseline self-supervised contrastive RL algorithm, outperforming other goal-conditioned baselines on simulated locomotion and manipulation tasks. Deep networks not only increase success rates but also qualitatively change learned behaviors.

Conclusion: Network depth is a critical factor for scaling self-supervised RL, with deep architectures (up to 1024 layers) unlocking substantial performance improvements in unsupervised goal-conditioned learning tasks.

Abstract: Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 - 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance. Our experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals. Evaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\times$ - $50\times$, outperforming other goal-conditioned baselines. Increasing the model depth not only increases success rates but also qualitatively changes the behaviors learned. The project webpage and code can be found here: https://wang-kevin3290.github.io/scaling-crl/.

[1407] DiTOX: Fault Detection and Localization in the ONNX Optimizer

Nikolaos Louloudakis, Ajitha Rajan

Main category: cs.LG

TL;DR: DiTOX is an automated framework for assessing correctness of the ONNX Optimizer using differential testing, finding significant correctness issues in popular AI model optimization passes.

Details

Motivation: The ONNX Optimizer is widely used for graph-level model optimizations but its ability to preserve model correctness has never been systematically evaluated, despite being used by default in many applications.

Method: DiTOX uses differential testing, fault localization, and evaluation techniques. It applies optimization passes to a corpus of ONNX models, executes both original and optimized versions on user-defined inputs, detects discrepancies, and isolates responsible optimization passes through iterative analysis.

Result: Evaluation on 130 models from ONNX Model Hub showed 9.2% crashed the optimizer or produced invalid models. Output discrepancies occurred in 30% of classification models and 16.6% of object detection/segmentation models. DiTOX uncovered 15 issues (14 previously unknown) affecting 9 of 47 optimization passes.

Conclusion: DiTOX provides a simple and effective approach for validating AI model optimizers, demonstrating significant correctness issues in the widely-used ONNX Optimizer, and is readily extensible beyond ONNX.

Abstract: The ONNX Optimizer, part of the official ONNX repository and widely adopted for graph-level model optimizations, is used by default to optimize ONNX models. Despite its popularity, its ability to preserve model correctness has not been systematically evaluated. We present DiTOX, an automated framework for comprehensively assessing the correctness of the ONNX Optimizer using differential testing, fault localization, and evaluation techniques that generalize to other compiler optimizers. DiTOX applies optimization passes to a corpus of ONNX models, executes both original and optimized versions on user-defined inputs, and detects discrepancies in behavior or optimizer failures. When divergences are observed, DiTOX isolates the responsible optimization pass through iterative, fine-grained analysis. We evaluated DiTOX on 130 models from the ONNX Model Hub spanning vision and language tasks. We found that 9.2% of model instances crashed the optimizer or produced invalid models under default settings. Moreover, output discrepancies occurred in 30% of classification models and 16.6% of object detection and segmentation models, while text-based models were largely robust. Overall, DiTOX uncovered 15 issues – 14 previously unknown – affecting 9 of the 47 optimization passes as well as the optimizer infrastructure. All issues were reported to the ONNX Optimizer developers. Our results demonstrate that DiTOX provides a simple and effective approach for validating AI model optimizers and is readily extensible beyond ONNX.

[1408] UniSymNet: A Unified Symbolic Network Guided by Transformer

Xinxin Li, Juan Zhang, Da Li, Xingyu Liu, Jin Xu, Junping Yin

Main category: cs.LG

TL;DR: UniSymNet: A unified symbolic network approach that transforms binary nonlinear operators into nested unary operators, uses Transformer pre-training for structural guidance, and achieves competitive performance on symbolic regression benchmarks.

Details

Motivation: Traditional symbolic regression algorithms struggle with increasing tree complexity, while existing symbolic networks face challenges with binary nonlinear operators and overfitting from fixed architectures. There's a need for a more flexible and effective approach to symbolic discovery.

Method: Proposes Unified Symbolic Network (UniSymNet) that unifies binary nonlinear operators into nested unary operators, pre-trains a Transformer with novel label encoding for structural selection guidance, and uses objective-specific optimization strategies for parameter learning.

Result: UniSymNet demonstrates high fitting accuracy, excellent symbolic solution rate, and relatively low expression complexity, achieving competitive performance on both low-dimensional Standard Benchmarks and high-dimensional SRBench.

Conclusion: The unified symbolic network approach effectively addresses limitations of existing symbolic regression methods, providing a promising framework for symbolic discovery with better generalization and complexity control.

Abstract: Symbolic Regression (SR) is a powerful technique for automatically discovering mathematical expressions from input data. Mainstream SR algorithms search for the optimal symbolic tree in a vast function space, but the increasing complexity of the tree structure limits their performance. Inspired by neural networks, symbolic networks have emerged as a promising new paradigm. However, most existing symbolic networks still face certain challenges: binary nonlinear operators ${\times, ÷}$ cannot be naturally extended to multivariate operators, and training with fixed architecture often leads to higher complexity and overfitting. In this work, we propose a Unified Symbolic Network that unifies nonlinear binary operators into nested unary operators and define the conditions under which UniSymNet can reduce complexity. Moreover, we pre-train a Transformer model with a novel label encoding method to guide structural selection, and adopt objective-specific optimization strategies to learn the parameters of the symbolic network. UniSymNet shows high fitting accuracy, excellent symbolic solution rate, and relatively low expression complexity, achieving competitive performance on low-dimensional Standard Benchmarks and high-dimensional SRBench.

[1409] Sparse Latent Factor Forecaster (SLFF) with Iterative Inference for Transparent Multi-Horizon Commodity Futures Prediction

Abhijit Gupta

Main category: cs.LG

TL;DR: SLFF is a structured prediction model combining sparse coding and amortized inference for multi-horizon commodity futures forecasting with interpretable latent factors.

Details

Motivation: Commodity futures are highly volatile and forecasting across multiple horizons with interpretable drivers remains challenging. Existing methods lack both predictive accuracy and interpretability of driving factors.

Method: Proposes Sparse Latent Factor Forecaster with Iterative Inference (SLFF), which combines sparse coding, unrolled optimization, and amortized inference. The model explicitly optimizes sparse latent codes to explain multi-horizon futures trajectories and trains an encoder validated against optimization-based solutions. Includes information set aware pipeline with vintage macro releases, lag aware fills, and leakage checks.

Result: On Copper and WTI futures (2005-2023), SLFF achieves competitive RMSE and MAE, improves directional skill beyond persistence, and yields factors that are stable across seeds and linked to measurable fundamentals.

Conclusion: SLFF provides an effective approach for commodity futures forecasting that balances predictive performance with interpretability through sparse latent factors, with reproducible code and diagnostics released.

Abstract: Commodity futures are volatile. Forecasting across horizons with interpretable drivers remains challenging. We propose the Sparse Latent Factor Forecaster with Iterative Inference (SLFF), a structured prediction latent variable model that combines sparse coding, unrolled optimization, and amortized inference. SLFF explicitly optimizes a sparse latent code to explain multi-horizon futures trajectories and trains an encoder whose outputs are validated against the optimization-based solution before deployment. The method is paired with an information set aware pipeline (vintage macro releases, lag aware fills, leakage checks) and evaluated under rolling origin folds against representative statistical and neural baselines. We provide quantitative criteria for factor labeling and directional diagnostics that account for no change regimes. On Copper and WTI futures (2005-2023), SLFF achieves competitive RMSE and MAE, improves directional skill beyond persistence, and yields factors that are stable across seeds and linked to measurable fundamentals. Code, diagnostics, and information set specifications are released for reproducibility.

[1410] True Zero-Shot Inference of Dynamical Systems Preserving Long-Term Statistics

Christoph Jürgen Hemmer, Daniel Durstewitz

Main category: cs.LG

TL;DR: DynaMix is a novel mixture-of-experts architecture for dynamical system reconstruction that enables zero-shot generalization to out-of-domain systems, outperforming existing time series foundation models with far fewer parameters.

Details

Motivation: Current dynamical system reconstruction approaches require retraining for each new system, lacking the zero-shot and in-context inference capabilities seen in large language models. The authors aim to create a DSR model that can generalize to novel systems without retraining.

Method: DynaMix uses a multivariate ALRNN-based mixture-of-experts architecture pre-trained for dynamical system reconstruction. It’s designed to infer surrogate models from observed data and forecast long-term evolution of novel systems without any retraining.

Result: DynaMix outperforms time series foundation models like Chronos in long-term statistics and often short-term forecasts, even on real-world data not in its training corpus. It achieves this with only 0.1% of parameters and orders of magnitude faster inference times.

Conclusion: Models built on dynamical system principles have significant potential for advancing time series prediction. DynaMix demonstrates that DSR models can achieve zero-shot generalization capabilities similar to LLMs, outperforming existing TS models while being more efficient.

Abstract: Complex, temporally evolving phenomena, from climate to brain activity, are governed by dynamical systems (DS). DS reconstruction (DSR) seeks to infer generative surrogate models of these from observed data, reproducing their long-term behavior. Existing DSR approaches require purpose-training for any new system observed, lacking the zero-shot and in-context inference capabilities known from LLMs. Here we introduce DynaMix, a novel multivariate ALRNN-based mixture-of-experts architecture pre-trained for DSR, the first DSR model able to generalize zero-shot to out-of-domain DS. Just from a provided context signal, without any re-training, DynaMix faithfully forecasts the long-term evolution of novel DS where existing time series (TS) foundation models, like Chronos, fail – at a fraction of the number of parameters (0.1%) and orders of magnitude faster inference times. DynaMix outperforms TS foundation models in terms of long-term statistics, and often also short-term forecasts, even on real-world time series, like traffic or weather data, typically used for training and evaluating TS models, but not at all part of DynaMix’ training corpus. We illustrate some of the failure modes of TS models for DSR problems, and conclude that models built on DS principles may bear a huge potential also for advancing the TS prediction field.

[1411] HyBattNet: Hybrid Framework for Predicting the Remaining Useful Life of Lithium-Ion Batteries

Khoa Tran, Tri Le, Bao Huynh, Hung-Cuong Trinh, Vy-Rin Nguyen, T. Nguyen-Thoi, Vin Nguyen-Thai

Main category: cs.LG

TL;DR: A hybrid deep learning approach for lithium-ion battery RUL prediction using novel signal preprocessing and CNN-A-LSTM-ODE-LSTM architecture with transfer learning capabilities.

Details

Motivation: Accurate RUL prediction is crucial for timely maintenance of lithium-ion batteries in electric applications, requiring robust methods that can handle limited target data through transfer learning.

Method: Proposes a novel signal preprocessing pipeline with derived capacity features, denoising, and delta-based enhancement, followed by a hybrid deep learning model combining 1D CNN, Attentional LSTM (A-LSTM), and ODE-LSTM blocks to capture both continuous and discrete temporal dynamics.

Result: Outperforms baseline deep learning and machine learning techniques on two LFP/graphite battery datasets, achieving RMSE of 101.59, with robust performance even when fine-tuned on limited target data.

Conclusion: The proposed method demonstrates strong potential for real-world RUL prediction applications, particularly with its transfer learning capabilities for scenarios with limited target data.

Abstract: Accurate prediction of the Remaining Useful Life (RUL) is essential for enabling timely maintenance of lithium-ion batteries, impacting the operational efficiency of electric applications that rely on them. This paper proposes a RUL prediction approach that leverages data from recent charge-discharge cycles to estimate the number of remaining usable cycles. The approach introduces both a novel signal preprocessing pipeline and a deep learning prediction model. In the signal preprocessing pipeline, a derived capacity feature is computed using interpolated current and capacity signals. Alongside original capacity, voltage and current, these features are denoised and enhanced using statistical metrics and a delta-based method to capture differences between the current and previous cycles. In the prediction model, the processed features are then fed into a hybrid deep learning architecture composed of 1D Convolutional Neural Networks (CNN), Attentional Long Short-Term Memory (A-LSTM), and Ordinary Differential Equation-based LSTM (ODE-LSTM) blocks. The ODE-LSTM architecture employs ordinary differential equations to integrate continuous dynamics into sequence-to-sequence modeling, thereby combining continuous and discrete temporal representations, while the A-LSTM incorporates an attention mechanism to capture local temporal dependencies. The model is further evaluated using transfer learning across different learning strategies and target data partitioning scenarios. Results indicate that the model maintains robust performance, even when fine-tuned on limited target data. Experimental results on two publicly available LFP/graphite lithium-ion battery datasets demonstrate that the proposed method outperforms a baseline deep learning approach and machine learning techniques, achieving an RMSE of 101.59, highlighting its potential for real-world RUL prediction applications.

[1412] Small Models, Smarter Learning: The Power of Joint Task Training

Csaba Both, Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Daniel Karl I. Weidele, Mauro Martino, Nima Dehmamy

Main category: cs.LG

TL;DR: Joint training of compatible tasks can dramatically reduce model size requirements for learning, but only when tasks share computational primitives and structure.

Details

Motivation: To systematically understand when multi-task learning reduces the minimum model size required to learn tasks, using controlled testbeds to study task compatibility effects.

Method: Used nested arithmetic (ListOps) and permutation groups as controlled testbeds, analyzed learning transitions (minimum model size for task learning), conducted PCA of learned embeddings, and performed transfer experiments.

Result: Certain task pairings reduced model size requirements by 2-7 times; successful joint training induced structured number representations; pretraining on easy tasks enabled learning addition at 7 times smaller sizes.

Conclusion: Task compatibility (shared computational primitives and structure) determines whether joint training reduces capacity requirements, not mere task diversity.

Abstract: Multi-task learning improves generalization, but when does it reduce the model capacity required to learn? We provide a systematic study of how joint training affects the learning transition, the minimum model size at which a task can be learned, using nested arithmetic (ListOps) and permutation groups as controlled testbeds. Certain task pairings dramatically reduce model size requirements: combining easy operations (MAX, MIN, PROD) with hard ones (modular addition, permutation products) enables learning with 2-7 times fewer parameters. Crucially, we also identify when synergies fail: pairing structurally similar hard tasks (e.g., ADD with alternating-sign NADD) provides no benefit, nor does pairing tasks lacking shared computational primitives. PCA of learned embeddings reveals that successful joint training induces structured number representations (ordering, parity, modular structure) absent in single-task models. Transfer experiments confirm these representations are causal: models pretrained on easy tasks learn addition at 7 times smaller sizes. Our results establish that task compatibility, not mere diversity, determines whether joint training reduces capacity requirements, providing quantitative guidance for curriculum design.

[1413] ePC: Fast and Deep Predictive Coding for Digital Hardware

Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

Main category: cs.LG

TL;DR: The paper introduces error-based Predictive Coding (ePC), a novel reparameterization of PC that overcomes signal decay issues in digital simulation, enabling efficient training of deeper neural networks while matching backpropagation performance.

Details

Motivation: Predictive Coding offers a brain-inspired alternative to backpropagation but suffers from hardware-algorithm mismatch in digital simulation, with exponential signal decay that prevents scaling to deeper architectures.

Method: The authors reformulate PC by introducing error-based PC (ePC), which reparameterizes the canonical state-based PC to eliminate signal decay while maintaining exact gradient computation, enabling orders of magnitude faster training.

Result: ePC matches backpropagation’s performance across multiple architectures and datasets, even for deeper models where traditional PC struggles, and runs orders of magnitude faster than state-based PC.

Conclusion: The work provides both practical improvements for scaling PC-based learning on digital hardware and theoretical insights into PC dynamics, establishing a foundation for deeper architecture training with brain-inspired algorithms.

Abstract: Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weights gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation’s performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.

[1414] Experience-based Knowledge Correction for Robust Planning in Minecraft

Seungjoon Lee, Suhwan Kim, Minhyeon Oh, Youngsik Yoon, Jungseul Ok

Main category: cs.LG

TL;DR: XENON is an LLM-based agent that algorithmically corrects flawed knowledge priors through experience, using adaptive dependency graphs and failure-aware action memory to improve long-horizon planning in Minecraft.

Details

Motivation: LLMs often have flawed priors about goal/item dependencies and feasible actions in embodied environments like Minecraft, and they fail to correct these through prompting alone, even with feedback. This limits their effectiveness in long-horizon planning tasks.

Method: XENON integrates two mechanisms: 1) Adaptive Dependency Graph that corrects item dependencies using past successes, and 2) Failure-aware Action Memory that corrects action knowledge using past failures. These allow the agent to algorithmically revise knowledge from experience with sparse binary feedback.

Result: Experiments across multiple Minecraft benchmarks show XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, it surpasses agents that rely on much larger proprietary models.

Conclusion: XENON demonstrates that algorithmic knowledge correction from experience enables robustness to flawed priors and sparse feedback, allowing effective long-horizon planning with smaller LLMs in complex embodied environments.

Abstract: Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models. Code available at https://sjlee-me.github.io/XENON

[1415] KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang

Main category: cs.LG

TL;DR: KVmix: A mixed-precision quantization method for KV Cache that uses gradient-based importance analysis for layer-specific bit-width allocation and dynamic long-context optimization to reduce memory usage while maintaining accuracy.

Details

Motivation: The high memory demands of KV Cache during LLM inference severely restrict deployment in resource-constrained platforms. Existing quantization methods either use static precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs.

Method: KVmix uses gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones. It also introduces dynamic long-context optimization that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, with efficient low-bit quantization and CUDA kernels.

Result: On LLMs like Llama and Mistral, KVmix achieves near-lossless inference performance with extremely low quantization (Key 2.19bit, Value 2.38bit), delivering 4.9x memory compression and 5.3x speedup in inference throughput.

Conclusion: KVmix provides an effective solution for KV Cache quantization that balances accuracy and efficiency through intelligent mixed-precision allocation and dynamic long-context optimization, enabling efficient LLM deployment on resource-constrained platforms.

Abstract: The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment in resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by KV Cache. However, existing methods either rely on static one-size-fits-all precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mix-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to optimize computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with extremely low quantization configuration (Key 2.19bit Value 2.38bit), while delivering a remarkable 4.9x memory compression and a 5.3x speedup in inference throughput.

[1416] Adaptive Shielding for Safe Reinforcement Learning under Hidden-Parameter Dynamics Shifts

Minjae Kwon, Tyler Ingebrand, Ufuk Topcu, Lu Feng

Main category: cs.LG

TL;DR: Adaptive Shielding framework for safe RL in hidden-parameter MDPs that uses function encoding for dynamics inference and two-layer safety strategy with safety-regularized optimization and uncertainty-aware shielding.

Details

Motivation: Unseen shifts in environment dynamics driven by hidden parameters (friction, gravity) create safety challenges in reinforcement learning that need to be addressed while maintaining performance.

Method: Proposes Adaptive Shielding with function encoder to infer low-dimensional dynamics representation online, safety-regularized optimization to train policies away from high-cost regions, and uncertainty-aware shielding using conformal prediction to filter unsafe actions.

Result: Outperforms baselines on return-safety trade-off across Safe-Gym benchmarks with varying hidden parameters, generalizes reliably to unseen dynamics with modest execution-time overhead.

Conclusion: Adaptive Shielding provides an effective framework for safe RL in environments with hidden parameter shifts, combining proactive and reactive safety mechanisms with theoretical guarantees on prediction errors.

Abstract: Unseen shifts in environment dynamics, driven by hidden parameters such as friction or gravity, create a challenge for maintaining safety. We address this challenge by proposing Adaptive Shielding, a framework for safe reinforcement learning in constrained hidden-parameter Markov decision processes. A function encoder infers a low-dimensional representation of the underlying dynamics online from transition data, allowing the shield to adapt. To ensure safety during this process, we use a two-layer strategy. First, we introduce safety-regularized optimization that proactively trains the policy away from high-cost regions. Second, the adaptive shielding reactively uses the inferred dynamics to forecast safety risks and applies uncertainty-aware bounds using conformal prediction to filter unsafe actions. We prove that prediction errors in the shielding connect with bounds on the average cost rate. Empirically, across Safe-Gym benchmarks with varying hidden parameters, our approach outperforms baselines on the return-safety trade-off and generalizes reliably to unseen dynamics, while incurring only modest execution-time overhead. Code is available at https://github.com/safe-autonomy-lab/AdaptiveShieldingFE.

[1417] Identifiability of Deep Polynomial Neural Networks

Konstantin Usevich, Ricardo Borsoi, Clara Dérand, Marianne Clausel

Main category: cs.LG

TL;DR: Deep Polynomial Neural Networks (PNNs) have complex algebraic structure but their identifiability (interpretability) is poorly understood. This paper provides comprehensive analysis of identifiability for deep PNNs with/without bias terms, revealing interplay between activation degrees and layer widths.

Details

Motivation: PNNs have rich algebraic and geometric structure, but their identifiability - crucial for interpretability - remains poorly understood. Understanding when PNNs are identifiable (unique parameterization) is important for model interpretability and theoretical understanding.

Method: Theoretical analysis connecting deep PNNs to low-rank tensor decompositions and using Kruskal-type uniqueness theorems. Constructive proofs examining architectures with and without bias terms, focusing on interplay between activation degrees and layer widths.

Result: Shows architectures with non-increasing layer widths are generically identifiable under mild conditions. Encoder-decoder networks are identifiable when decoder widths don’t grow too rapidly compared to activation degrees. Settles open conjecture on dimension of PNN’s neurovarieties and provides new bounds on activation degrees.

Conclusion: Provides comprehensive theoretical understanding of identifiability in deep PNNs, revealing intricate relationship between architectural choices and identifiability properties, with implications for model interpretability and theoretical foundations.

Abstract: Polynomial Neural Networks (PNNs) possess a rich algebraic and geometric structure. However, their identifiability – a key property for ensuring interpretability – remains poorly understood. In this work, we present a comprehensive analysis of the identifiability of deep PNNs, including architectures with and without bias terms. Our results reveal an intricate interplay between activation degrees and layer widths in achieving identifiability. As special cases, we show that architectures with non-increasing layer widths are generically identifiable under mild conditions, while encoder-decoder networks are identifiable when the decoder widths do not grow too rapidly compared to the activation degrees. Our proofs are constructive and center on a connection between deep PNNs and low-rank tensor decompositions, and Kruskal-type uniqueness theorems. We also settle an open conjecture on the dimension of PNN’s neurovarieties, and provide new bounds on the activation degrees required for it to reach the expected dimension.

[1418] A Selective Quantization Tuner for ONNX Models

Nikolaos Louloudakis, Ajitha Rajan

Main category: cs.LG

TL;DR: SeQTO is a framework for selective quantization of ONNX models that optimizes the trade-off between accuracy and efficiency by quantizing only some layers while keeping others at full precision, with multi-objective optimization to find optimal configurations across different hardware devices.

Details

Motivation: Full quantization often causes significant accuracy degradation, and hardware accelerators may not support all quantized operations. Selective quantization can balance accuracy and efficiency, but determining the optimal configuration is challenging.

Method: SeQTO framework enables selective quantization, deployment, and execution of ONNX models on diverse CPU/GPU devices. It combines profiling with multi-objective optimization (Pareto Front) to identify optimal quantization configurations, evaluates performance metrics (accuracy, size), and provides visualization tools.

Result: Evaluation on four ONNX models across CPU/GPU devices shows SeQTO identifies high-quality selectively quantized models, achieving up to 54.14% lower accuracy loss while maintaining up to 98.18% of size reduction compared to fully quantized models.

Conclusion: SeQTO effectively addresses the selective quantization challenge by providing a systematic framework that optimizes the accuracy-efficiency trade-off across diverse hardware platforms.

Abstract: Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is a challenging task. To this direction, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates performance on metrics such as accuracy and size, applies Pareto Front-based objective minimization to identify optimal candidates, and provides visualization of results. We evaluated SeQTO on four ONNX models under two quantization settings across CPU and GPU devices. Our results show that SeQTO effectively identifies high-quality selectively quantized models, achieving up to 54.14% lower accuracy loss while maintaining up to 98.18% of size reduction compared to fully quantized models.

[1419] Parameter-efficient Multi-Task and Multi-Domain Learning using Factorized Tensor Networks

Yash Garg, Nebiyou Yismaw, Rakib Hyder, Ashley Prater-Bennette, M. Salman Asif

Main category: cs.LG

TL;DR: FTN introduces factorized tensor networks for multi-task/domain learning with minimal additional parameters, achieving accuracy comparable to independent single-task networks while avoiding catastrophic forgetting.

Details

Motivation: To develop an efficient multi-task and multi-domain learning method that leverages shared information across tasks/domains while minimizing additional parameters and avoiding catastrophic forgetting.

Method: Uses factorized tensor networks (FTN) that incorporate task/domain-specific low-rank tensor factors into a shared frozen network from a source model, enabling adaptation to multiple targets with minimal parameters.

Result: FTN achieves similar accuracy as single-task/domain methods while using only a fraction of additional parameters per task, demonstrated on various datasets with both convolutional and transformer architectures.

Conclusion: FTN provides an efficient parameter-sharing approach for multi-task/domain learning that maintains accuracy while minimizing computational and storage costs.

Abstract: Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The primary challenge and opportunity lie in leveraging shared information across these tasks and domains to enhance the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we introduce a factorized tensor network (FTN) designed to achieve accuracy comparable to that of independent single-task or single-domain networks, while introducing a minimal number of additional parameters. The FTN approach entails incorporating task- or domain-specific low-rank tensor factors into a shared frozen network derived from a source model. This strategy allows for adaptation to numerous target domains and tasks without encountering catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets. We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture. Our findings indicate that FTN attains similar accuracy as single-task or single-domain methods while using only a fraction of additional parameters per task. The code is available at https://doi.org/10.24433/CO.7519211.v2.

[1420] ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model

Yongfan Lai, Bo Liu, Xinyan Guan, Qinghao Zhao, Hongyan Li, Shenda Hong

Main category: cs.LG

TL;DR: ECGTwin: A two-stage framework for personalized ECG generation using contrastive learning and diffusion models with specialized condition injection pathways.

Details

Motivation: To enable personalized ECG digital twins for individualized healthcare by addressing challenges of extracting individual features without ground truth and injecting various conditions without confusing generative models.

Method: Two-stage framework: 1) Individual Base Extractor trained via contrastive learning captures personal features from reference ECG; 2) Diffusion-based generation with AdaX Condition Injector that uses two specialized pathways to integrate individual features and target cardiac conditions.

Result: Model generates ECG signals with high fidelity and diversity while preserving individual-specific features, offering fine-grained controllability and potential to enhance ECG auto-diagnosis in downstream applications.

Conclusion: ECGTwin demonstrates the possibility of precise personalized healthcare solutions through personalized ECG generation that preserves individual characteristics while allowing condition-specific control.

Abstract: Personalized electrocardiogram (ECG) generation is to simulate a patient’s ECG digital twins tailored to specific conditions. It has the potential to transform traditional healthcare into a more accurate individualized paradigm, while preserving the key benefits of conventional population-level ECG synthesis. However, this promising task presents two fundamental challenges: extracting individual features without ground truth and injecting various types of conditions without confusing generative model. In this paper, we present ECGTwin, a two-stage framework designed to address these challenges. In the first stage, an Individual Base Extractor trained via contrastive learning robustly captures personal features from a reference ECG. In the second stage, the extracted individual features, along with a target cardiac condition, are integrated into the diffusion-based generation process through our novel AdaX Condition Injector, which injects these signals via two dedicated and specialized pathways. Both qualitative and quantitative experiments have demonstrated that our model can not only generate ECG signals of high fidelity and diversity by offering a fine-grained generation controllability, but also preserving individual-specific features. Furthermore, ECGTwin shows the potential to enhance ECG auto-diagnosis in downstream application, confirming the possibility of precise personalized healthcare solutions.

[1421] Retrospective Feature Estimation for Continual Learning

Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D. Doan

Main category: cs.LG

TL;DR: Retrospective Feature Estimation (RFE) is a new continual learning approach that learns to reverse feature changes by aligning current DNN features backward to old task feature spaces using retrospector modules.

Details

Motivation: Current DNNs suffer from catastrophic forgetting when learning from changing data streams. Existing continual learning methods use replay, regularization, or dedicated capacity, but this paper explores a novel direction of retrospective feature estimation to mitigate forgetting.

Method: RFE learns to reverse feature changes by aligning features from the current trained DNN backward to the feature space of old tasks using a chain of small feature mapping networks called retrospector modules. This makes predictions easier in the original feature space.

Result: Empirical experiments on CL benchmarks (CIFAR10, CIFAR100, Tiny ImageNet) demonstrate RFE’s effectiveness and potential compared to existing representative CL methods.

Conclusion: Retrospective mechanisms offer a principled alternative for mitigating catastrophic forgetting in continual learning, motivating further research in this direction.

Abstract: The intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which interferes with remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches often retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored direction for CL called Retrospective Feature Estimation (RFE). RFE learns to reverse feature changes by aligning the features from the current trained DNN backward to the feature space of the old task, where performing predictions is easier. This retrospective process utilizes a chain of small feature mapping networks called retrospector modules. Empirical experiments on several CL benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods, motivating further research into retrospective mechanisms as a principled alternative for mitigating catastrophic forgetting in CL. Code is available at: https://github.com/mail-research/retrospective-feature-estimation.

[1422] SACO: Sequence-Aware Constrained Optimization Framework for Coupon Distribution in E-commerce

Li Kong, Bingzhe Wang, Zhou Chen, Suhan Hu, Yuchao Ma, Qi Qi, Suoyuan Song, Bicheng Jin

Main category: cs.LG

TL;DR: SACO framework for sequential coupon distribution optimization in e-commerce platforms

Details

Motivation: Existing coupon distribution strategies fail to leverage complex sequential interactions between platforms and users, leading to performance plateau despite abundant e-commerce log data

Method: Proposes Sequence-Aware Constrained Optimization (SACO) framework that integrates three key characteristics: general scenarios, sequential modeling with comprehensive historical data, and efficient iterative updates

Result: Superior performance demonstrated on real-world industrial dataset, public datasets, and synthetic datasets

Conclusion: SACO framework enables optimized online decision-making for long-term revenue boosting in various real-world marketing scenarios

Abstract: Coupon distribution is a critical marketing strategy used by online platforms to boost revenue and enhance user engagement. Regrettably, existing coupon distribution strategies fall far short of effectively leveraging the complex sequential interactions between platforms and users. This critical oversight, despite the abundance of e-commerce log data, has precipitated a performance plateau. In this paper, we focus on the scene that the platforms make sequential coupon distribution decision multiple times for various users, with each user interacting with the platform repeatedly. Based on this scenario, we propose a novel marketing framework, named \textbf{S}equence-\textbf{A}ware \textbf{C}onstrained \textbf{O}ptimization (SACO) framework, to directly devise coupon distribution policy for long-term revenue boosting. SACO framework enables optimized online decision-making in a variety of real-world marketing scenarios. It achieves this by seamlessly integrating three key characteristics, general scenarios, sequential modeling with more comprehensive historical data, and efficient iterative updates within a unified framework. Furthermore, empirical results on real-world industrial dataset, alongside public and synthetic datasets demonstrate the superiority of our framework.

[1423] FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization

Yiming Yao, Fei Liu, Liang Zhao, Xi Lin, Yilu Liu, Qingfu Zhang

Main category: cs.LG

TL;DR: FoMEMO introduces foundation models for expensive multi-objective optimization, using pre-training on synthetic data to enable in-context optimization without rebuilding models for each new problem.

Details

Motivation: Addressing the challenge of expensive multi-objective optimization where sample-efficiency is crucial due to limited evaluations. Existing methods either rebuild models from scratch for each problem or require extensive domain-specific pre-training, making them hard to generalize to emerging real-world applications.

Method: Proposes FoMEMO (Foundation Models for Expensive Multi-objective Optimization) that establishes a foundation model conditioned on domain trajectory and user preference. The model is pre-trained on hundreds of millions of synthetic data points to enable fast in-context optimization based on predicted preference-wise aggregated posteriors, without requiring model updates during optimization.

Result: Demonstrates that pre-training with diverse synthetic data leads to superior generalization and optimization performance on unknown problems compared to existing approaches that require rebuilding models or extensive real-world domain experiments.

Conclusion: FoMEMO provides a new paradigm for expensive multi-objective optimization that leverages foundation models pre-trained on synthetic data to achieve better generalization and efficiency without requiring model updates during the optimization process.

Abstract: Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments for pre-training deep learning models, making them hard to generalize and impractical to cope with various emerging applications in the real world. To address this issue, we propose a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization), which enables the establishment of a foundation model conditioned on any domain trajectory and user preference, and facilitates fast in-context optimization based on the predicted preference-wise aggregated posteriors. Rather than accessing extensive real-world domain experiments for training, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior generalization and optimization performance to unknown problems, without necessitating any subsequent model training or updates in the following optimization process.

[1424] Meta-Learning Reinforcement Learning for Crypto-Return Prediction

Junqiao Wang, Zhaoyang Guan, Guanyu Liu, Tianze Xia, Xianzhi Li, Shuo Yin, Xinyuan Song, Chuhan Cheng, Tianyu Shi, Alex Lee

Main category: cs.LG

TL;DR: Meta-RL-Crypto: A transformer-based architecture combining meta-learning and RL for cryptocurrency trading, using a self-improving agent with actor-judge-meta-judge roles that learns from multimodal market inputs without human supervision.

Details

Motivation: Cryptocurrency return prediction is challenging due to fast-shifting market factors (on-chain activity, news, social sentiment) and scarce/expensive labeled training data. Need for automated trading systems that can adapt to changing market conditions.

Method: Unified transformer architecture combining meta-learning and RL. Starts with instruction-tuned LLM, then uses closed-loop system with three alternating roles: actor (makes trades), judge (evaluates decisions), and meta-judge (refines evaluation criteria). Self-improving process leverages multimodal market inputs and internal preference feedback without human supervision.

Result: Experiments across diverse market regimes show good performance on real market technical indicators and outperforms other LLM-based baselines.

Conclusion: Meta-RL-Crypto demonstrates effective self-improving trading agent architecture for cryptocurrency markets, showing promise for automated trading systems that can adapt to changing conditions without human intervention.

Abstract: Predicting cryptocurrency returns is notoriously difficult: price movements are driven by a fast-shifting blend of on-chain activity, news flow, and social sentiment, while labeled training data are scarce and expensive. In this paper, we present Meta-RL-Crypto, a unified transformer-based architecture that unifies meta-learning and reinforcement learning (RL) to create a fully self-improving trading agent. Starting from a vanilla instruction-tuned LLM, the agent iteratively alternates between three roles-actor, judge, and meta-judge-in a closed-loop architecture. This learning process requires no additional human supervision. It can leverage multimodal market inputs and internal preference feedback. The agent in the system continuously refines both the trading policy and evaluation criteria. Experiments across diverse market regimes demonstrate that Meta-RL-Crypto shows good performance on the technical indicators of the real market and outperforming other LLM-based baselines.

[1425] Learning to Weight Parameters for Training Data Attribution

Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann

Main category: cs.LG

TL;DR: Proposes a method to learn parameter importance weights for gradient-based data attribution, improving accuracy across image classification, language modeling, and diffusion tasks without requiring annotated labels.

Details

Motivation: Existing gradient-based data attribution methods either treat network parameters uniformly or rely on implicit Hessian approximations, failing to capture functional heterogeneity of network parameters. There's a need for methods that explicitly model parameter importance for more accurate attribution.

Method: Proposes learning parameter importance weights directly from data without requiring annotated labels. The method explicitly models functional heterogeneity of network parameters to improve attribution accuracy.

Result: Improves attribution accuracy across diverse tasks including image classification, language modeling, and diffusion. Enables fine-grained attribution for concepts like subject and style.

Conclusion: Explicitly learning parameter importance weights from data improves gradient-based data attribution accuracy and enables more fine-grained analysis of training data influence on model outputs.

Abstract: We study gradient-based data attribution, aiming to identify which training examples most influence a given output. Existing methods for this task either treat network parameters uniformly or rely on implicit weighting derived from Hessian approximations, which do not fully model functional heterogeneity of network parameters. To address this, we propose a method to explicitly learn parameter importance weights directly from data, without requiring annotated labels. Our approach improves attribution accuracy across diverse tasks, including image classification, language modeling, and diffusion, and enables fine-grained attribution for concepts like subject and style.

[1426] StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions

Nicholas Kraabel, Jiangtao Liu, Yuchen Bian, Daniel Kifer, Chaopeng Shen

Main category: cs.LG

TL;DR: StefaLand is a generative spatiotemporal Earth representation learning model that learns cross-domain interactions for predicting climate-driven land-surface responses like streamflow, soil moisture, soil composition, and landslides, with strong spatial generalization capabilities.

Details

Motivation: Traditional models struggle with spatial generalization and degrade under concept drift, while vision foundation models trained on satellite imagery demand massive compute and aren't designed for dynamic land surface prediction tasks.

Method: Uses a location-aware masked autoencoder that fuses static and time-series inputs, attribute-based rather than image-based representation to reduce compute demands, and residual fine-tuning adapters for knowledge transfer across tasks.

Result: Demonstrates strong spatial generalization on five datasets across four important tasks (streamflow, soil moisture, soil composition, landslides), outperforming state-of-the-art supervised learning baselines, fine-tuned vision foundation models, and commercial embeddings.

Conclusion: StefaLand highlights the value of cross-domain interactions, can be trained on academic compute resources, and provides assistance to data-poor regions for climate-driven land-surface prediction tasks.

Abstract: Managing natural resources and mitigating risks from floods, droughts, wildfires, and landslides require models that can accurately predict climate-driven land-surface responses. Traditional models often struggle with spatial generalization because they are trained or calibrated on limited observations and can degrade under concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute, and they are not designed for dynamic land surface prediction tasks. We introduce StefaLand, a generative spatiotemporal Earth representation learning model centered on learning cross-domain interactions to suppress overfitting. StefaLand demonstrates especially strong spatial generalization on five datasets across four important tasks: streamflow, soil moisture, soil composition and landslides, compared to previous state-of-the-art methods. The domain-inspired design choices include a location-aware masked autoencoder that fuses static and time-series inputs, an attribute-based rather than image-based representation that drastically reduces compute demands, and residual fine-tuning adapters that strengthen knowledge transfer across tasks. StefaLand can be pretrained and finetuned on commonly available academic compute resources, yet consistently outperforms state-of-the-art supervised learning baselines, fine-tuned vision foundation models and commercially available embeddings, highlighting the previously overlooked value of cross-domain interactions and providing assistance to data-poor regions of the world.

[1427] Bridging GANs and Bayesian Neural Networks via Partial Stochasticity

Maurizio Filippone, Marius P. Linhard

Main category: cs.LG

TL;DR: GANs reinterpreted as Bayesian neural networks with partial stochasticity, enabling better understanding of optimization challenges and proposing regularization strategies for improved performance.

Details

Motivation: GANs are successful but notoriously difficult to optimize. The paper aims to explain both the success and limitations of GANs by providing a Bayesian interpretation that reveals fundamental optimization challenges.

Method: Reinterpret GANs as Bayesian neural networks with partial stochasticity. Establish universal approximation conditions, rewrite adversarial optimization as likelihood proxy optimization, and propose regularization strategies including loss landscape smoothing and minimum description length solutions.

Result: The Bayesian interpretation provides theoretical understanding of GAN optimization. Proposed regularization strategies lead to performance improvements across a wide range of experiments.

Conclusion: The Bayesian framework offers deeper understanding of GANs, reveals the need for regularization, and provides practical strategies that improve performance, paving the way for more stable GAN training.

Abstract: Generative Adversarial Networks (GANs) are popular and successful generative models. Despite their success, optimization is notoriously challenging. In this work, we explain the success and limitations of GANs by casting them as Bayesian neural networks with partial stochasticity. This interpretation allows us to establish conditions of universal approximation and to rewrite the adversarial-style optimization of several variants of GANs as the optimization of a proxy for the likelihood obtained by marginalizing out the stochastic variables. Following this interpretation, the need for regularization becomes apparent, and we propose to adopt strategies to smooth the loss landscape and methods to search for solutions with minimum description length, which are associated with flat minima and good generalization. Results obtained on a wide range of experiments indicate that these strategies lead to performance improvements and pave the way to a deeper understanding of GANs.

[1428] Intra-Cluster Mixup: An Effective Data Augmentation Technique for Complementary-Label Learning

Tan-Ha Mai, Hsuan-Tien Lin

Main category: cs.LG

TL;DR: Proposes Intra-Cluster Mixup (ICM) for complementary-label learning, addressing Mixup’s ineffectiveness by synthesizing augmented data only from nearby examples to mitigate complementary-label noise.

Details

Motivation: Complementary-label learning (CLL) is a weakly-supervised learning approach where models learn from labels indicating classes instances don't belong to, which is cheaper to collect than ordinary labels. While most CLL research focuses on loss functions, data augmentation remains underexplored, and standard Mixup augmentation is ineffective in CLL due to complementary-label noise.

Method: Proposes Intra-Cluster Mixup (ICM) which only synthesizes augmented data from nearby examples within clusters, mitigating the noise effect caused by standard Mixup. ICM encourages complementary label sharing among nearby examples.

Result: ICM achieves substantial performance improvements across synthetic and real-world datasets. On MNIST and CIFAR datasets, it achieves significant accuracy increases of 30% and 10% respectively, working effectively with state-of-the-art CLL algorithms in both balanced and imbalanced settings.

Conclusion: ICM effectively addresses the data augmentation challenge in complementary-label learning by mitigating complementary-label noise, demonstrating strong performance improvements and compatibility with existing CLL algorithms.

Abstract: In this paper, we investigate the challenges of complementary-label learning (CLL), a specialized form of weakly-supervised learning (WSL) where models are trained with labels indicating classes to which instances do not belong, rather than standard ordinary labels. This alternative supervision is appealing because collecting complementary labels is generally cheaper and less labor-intensive. Although most existing research in CLL emphasizes the development of novel loss functions, the potential of data augmentation in this domain remains largely underexplored. In this work, we uncover that the widely-used Mixup data augmentation technique is ineffective when directly applied to CLL. Through in-depth analysis, we identify that the complementary-label noise generated by Mixup negatively impacts the performance of CLL models. We then propose an improved technique called Intra-Cluster Mixup (ICM), which only synthesizes augmented data from nearby examples, to mitigate the noise effect. ICM carries the benefits of encouraging complementary label sharing of nearby examples, and leads to substantial performance improvements across synthetic and real-world labeled datasets. In particular, our wide spectrum of experimental results on both balanced and imbalanced CLL settings justifies the potential of ICM in allying with state-of-the-art CLL algorithms, achieving significant accuracy increases of 30% and 10% on MNIST and CIFAR datasets, respectively.

[1429] Pulling Back the Curtain on Deep Networks

Maciej Satkiewicz, Roberto Corizzo, Marcin Pietroń

Main category: cs.LG

TL;DR: Semantic Pullbacks (SP) is a post-hoc explanation method for deep neural networks that produces perceptually aligned, class-conditional explanations by isolating the network’s effective linear action via a principled pullback formulation.

Details

Motivation: Existing post-hoc explanation methods (like gradient-based saliency maps) often produce noisy, perceptually misaligned explanations that lack interpretability. Many approaches use ad-hoc heuristics that lack theoretical justification and fail basic sanity checks.

Method: SP isolates the network’s effective linear action through a principled pullback formulation and refines it to recover coherent local structures learned by target neurons. This produces perceptually aligned explanations that highlight meaningful features.

Result: SP significantly outperforms established attribution methods on standard faithfulness benchmarks for both convolutional architectures (ResNet50, VGG) and transformer-based models (PVT). The method remains general and computationally efficient.

Conclusion: Semantic Pullbacks provide a theoretically motivated, faithful explanation method that produces perceptually aligned explanations, supports counterfactual perturbations, and can be easily integrated into existing deep learning pipelines across modalities.

Abstract: Post-hoc explainability methods typically associate each output score of a deep neural network with an input-space direction, most commonly instantiated as the gradient and visualized as a saliency map. However, these approaches often yield explanations that are noisy, lack perceptual alignment and, thus, offer limited interpretability. While many explanation methods attempt to address this issue via modified backward rules or additional heuristics, such approaches are often difficult to justify theoretically and frequently fail basic sanity checks. We introduce Semantic Pullbacks (SP), a faithful and effective post-hoc explanation method for deep neural networks. Semantic Pullbacks address the limitations above by isolating the network’s effective linear action via a principled pullback formulation and refining it to recover coherent local structures learned by the target neuron. As a result, SP produces perceptually aligned, class-conditional explanations that highlight meaningful features, support compelling counterfactual perturbations, and admit a clear theoretical motivation. Across standard faithfulness benchmarks, Semantic Pullbacks significantly outperform established attribution methods on both classical convolutional architectures (ResNet50, VGG) and transformer-based models (PVT), while remaining general and computationally efficient. Our method can be easily plugged into existing deep learning pipelines and extended to other modalities.

[1430] Frictional Q-Learning

Hyunwoo Kim, Hyo Kyung Lee

Main category: cs.LG

TL;DR: Frictional Q-Learning: An off-policy RL algorithm that addresses extrapolation errors by drawing analogy to static friction, encoding supported actions as tangent directions using contrastive VAE.

Details

Motivation: Off-policy reinforcement learning suffers from extrapolation errors when learned policies select actions weakly supported in the replay buffer, leading to unstable performance.

Method: Draws analogy to static friction in classical mechanics, representing replay buffer as smooth low-dimensional action manifold. Uses contrastive variational autoencoder to encode supported actions as tangent directions, with orthogonal complement representing normal components of extrapolation error.

Result: Empirical results on standard continuous-control benchmarks demonstrate robust, stable performance compared with existing baselines.

Conclusion: The friction-inspired approach effectively mitigates deviations toward unsupported actions in off-policy RL, providing a stable learning framework.

Abstract: Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. In this study, we address this issue by drawing an analogy to static friction in classical mechanics. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. To mitigate deviations toward unsupported actions, we propose Frictional Q-Learning, an off-policy algorithm that encodes supported actions as tangent directions using a contrastive variational autoencoder. We further show that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. Empirical results on standard continuous-control benchmarks demonstrate robust, stable performance compared with existing baselines.

[1431] Advancing Universal Deep Learning for Electronic-Structure Hamiltonian Prediction of Materials

Shi Yin, Zujian Dai, Xinyang Pan, Lixin He

Main category: cs.LG

TL;DR: NextHAM: A neural E(3)-symmetry transformer method for efficient and generalizable electronic-structure Hamiltonian prediction with a new large benchmark dataset.

Details

Motivation: Deep learning methods for Hamiltonian prediction offer computational efficiency over traditional DFT, but face challenges with diverse atomic types, structural patterns, and high-dimensional complexity that limit generalization performance.

Method: 1) Uses zeroth-step Hamiltonians from initial DFT charge density as informative descriptors and initial estimates; 2) Neural Transformer architecture with strict E(3)-symmetry and high non-linear expressiveness; 3) Novel training objective ensuring accuracy in both real and reciprocal space to prevent error amplification and “ghost states”.

Result: NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures on the Materials-HAM-SOC benchmark dataset of 17,000 material structures spanning 68 elements.

Conclusion: The work advances universal deep learning for Hamiltonian prediction through both methodological innovations (NextHAM architecture) and dataset curation (Materials-HAM-SOC benchmark).

Abstract: Deep learning methods for electronic-structure Hamiltonian prediction has offered significant computational efficiency advantages over traditional DFT methods, yet the diversity of atomic types, structural patterns, and the high-dimensional complexity of Hamiltonians pose substantial challenges to the generalization performance. In this work, we contribute on both the methodology and dataset sides to advance universal deep learning paradigm for Hamiltonian prediction. On the method side, we propose NextHAM, a neural E(3)-symmetry and expressive correction method for efficient and generalizable materials electronic-structure Hamiltonian prediction. First, we introduce the zeroth-step Hamiltonians, which can be efficiently constructed by the initial charge density of DFT, as informative descriptors of neural regression model in the input level and initial estimates of the target Hamiltonian in the output level, so that the regression model directly predicts the correction terms to the target ground truths, thereby significantly simplifying the input-output mapping for learning. Second, we present a neural Transformer architecture with strict E(3)-Symmetry and high non-linear expressiveness for Hamiltonian prediction. Third, we propose a novel training objective to ensure the accuracy performance of Hamiltonians in both real space and reciprocal space, preventing error amplification and the occurrence of “ghost states” caused by the large condition number of the overlap matrix. On the dataset side, we curate a high-quality broad-coverage large benchmark, namely Materials-HAM-SOC, comprising 17,000 material structures spanning 68 elements from six rows of the periodic table and explicitly incorporating SOC effects. Experimental results on Materials-HAM-SOC demonstrate that NextHAM achieves excellent accuracy and efficiency in predicting Hamiltonians and band structures.

[1432] The Function Representation of Artificial Neural Network

Zhongkui Ma

Main category: cs.LG

TL;DR: ANN structure expressed as functional form using activation integral concept, enabling mathematical solutions and placing ANN in more reasonable framework.

Details

Motivation: To provide a mathematical framework for understanding artificial neural networks by expressing their structure as functional forms, potentially eliminating fundamental questions about ANN.

Method: Uses activation integral concept derived from activation functions to represent ANN structure as simple mathematical functions, enabling analytical solutions.

Result: ANN structure can be represented by simple functions, allowing mathematical solutions and placing current ANN in more reasonable mathematical framework.

Conclusion: This functional representation approach provides mathematical foundation for ANN, potentially resolving fundamental questions about neural network behavior and structure.

Abstract: This paper expresses the structure of artificial neural network (ANN) as a functional form, using the activation integral concept derived from the activation function. In this way, the structure of ANN can be represented by a simple function, and it is possible to find the mathematical solutions of ANN. Thus, it can be recognized that the current ANN can be placed in a more reasonable framework. Perhaps all questions about ANN will be eliminated.

[1433] A Systematic Analysis of Out-of-Distribution Detection Under Representation and Training Paradigm Shifts

Claudio César Claros Olivares, Austin J. Brockmeier

Main category: cs.LG

TL;DR: Large-scale systematic comparison of OOD detection methods using AURC/AUGRC metrics, analyzing different distribution shift regimes and representation paradigms (CNNs vs ViTs), with neural collapse analysis providing insights on method selection.

Details

Motivation: To provide comprehensive empirical comparison and statistically grounded guidance for OOD detection method selection under various distribution shifts, addressing the lack of systematic evaluation across different representation paradigms and shift regimes.

Method: Systematic comparison using AURC/AUGRC metrics, exploring different distribution shift regimes stratified by CLIP embeddings, evaluating CNN and ViT representations on CIFAR-10/100, SuperCIFAR-100, and TinyImageNet with multiple-comparison-controlled rank-based pipeline (Friedman test with Conover-Holm post-hoc).

Result: Probabilistic scores dominate misclassification detection for both CNNs and ViTs; geometry-aware scores prevail on CNNs under strong shifts while GradNorm and KPCA Reconstruction Error remain competitive on ViTs; neural collapse analysis explains when prototype/boundary-based scores become optimal.

Conclusion: Learned feature space largely determines OOD efficacy, with different methods optimal for different architectures and shift regimes; neural collapse analysis provides theoretical grounding for method selection under distribution shift.

Abstract: We present the largest systematic comparison to date of out-of-distribution (OOD) detection methods using AURC and AUGRC as primary metrics. Our comparison explores different regimes of distribution shift (stratified by CLIP embeddings of the out-of-distribution image datasets) with varying numbers of classes and uses a representation-centric view of OOD detection, including neural collapse metrics, for subsequent analysis. Together the empirical results and representation analysis provides novel insights and statistically grounded guidance for method selection under distribution shift. Experiments cover two representation paradigms: CNNs trained from scratch and a fine-tuned Vision Transformer (ViT), evaluated on CIFAR-10/100, SuperCIFAR-100, and TinyImageNet. Using a multiple-comparison-controlled, rank-based pipeline (Friedman test with Conover-Holm post-hoc) and Bron-Kerbosch cliques, we find that the learned feature space largely determines OOD efficacy. For both CNNs and ViTs, probabilistic scores (e.g., MSR, GEN) dominate misclassification (ID) detection. Under stronger shifts, geometry-aware scores (e.g., NNGuide, fDBD, CTM) prevail on CNNs, whereas on ViTs GradNorm and KPCA Reconstruction Error remain consistently competitive. We further show a class-count-dependent trade-off for Monte-Carlo Dropout (MCD) and that a simple PCA projection improves several detectors. The neural-collapse-based geometric analysis explains when prototype and boundary-based scores become optimal under strong shifts.

[1434] Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Vaibhav Singh, Rahaf Aljundi, Eugene Belilovsky

Main category: cs.LG

TL;DR: A method for continual learning in vision-language models that uses unsupervised test-time adaptation to mitigate forgetting without storing past data

Details

Motivation: Vision-language models struggle with catastrophic forgetting when adapting to new domains, and traditional continual learning methods require storing/replaying past data which may be infeasible

Method: Teacher-student framework with gradient-based sparse parameter updates that leverages unlabeled test-time data to reinforce prior task knowledge without replay

Result: Effectively mitigates forgetting in class-incremental continual learning for VLMs, offering memory-free alternative to episodic replay with strong empirical performance

Conclusion: Unsupervised test-time adaptation can effectively address catastrophic forgetting in VLMs without requiring storage of past data, providing practical solution for continual learning

Abstract: Foundational Vision-Language Models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled testtime data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple Teacher-Student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.

[1435] ChaosNexus: A Foundation Model for ODE-based Chaotic System Forecasting with Hierarchical Multi-scale Awareness

Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li

Main category: cs.LG

TL;DR: ChaosNexus: A foundation model for chaotic system forecasting using ScaleFormer architecture with multi-scale temporal processing and Mixture-of-Experts layers, achieving superior long-term attractor statistics and competitive point-wise accuracy.

Details

Motivation: Existing foundation models for ODE-based chaotic systems fail to capture multi-scale temporal structures and distinct spectral characteristics of chaotic dynamics, limiting their forecasting capabilities.

Method: Proposes ScaleFormer architecture that processes temporal contexts across hierarchically varying patch sizes to capture long-range dependencies and preserve high-frequency fluctuations. Integrates Mixture-of-Experts layers in each block and conditions final forecasts on learned frequency fingerprints for global spectral understanding.

Result: Extensive evaluations on over 9,000 synthetic systems show superior fidelity in long-term attractor statistics while maintaining competitive point-wise accuracy. Achieves remarkable zero-shot mean error below 1°C for 5-day station-based weather forecasting in real-world applications.

Conclusion: ChaosNexus effectively addresses multi-scale temporal and spectral challenges in chaotic system forecasting through its novel ScaleFormer architecture and frequency conditioning, demonstrating strong performance in both synthetic and real-world scenarios.

Abstract: Foundation models have shown great promise in achieving zero-shot or few-shot forecasting for ODE-based chaotic systems via large-scale pretraining. However, existing architectures often fail to capture the multi-scale temporal structures and distinct spectral characteristics of chaotic dynamics. To address this, we introduce ChaosNexus, a foundation model for chaotic system forecasting underpinned by the proposed ScaleFormer architecture. By processing temporal contexts across hierarchically varying patch sizes, ChaosNexus effectively captures long-range dependencies and preserves high-frequency fluctuations. To address heterogeneity across distinct systems, we integrate Mixture-of-Experts (MoE) layers into each ScaleFormer block and explicitly condition the final forecasts on a learned frequency fingerprint, providing the model with a global spectral view of the system. Extensive evaluations on over 9,000 synthetic systems demonstrate that ChaosNexus achieves superior fidelity in long-term attractor statistics while maintaining competitive point-wise accuracy. Furthermore, in real-world applications, it achieves a remarkable zero-shot mean error below 1°C for 5-day station-based weather forecasting. Codes are available at https://github.com/TomXaxaxa/ChaosNexus.

[1436] UniGAP: A Universal and Adaptive Graph Upsampling Approach to Mitigate Over-Smoothing in Node Classification Tasks

Xiaotang Wang, Yun Zhu, Haizhou Shi, Yongchao Liu, Yongqi Zhang

Main category: cs.LG

TL;DR: UniGAP is a universal adaptive graph upsampling framework that mitigates over-smoothing in graph neural networks through condensed trajectory features, serving as a plug-in component for existing GNNs to enhance node classification performance.

Details

Motivation: Existing graph networks (MPNNs and Graph Transformers) suffer from over-smoothing of node features, limiting expressive capacity. Current upsampling techniques are often heuristic, requiring extensive manual labor and lacking universal integration strategies.

Method: UniGAP uses an adaptive graph upsampler based on condensed trajectory features as a plug-in component for existing GNNs. It’s a representation-based, fully differentiable framework that can be integrated with various graph neural architectures.

Result: UniGAP demonstrates significant improvements over heuristic data augmentation methods across various datasets and metrics. It identifies key bottlenecks where over-smoothing occurs and shows potential for combination with large language models.

Conclusion: UniGAP provides an effective universal framework for mitigating over-smoothing in graph neural networks, offering insights into graph structure evolution and enabling further exploration of graph upsampling methods.

Abstract: In the graph domain, deep graph networks based on Message Passing Neural Networks (MPNNs) or Graph Transformers often cause over-smoothing of node features, limiting their expressive capacity. Many upsampling techniques involving node and edge manipulation have been proposed to mitigate this issue. However, these methods are often heuristic, resulting in extensive manual labor and suboptimal performance and lacking a universal integration strategy. In this study, we introduce UniGAP, a universal and adaptive graph upsampling framework to mitigate over-smoothing in node classification tasks. Specifically, we design an adaptive graph upsampler based on condensed trajectory features, serving as a plug-in component for existing GNNs to mitigate the over-smoothing problem and enhance performance. Moreover, UniGAP serves as a representation-based and fully differentiable framework to inspire further exploration of graph upsampling methods. Through extensive experiments, UniGAP demonstrates significant improvements over heuristic data augmentation methods in various datasets and metrics. We analyze how graph structure evolves with UniGAP, identifying key bottlenecks where over-smoothing occurs, and providing insights into how UniGAP addresses this issue. Lastly, we show the potential of combining UniGAP with large language models (LLMs) to further improve downstream performance. Our code is available at: https://github.com/wangxiaotang0906/UniGAP

[1437] Reinforcement Learning for Durable Algorithmic Recourse

Marina Ceccon, Alessandro Fabris, Goran Radanović, Asia J. Biega, Gian Antonio Susto

Main category: cs.LG

TL;DR: A time-aware algorithmic recourse framework using reinforcement learning to generate durable recommendations that remain valid over time in competitive, resource-constrained decision systems.

Details

Motivation: Prior research on algorithmic recourse has focused on robustness to model updates but neglected temporal dynamics in competitive settings where recommendations shape future applicant pools, creating a need for durable recommendations that remain valid over time.

Method: Proposes a time-aware framework modeling how populations adapt to recommendations, with a novel RL-based recourse algorithm that captures evolving environmental dynamics to generate feasible and valid recommendations durable over a predefined time horizon T.

Result: Extensive experiments in complex simulation environments show the approach substantially outperforms existing baselines, offering superior balance between feasibility and long-term validity.

Conclusion: Temporal and behavioral dynamics are crucial for practical recourse systems, and the proposed framework provides durable recommendations that allow individuals to confidently reapply after implementing suggested changes.

Abstract: Algorithmic recourse seeks to provide individuals with actionable recommendations that increase their chances of receiving favorable outcomes from automated decision systems (e.g., loan approvals). While prior research has emphasized robustness to model updates, considerably less attention has been given to the temporal dynamics of recourse–particularly in competitive, resource-constrained settings where recommendations shape future applicant pools. In this work, we present a novel time-aware framework for algorithmic recourse, explicitly modeling how candidate populations adapt in response to recommendations. Additionally, we introduce a novel reinforcement learning (RL)-based recourse algorithm that captures the evolving dynamics of the environment to generate recommendations that are both feasible and valid. We design our recommendations to be durable, supporting validity over a predefined time horizon T. This durability allows individuals to confidently reapply after taking time to implement the suggested changes. Through extensive experiments in complex simulation environments, we show that our approach substantially outperforms existing baselines, offering a superior balance between feasibility and long-term validity. Together, these results underscore the importance of incorporating temporal and behavioral dynamics into the design of practical recourse systems.

[1438] On the Design of One-step Diffusion via Shortcutting Flow Paths

Haitao Lin, Peiyan Hu, Minsi Ren, Zhifeng Gao, Zhi-Ming Ma, Guolin ke, Tailin Wu, Stan Z. Li

Main category: cs.LG

TL;DR: A framework for designing and improving one-step diffusion models (shortcut models) that achieves state-of-the-art FID scores on ImageNet-256 without pre-training or distillation.

Details

Motivation: Recent few-step diffusion models have shown efficiency but their theoretical derivation and practical implementation are too coupled, obscuring the design space and limiting systematic improvements.

Method: Proposes a common design framework for shortcut models that provides theoretical justification and disentangles component-level choices, enabling systematic identification of improvements for one-step diffusion models.

Result: Achieves new SOTA FID50k of 2.85 on ImageNet-256x256 with one-step generation under classifier-free guidance, and 2.53 with 2x training steps, without pre-training, distillation, or curriculum learning.

Conclusion: The framework lowers barriers to component-level innovation in shortcut models and facilitates principled exploration of their design space for efficient diffusion models.

Abstract: Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (\emph{a.k.a.} shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting with one step generation, and further reaches FID50k of 2.53 with 2x training steps. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.

[1439] End-to-End Conformal Calibration for Optimization Under Uncertainty

Christopher Yeh, Nicolas Christianson, Alan Wu, Adam Wierman, Yisong Yue

Main category: cs.LG

TL;DR: End-to-end framework for learning uncertainty sets in conditional robust optimization using conformal prediction and partially input-convex neural networks, with applications in energy storage and portfolio optimization.

Details

Motivation: Neural networks often lack well-calibrated uncertainty estimates needed for robust decision-making, and existing uncertainty quantification methods don't consider downstream decision-making losses, leading to suboptimal performance.

Method: Proposes an end-to-end framework that learns uncertainty sets using conformal prediction for calibration guarantees, represents convex uncertainty sets with partially input-convex neural networks, and optimizes uncertainty sets directly for downstream decision-making performance.

Result: The approach consistently outperforms two-stage estimate-then-optimize baselines in energy storage arbitrage and portfolio optimization applications.

Conclusion: Learning uncertainty sets end-to-end with decision-aware calibration improves robust optimization performance compared to traditional two-stage approaches.

Abstract: Machine learning can significantly improve performance for decision-making under uncertainty across a wide range of domains. However, ensuring robustness guarantees requires well-calibrated uncertainty estimates, which can be difficult to achieve with neural networks. Moreover, in high-dimensional settings, there may be many valid uncertainty estimates, each with its own performance profile - i.e., not all uncertainty is equally valuable for downstream decision-making. To address this problem, this paper develops an end-to-end framework to learn uncertainty sets for conditional robust optimization in a way that is informed by the downstream decision-making loss, with robustness and calibration guarantees provided by conformal prediction. In addition, we propose to represent general convex uncertainty sets with partially input-convex neural networks, which are learned as part of our framework. Our approach consistently improves upon two-stage estimate-then-optimize baselines on concrete applications in energy storage arbitrage and portfolio optimization.

[1440] Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

Main category: cs.LG

TL;DR: A novel predictability-aligned diagnostic framework using spectral coherence to separate model performance from data’s intrinsic unpredictability, introducing SCP score and LUR diagnostic tool.

Details

Motivation: Current time series forecasting evaluation metrics conflate model performance with data's intrinsic unpredictability, leading to unfair model comparisons and limited understanding of model behavior.

Method: Introduces Spectral Coherence Predictability (SCP) score to quantify inherent difficulty of forecasting instances, and Linear Utilization Ratio (LUR) as frequency-resolved diagnostic tool to measure how effectively models exploit linearly predictable information.

Result: Reveals “predictability drift” showing forecasting difficulty varies over time, and identifies architectural trade-off: complex models excel on low-predictability data while linear models are effective on predictable tasks.

Conclusion: Advocates for paradigm shift from simplistic aggregate scores to predictability-aware evaluation for fairer model comparisons and deeper understanding of model behavior.

Abstract: In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model’s performance with the data’s intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework’s effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of “predictability drift”, demonstrating that a task’s forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

[1441] Individual Regret in Cooperative Stochastic Multi-Armed Bandits

Idan Barnea, Tal Lancewicki, Yishay Mansour

Main category: cs.LG

TL;DR: Cooperative multi-agent bandit algorithm with communication over arbitrary graphs achieves individual regret independent of graph diameter, with logarithmic message size and communication rounds.

Details

Motivation: To develop cooperative multi-armed bandit algorithms where multiple agents communicate over arbitrary connected graphs, addressing limitations of prior work that had regret dependent on graph diameter.

Method: Analyzes Cooperative Successive Elimination (coopSE) algorithm for stochastic MAB with multiple agents communicating over arbitrary connected graphs, with focus on message size and communication round constraints.

Result: Achieves individual regret bound O(R/m + A² + A√log T) independent of graph diameter, with logarithmic message size; with logarithmic communication rounds, gets O(R/m + A log T) regret.

Conclusion: First cooperative stochastic MAB algorithm with individual regret independent of graph diameter, showing feasibility of efficient communication-constrained multi-agent bandit learning.

Abstract: We study the regret in stochastic Multi-Armed Bandits (MAB) with multiple agents that communicate over an arbitrary connected communication graph. We analyzed a variant of Cooperative Successive Elimination algorithm, $\coopse$, and show an individual regret bound of ${O}(\mathcal{R} / m + A^2 + A \sqrt{\log T})$ and a nearly matching lower bound. Here $A$ is the number of actions, $T$ the time horizon, $m$ the number of agents, and $\mathcal{R} = \sum_{Δ_i > 0}\log(T)/Δ_i$ is the optimal single agent regret, where $Δ_i$ is the sub-optimality gap of action $i$. Our work is the first to show an individual regret bound in cooperative stochastic MAB that is independent of the graph’s diameter. When considering communication networks there are additional considerations beyond regret, such as message size and number of communication rounds. First, we show that our regret bound holds even if we restrict the messages to be of logarithmic size. Second, for logarithmic number of communication rounds, we obtain a regret bound of ${O}(\mathcal{R} / m+A \log T)$.

[1442] Dense associative memory for Gaussian distributions

Chandan Tankala, Krishnakumar Balasubramanian

Main category: cs.LG

TL;DR: Extends dense associative memories from vectors to Gaussian distributions using Wasserstein distance, enabling distributional pattern storage and retrieval with exponential capacity guarantees.

Details

Motivation: Classical dense associative memories (DAMs) are limited to vector representations, while modern machine learning often deals with distributions. The paper aims to bridge this gap by extending DAMs to handle Gaussian distributions using optimal transport theory.

Method: Proposes a Wasserstein DAM framework using Gaussian densities with 2-Wasserstein distance. Defines log-sum-exp energy over stored distributions and retrieval dynamics that aggregate optimal transport maps via Gibbs weighting. Stationary points correspond to Wasserstein barycenters.

Result: Proves exponential storage capacity and provides quantitative retrieval guarantees under Wasserstein perturbations. Validates on synthetic and real-world datasets including CelebA, CIFAR-10 (images) and text8, NLI corpus (text).

Conclusion: Successfully generalizes DAMs from vectors to distributions, bridging classical associative memories with modern generative modeling and enabling distributional storage/retrieval for memory-augmented learning.

Abstract: Dense associative memories (DAMs) store and retrieve patterns via energy-function based fixed points, but existing models are limited to vector representations. We extend DAMs to Gaussian densities equipped with the 2-Wasserstein distance. Our framework defines a log-sum-exp energy over stored distributions and a retrieval dynamics aggregating optimal transport maps in a Gibbs-weighted manner. Stationary points correspond to self-consistent Wasserstein barycenters, generalizing classical DAM fixed points. We prove exponential storage capacity and provide quantitative retrieval guarantees under Wasserstein perturbations. We validate the method on synthetic and real-world image (CelebA and CIFAR-10 datasets) and text (text8 and NLI corpus) datasets. By generalizing from vectors to distributions, our work bridges classical DAMs with modern generative modeling and paves way for distributional storage and retrieval in memory-augmented learning.

[1443] MOMA: Masked Orthogonal Matrix Alignment for Zero-Additional-Parameter Model Merging

Fanshuang Kong, Richong Zhang, Zhijie Nie, Hang Zhou, Ziqiao Wang, Qiang Sun, Chunming Hu

Main category: cs.LG

TL;DR: MOMA addresses model merging degradation by optimizing orthogonal transformations and masks to align merged encoders with classifier heads without adding parameters.

Details

Motivation: Model merging often yields suboptimal performance on classification tasks due to geometric misalignment between merged encoders and static task-specific classifier heads. Existing methods use auxiliary parameters for strict alignment, but the authors argue this is unnecessary since the misalignment is predominantly orthogonal.

Method: Proposes MOMA (Masked Orthogonal Matrix Alignment) which jointly optimizes a global multi-task vector mask and task-specific orthogonal transformations to rectify misalignment. Crucially, MOMA absorbs new parameters directly into existing model weights.

Result: Achieves performance comparable to state-of-the-art baselines with zero additional parameters and zero added inference cost.

Conclusion: MOMA effectively addresses model merging degradation by leveraging orthogonal transformation properties, eliminating the need for strict alignment and additional parameters while maintaining performance.

Abstract: Model merging offers a scalable alternative to multi-task learning but often yields suboptimal performance on classification tasks. We attribute this degradation to a geometric misalignment between the merged encoder and static task-specific classifier heads. Existing methods typically rely on auxiliary parameters to enforce strict representation alignment. We challenge this approach by revealing that the misalignment is predominantly an orthogonal transformation, rendering such strict alignment unnecessary. Leveraging this insight, we propose MOMA (Masked Orthogonal Matrix Alignment), which rectifies the misalignment by jointly optimizing a global multi-task vector mask and task-specific orthogonal transformations. Crucially, MOMA absorbs corresponding new parameters directly into the existing model weights, achieving performance comparable to state-of-the-art baselines with zero additional parameters and zero added inference cost.

[1444] Revisiting Multivariate Time Series Forecasting with Missing Values

Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Philip S. Yu, Kaize Ding

Main category: cs.LG

TL;DR: CRIB is a novel framework for multivariate time series forecasting with missing values that avoids imputation and directly predicts from partially observed data using Information Bottleneck principle with consistency regularization.

Details

Motivation: Traditional imputation-then-prediction frameworks for time series forecasting with missing values are problematic because there's no ground truth for missing values, making imputation error-prone and potentially degrading prediction accuracy. The authors found that unsupervised imputation can corrupt data distributions and harm prediction performance.

Method: Proposes Consistency-Regularized Information Bottleneck (CRIB) framework that directly predicts from partially observed time series without imputation. Uses Information Bottleneck principle to learn robust representations that filter out noise from missing values while preserving predictive signals. Combines unified-variate attention mechanism with consistency regularization scheme.

Result: Comprehensive experiments on four real-world datasets show CRIB achieves accurate predictions even under high missing rates, outperforming traditional imputation-based approaches.

Conclusion: CRIB represents a paradigm shift from imputation-then-prediction to direct prediction from incomplete data, offering a more robust solution for time series forecasting with missing values by avoiding imputation errors and preserving data distributions.

Abstract: Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available in https://github.com/Muyiiiii/CRIB.

[1445] Attention in Geometry: Scalable Spatial Modeling via Adaptive Density Fields and FAISS-Accelerated Kernels

Zhaowen Fan

Main category: cs.LG

TL;DR: ADF is a geometric attention framework that formulates spatial aggregation as query-conditioned attention in continuous space, bridging adaptive kernel methods and attention mechanisms, with applications in aircraft trajectory analysis.

Details

Motivation: The paper aims to develop a framework that bridges concepts from adaptive kernel methods and attention mechanisms by reinterpreting spatial influence as geometry-preserving attention grounded in physical distance, enabling better analysis of spatial data like aircraft trajectories.

Method: ADF formulates spatial aggregation as a query-conditioned, metric-induced attention operator in continuous space, using FAISS-accelerated inverted file indices for scalability by treating approximate nearest-neighbor search as an intrinsic component of the attention mechanism.

Result: Demonstrated through a case study on aircraft trajectory analysis in the Chengdu region, extracting trajectory-conditioned Zones of Influence (ZOI) to reveal recurrent airspace structures and localized deviations.

Conclusion: ADF provides a novel geometric attention framework that effectively bridges adaptive kernel methods and attention mechanisms for spatial data analysis, with practical applications in trajectory analysis and potential extensions to other spatial domains.

Abstract: This work introduces Adaptive Density Fields (ADF), a geometric attention framework that formulates spatial aggregation as a query-conditioned, metric-induced attention operator in continuous space. By reinterpreting spatial influence as geometry-preserving attention grounded in physical distance, ADF bridges concepts from adaptive kernel methods and attention mechanisms. Scalability is achieved via FAISS-accelerated inverted file indices, treating approximate nearest-neighbor search as an intrinsic component of the attention mechanism. We demonstrate the framework through a case study on aircraft trajectory analysis in the Chengdu region, extracting trajectory-conditioned Zones of Influence (ZOI) to reveal recurrent airspace structures and localized deviations.

[1446] AverageTime: Enhance Long-Term Time Series Forecasting with Simple Averaging

Gaoxiang Zhao, Chunmao Huang, Li Zhou, Xiaoqiang Wang

Main category: cs.LG

TL;DR: AverageTime is a simple, efficient time series forecasting model that extends iTransformer’s channel extraction concept by generating multiple novel sequences through averaging operations and structural mechanisms, achieving state-of-the-art performance with near-linear complexity.

Details

Motivation: The paper aims to improve multivariate long-term time series forecasting by enhancing the modeling of intra-sequence and cross-channel dependencies. While iTransformer successfully models channel-wise dependencies, the authors seek to develop a more flexible approach that can generate multiple novel sequences beyond just transforming the original input, while maintaining efficiency.

Method: AverageTime builds on iTransformer’s channel extraction concept but reframes it as a stackable and extensible architecture. The model uses simple averaging operations applied to both extracted sequences and original series, incorporates channel clustering for efficiency, and allows integration of other techniques like series decomposition. It generates multiple novel sequences through various structural mechanisms rather than being limited to transforming the original input.

Result: Experiments on real-world datasets show that AverageTime surpasses state-of-the-art models in forecasting performance while maintaining near-linear complexity. With just two straightforward averaging operations, it achieves superior results with negligible performance loss from the channel clustering technique.

Conclusion: AverageTime offers a new perspective on time series forecasting by enriching sequence information through extraction and fusion. The work demonstrates that simple, efficient approaches can outperform complex architectures in multivariate time series forecasting.

Abstract: Multivariate long-term time series forecasting aims to predict future sequences by utilizing historical observations, with a core focus on modeling intra-sequence and cross-channel dependencies. Numerous studies have developed diverse architectures to capture these patterns, achieving significant improvements in forecasting accuracy. Among them, iTransformer, a representative method for channel information extraction, leverages the Transformer architecture to model channel-wise dependencies, thereby facilitating sequence transformation for enhanced forecasting performance. Building upon iTransformer’s channel extraction concept, we propose AverageTime, a simple, efficient, and scalable forecasting model. Beyond iTransformer, AverageTime retains the original sequence information and reframes channel extraction as a stackable and extensible architecture. This allows the model to generate multiple novel sequences through various structural mechanisms, rather than being limited to transforming the original input. Moreover, the newly extracted sequences are not restricted to channel processing; other techniques such as series decomposition can also be incorporated to enhance predictive accuracy. Additionally, we introduce a channel clustering technique into AverageTime, which substantially improves training and inference efficiency with negligible performance loss. Experiments on real-world datasets demonstrate that with only two straightforward averaging operations, applied to both the extracted sequences and the original series. AverageTime surpasses state-of-the-art models in forecasting performance while maintaining near-linear complexity. This work offers a new perspective on time series forecasting: enriching sequence information through extraction and fusion. The source code is available at https://github.com/ UniqueoneZ/AverageTime.

[1447] Putnam-like dataset summary: LLMs as mathematical competition contestants

Bartosz Bieganowski, Daniel Strzelecki, Robert Skiba, Mateusz Topolewski

Main category: cs.LG

TL;DR: Analysis of LLM performance on Putnam-like mathematical competition problems shows Gemini 2.5 Pro achieves high scores but struggles with rigorous justifications and 2024 Putnam problems.

Details

Motivation: To evaluate and benchmark large language models' mathematical reasoning capabilities using challenging Putnam-like competition problems, assessing their ability to solve complex mathematical contest problems.

Method: Created a benchmark with 96 original Putnam-like problems and 576 LLM-generated solutions, then analyzed model performance, scoring distributions, and solution quality across different models.

Result: Top models like Gemini 2.5 Pro achieved high scores on the benchmark, demonstrating strong mathematical reasoning, but showed lower performance on actual 2024 Putnam problems and struggled with providing fully rigorous justifications.

Conclusion: LLMs show promising mathematical reasoning capabilities on contest problems but have limitations in rigorous justification and performance on recent Putnam competitions, revealing distinct behavioral patterns across models.

Abstract: In this paper we summarize the results of the Putnam-like benchmark published by Google DeepMind. This dataset consists of 96 original problems in the spirit of the Putnam Competition and 576 solutions generated by LLMs. We analyze the performance of models on this set of problems to verify their ability to solve problems from mathematical contests. We find that top models, particularly Gemini 2.5 Pro, achieve high scores, demonstrating strong mathematical reasoning capabilities, although their performance was lower on problems from the 2024 Putnam competition. The analysis highlights distinct behavioral patterns among models, including bimodal scoring distributions and challenges in providing fully rigorous justifications.

[1448] Calibrated Probabilistic Interpolation for GEDI Biomass

Robin Young, Srinivasan Keshav

Main category: cs.LG

TL;DR: ANPs (Attentive Neural Processes) provide calibrated uncertainty estimation for biomass mapping from sparse GEDI LiDAR data by learning spatial covariance functions and conditioning on local observations, outperforming traditional ensemble methods.

Details

Motivation: Standard machine learning methods for biomass mapping from GEDI LiDAR data treat predictions as independent and fail to produce calibrated uncertainty estimates, especially in heterogeneous landscapes where they conflate ensemble variance with aleatoric uncertainty and ignore local spatial context.

Method: Introduces Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function that adapts uncertainty estimates to landscape complexity.

Result: ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration across five distinct biomes (Tropical Amazonian forests to Boreal/Alpine ecosystems). The method demonstrates operational utility through few-shot adaptation, recovering most performance gap in cross-region transfer with minimal local data.

Conclusion: ANPs provide a scalable, theoretically rigorous alternative to ensemble variance for continental-scale earth observation, offering calibrated uncertainty estimation that adapts to landscape heterogeneity and enables effective few-shot transfer learning.

Abstract: Reliable wall-to-wall biomass mapping from NASA’s GEDI mission requires interpolating sparse LiDAR observations across heterogeneous landscapes. While machine learning approaches like Random Forest and XGBoost are standard for this task, they treat spatial predictions of GEDI observations from multispectral or SAR remote sensing data as independent without adapting to the varying difficulty of heterogeneous landscapes. We demonstrate these approaches generally fail to produce calibrated prediction intervals. We identify that this stems from conflating ensemble variance with aleatoric uncertainty and ignoring local spatial context. To resolve this, we introduce Attentive Neural Processes (ANPs), a probabilistic meta-learning framework that explicitly conditions predictions on local observation sets and geospatial foundation model embeddings. Unlike static ensembles, ANPs learn a flexible spatial covariance function, allowing uncertainty estimates to expand in complex landscapes and contract in homogeneous areas. We validate this approach across five distinct biomes ranging from Tropical Amazonian forests to Boreal and Alpine ecosystems, demonstrating that ANPs achieve competitive accuracy while maintaining near-ideal uncertainty calibration. We demonstrate the operational utility of the method through few-shot adaptation, where the model recovers most of the performance gap in cross-region transfer using minimal local data. This work provides a scalable, theoretically rigorous alternative to ensemble variance for continental scale earth observation.

[1449] CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning

Yousef Koka, David Selby, Gerrit Großmann, Kathan Pandya, Sebastian Vollmer

Main category: cs.LG

TL;DR: CleanSurvival is a reinforcement learning framework that automates data preprocessing optimization specifically for survival analysis models, outperforming standard approaches and random search.

Details

Motivation: Data preprocessing is critical but often neglected in machine learning, especially for specialized tasks like survival analysis which lacks automated preprocessing solutions tailored to time-to-event models.

Method: Uses Q-learning reinforcement learning to select optimal combinations of data imputation, outlier detection, and feature extraction techniques for survival models (Cox, random forest, neural network, or user-supplied). Handles both continuous and categorical variables.

Result: Experimental benchmarks on real-world datasets show superior predictive performance compared to standard approaches, finding optimal preprocessing pipelines up to 10 times faster than undirected random grid search. Simulation studies demonstrate effectiveness across different types and levels of missingness and noise.

Conclusion: CleanSurvival successfully addresses the gap in automated preprocessing for survival analysis, providing an efficient reinforcement learning-based solution that improves model performance and accelerates pipeline optimization.

Abstract: Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.

[1450] Planning-Augmented Sampling with Early Guidance for High-Reward Discovery

Rui Zhu, Yudong Zhang, Xuan Yu, Chen Zhang, Xu Wang, Yang Wang

Main category: cs.LG

TL;DR: GFlowNets enhanced with Monte Carlo Tree Search and polynomial upper confidence bounds for faster discovery of high-reward candidates in structured generation tasks like molecular design.

Details

Motivation: Existing GFlowNet sampling strategies rely on weak guided exploration, which slows early discovery of high-reward candidates. In tasks like molecular design, rapid generation of high-quality solutions is more important than perfect distribution matching.

Method: Proposes a planning-augmented framework using Monte Carlo Tree Search with polynomial upper confidence bounds for online value estimates, combined with a controllable soft-greedy mechanism that integrates planning signals into GFlowNets forward policy.

Result: Empirical results show the method accelerates early high-reward discovery, sustains top-quality sample generation, and preserves diversity across representative tasks.

Conclusion: The planning-augmented GFlowNet framework effectively balances exploration and exploitation, enabling faster discovery of high-reward solutions while maintaining diversity in structured generation tasks.

Abstract: Generative Flow Networks (GFlowNets) enable structured generation with inherent diversity, but existing sampling strategies often rely on weak guided exploration, slowing early discovery of high-reward candidates. In tasks such as molecular design, rapid and consistent generation of high-reward solutions can outweigh faithful distribution matching. We propose a planning-augmented framework in which Monte Carlo Tree Search using polynomial upper confidence bounds provides online value estimates, and a controllable soft-greedy mechanism integrates these planning signals into the GFlowNets forward policy. This design fosters early exploration of high-reward trajectories and gradually shifts to policy-driven exploitation as experience accumulates. Empirical results show that our method accelerates early high-reward discovery, sustains top-quality sample generation, and preserves diversity across representative tasks. All implementations are available at https://github.com/ZRNB/PLUS.

[1451] Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang

Main category: cs.LG

TL;DR: Short-length adversarial training can effectively defend against long-length jailbreak attacks on LLMs, with theoretical and empirical evidence showing that training on adversarial suffixes of length Θ(√M) can defend against attacks of length Θ(M).

Details

Motivation: Adversarial training for LLM alignment against jailbreak attacks is resource-intensive when using long adversarial prompts. The paper aims to find more efficient defense strategies by investigating the relationship between adversarial suffix lengths during training and testing.

Method: Theoretical analysis of adversarial in-context learning for linear transformers on linear regression tasks, proving robust generalization bounds. Empirical adversarial training on open-source LLMs with evaluation against jailbreak attacks of varying adversarial suffix lengths.

Result: Theoretical bound shows dependence on Θ(√M_test/M_train). Empirical results confirm positive correlation between attack success rate and the ratio √(adversarial suffix length during jailbreaking)/length during AT. Short-length AT can effectively defend against long-length attacks.

Conclusion: It’s practical to defend against long-length jailbreak attacks using efficient short-length adversarial training, significantly reducing computational resources while maintaining robust defense capabilities.

Abstract: Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis however is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $Θ(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $Θ(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $Θ(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend against “long-length” jailbreak attacks via efficient “short-length” AT. The code is available at https://github.com/fshp971/adv-icl.

[1452] The Three Regimes of Offline-to-Online Reinforcement Learning

Lu Li, Tianwei Ni, Yihao Sun, Pierre-Luc Bacon

Main category: cs.LG

TL;DR: Offline-to-online RL suffers from inconsistent empirical behavior; a stability-plasticity principle is proposed to explain this, identifying three regimes of online fine-tuning based on preserving better knowledge from pretrained policy or offline dataset.

Details

Motivation: The empirical behavior of offline-to-online reinforcement learning is highly inconsistent - design choices that work well in one setting fail completely in another. This inconsistency needs explanation and principled guidance.

Method: Proposes a stability-plasticity principle: preserve knowledge from pretrained policy or offline dataset (whichever is better) while maintaining sufficient plasticity. Identifies three regimes of online fine-tuning requiring distinct stability properties. Validates through large-scale empirical study.

Result: Empirical results strongly align with framework predictions in 45 of 63 cases, with only 3 opposite mismatches. The framework successfully explains inconsistent behavior across different settings.

Conclusion: Provides a principled framework for guiding design choices in offline-to-online RL based on relative performance of offline dataset and pretrained policy, addressing the empirical inconsistency problem.

Abstract: Offline-to-online reinforcement learning (RL) has emerged as a practical paradigm that leverages offline datasets for pretraining and online interactions for fine-tuning. However, its empirical behavior is highly inconsistent: design choices of online fine-tuning that work well in one setting can fail completely in another. We propose a stability–plasticity principle that can explain this inconsistency: we should preserve the knowledge of pretrained policy or offline dataset during online fine-tuning, whichever is better, while maintaining sufficient plasticity. This perspective identifies three regimes of online fine-tuning, each requiring distinct stability properties. We validate this framework through a large-scale empirical study, finding that the results strongly align with its predictions in 45 of 63 cases, with only 3 opposite mismatches. This work provides a principled framework for guiding design choices in offline-to-online RL based on the relative performance of the offline dataset and the pretrained policy.

[1453] Overcoming Spurious Solutions in Semi-Dual Neural Optimal Transport: A Smoothing Approach for Learning the Optimal Transport Plan

Jaemoo Choi, Jaewoong Choi, Dohyun Kwon

Main category: cs.LG

TL;DR: OTP: A novel method that learns both Optimal Transport Map and Optimal Transport Plan to solve convergence issues in Neural OT, outperforming existing methods in image-to-image translation tasks.

Details

Motivation: Semi-dual Neural OT often generates fake solutions that fail to accurately transfer one distribution to another. The authors aim to address convergence problems in learning Optimal Transport maps and handle cases where deterministic OT maps don't exist.

Method: Proposes OTP method that learns both the OT Map and the Optimal Transport Plan (optimal coupling between distributions). Identifies sufficient conditions for max-min solution recovery and provides theoretical guarantees under sharp distribution assumptions.

Result: OTP model recovers optimal transport map where existing methods fail, outperforms current OT-based models in image-to-image translation tasks, and can learn stochastic transport maps for one-to-many tasks like colorization.

Conclusion: OTP successfully addresses fake solution issues in Neural OT, provides theoretical guarantees for correct OT problem solving, and demonstrates practical superiority in image translation tasks including handling stochastic transport scenarios.

Abstract: We address the convergence problem in learning the Optimal Transport (OT) map, where the OT Map refers to a map from one distribution to another while minimizing the transport cost. Semi-dual Neural OT, a widely used approach for learning OT Maps with neural networks, often generates fake solutions that fail to transfer one distribution to another accurately. We identify a sufficient condition under which the max-min solution of Semi-dual Neural OT recovers the true OT Map. Moreover, to address cases when this sufficient condition is not satisfied, we propose a novel method, OTP, which learns both the OT Map and the Optimal Transport Plan, representing the optimal coupling between two distributions. Under sharp assumptions on the distributions, we prove that our model eliminates the fake solution issue and correctly solves the OT problem. Our experiments show that the OTP model recovers the optimal transport map where existing methods fail and outperforms current OT-based models in image-to-image translation tasks. Notably, the OTP model can learn stochastic transport maps when deterministic OT Maps do not exist, such as one-to-many tasks like colorization.

[1454] Fusing Multi- and Hyperspectral Satellite Data for Harmful Algal Bloom Monitoring with Self-Supervised and Hierarchical Deep Learning

Nicholas LaHaye, Kelly M. Luis, Michelle M. Gierach

Main category: cs.LG

TL;DR: Self-supervised framework (SIT-FUSE) for detecting harmful algal blooms using multi-sensor satellite data fusion without labeled datasets

Details

Motivation: Need for scalable monitoring of harmful algal blooms (HABs) in environments with limited ground truth observations, requiring methods that don't rely on per-instrument labeled datasets

Method: Fuses reflectance data from multiple satellite instruments (VIIRS, MODIS, OLCI, OCI) with TROPOMI solar-induced fluorescence, using self-supervised representation learning and hierarchical deep clustering to segment phytoplankton abundance and species

Result: Strong agreement with in-situ measurements of total phytoplankton, Karenia brevis, and Pseudo-nitzschia spp. from Gulf of Mexico and Southern California (2018-2025)

Conclusion: Advances scalable HAB monitoring and enables exploratory analysis via hierarchical embeddings, moving toward operationalizing self-supervised learning for global aquatic biogeochemistry

Abstract: We present a self-supervised machine learning framework for detecting and mapping the severity and speciation of harmful algal blooms (HABs) using multi-sensor satellite data. By fusing reflectance data from operational polar-orbiting satellite-based instruments (VIIRS, MODIS, OLCI, and OCI) with TROPOMI solar-induced fluorescence (SIF), our framework, called SIT-FUSE, generates HAB severity and speciation products without requiring per-instrument labeled datasets. The framework employs self-supervised representation learning and hierarchical deep clustering to segment phytoplankton cell abundance and species into interpretable classes, validated against in-situ data from the Gulf of Mexico and Southern California (2018-2025). Results show strong agreement with total phytoplankton, Karena brevis, and Pseudo-nitzschia spp. measurements. This work advances scalable HAB monitoring in environments where ground truth observations are limited, while enabling exploratory analysis via hierarchical embeddings - a critical step toward operationalizing self-supervised learning for global aquatic biogeochemistry.

[1455] Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks

Artur Back de Luca, George Giapitzakis, Kimon Fountoulakis

Main category: cs.LG

TL;DR: Neural networks can learn to execute binary algorithmic instructions exactly using NTK framework with logarithmic training data

Details

Motivation: Neural networks fail to generalize perfectly on discrete algorithmic operations like arithmetic, which are fundamental for algorithmic execution testing. The paper investigates whether neural networks can learn to execute binary-encoded algorithmic instructions exactly.

Method: Uses Neural Tangent Kernel (NTK) framework to study training dynamics of two-layer fully connected networks in infinite-width limit. Employs two techniques: structuring training data to isolate bit-level rules, and controlling correlations in NTK regime to align predictions with target algorithmic executions.

Result: Shows that sufficiently large ensemble of such models can be trained to execute exactly, with high probability, four fundamental tasks: binary permutations, binary addition, binary multiplication, and Subtract and Branch if Negative (SBN) instructions. Since SBN is Turing-complete, the framework extends to computable functions.

Conclusion: Neural networks can learn to execute binary algorithmic instructions exactly using logarithmic training data through careful data structuring and NTK correlation control, enabling exact algorithmic execution in neural networks.

Abstract: Neural networks are known for their ability to approximate smooth functions, yet they fail to generalize perfectly to unseen inputs when trained on discrete operations. Such operations lie at the heart of algorithmic tasks such as arithmetic, which is often used as a test bed for algorithmic execution in neural networks. In this work, we ask: can neural networks learn to execute binary-encoded algorithmic instructions exactly? We use the Neural Tangent Kernel (NTK) framework to study the training dynamics of two-layer fully connected networks in the infinite-width limit and show how a sufficiently large ensemble of such models can be trained to execute exactly, with high probability, four fundamental tasks: binary permutations, binary addition, binary multiplication, and Subtract and Branch if Negative (SBN) instructions. Since SBN is Turing-complete, our framework extends to computable functions. We show how this can be efficiently achieved using only logarithmically many training data. Our approach relies on two techniques: structuring the training data to isolate bit-level rules, and controlling correlations in the NTK regime to align model predictions with the target algorithmic executions.

[1456] MLLMEraser: Achieving Test-Time Unlearning in Multimodal Large Language Models through Activation Steering

Chenlu Ding, Jiancan Wu, Leheng Sheng, Fan Zhang, Yancheng Yuan, Xiang Wang, Xiangnan He

Main category: cs.LG

TL;DR: MLLMEraser: A training-free, input-aware framework for test-time unlearning in multimodal LLMs using activation steering to erase specific knowledge without parameter updates.

Details

Motivation: MLLMs face deployment concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning methods are computationally expensive, irreversible, and distort retained knowledge.

Method: Uses activation steering to enable dynamic knowledge erasure without parameter updates. Constructs multimodal erasure direction by contrasting adversarially perturbed knowledge-recall vs. knowledge-erasure image-text pairs. Includes input-aware steering mechanism to adaptively apply erasure direction only when needed.

Result: Outperforms state-of-the-art MLLM unlearning baselines on LLaVA-1.5 and Qwen-2.5-VL, achieving stronger forgetting with lower computational cost and minimal utility degradation.

Conclusion: MLLMEraser provides an effective, efficient training-free solution for test-time unlearning in multimodal LLMs, addressing privacy and safety concerns while preserving model utility.

Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable capabilities across vision-language tasks, yet their large-scale deployment raises pressing concerns about memorized private data, outdated knowledge, and harmful content. Existing unlearning approaches for MLLMs typically adapt training-based strategies such as gradient ascent or preference optimization, but these methods are computationally expensive, irreversible, and often distort retained knowledge. In this work, we propose MLLMEraser, an input-aware, training-free framework for test-time unlearning. Our approach leverages activation steering to enable dynamic knowledge erasure without parameter updates. Specifically, we construct a multimodal erasure direction by contrasting adversarially perturbed, knowledge-recall image-text pairs with knowledge-erasure counterparts, capturing both textual and visual discrepancies. To prevent unnecessary interference, we further design an input-aware steering mechanism that adaptively determines when and how the erasure direction should be applied, preserving utility on retained knowledge while enforcing forgetting on designated content. Experiments on LLaVA-1.5 and Qwen-2.5-VL demonstrate that MLLMEraser consistently outperforms state-of-the-art MLLM unlearning baselines, achieving stronger forgetting performance with lower computational cost and minimal utility degradation.

[1457] Effective and Efficient Cross-City Traffic Knowledge Transfer: A Privacy-Preserving Perspective

Zhihao Zeng, Ziquan Fang, Yuting Huang, Lu Chen, Yunjun Gao

Main category: cs.LG

TL;DR: FedTT is a privacy-aware federated learning framework for cross-city traffic knowledge transfer that addresses data quality, distribution discrepancies, and privacy concerns through imputation, domain adaptation, and secure aggregation.

Details

Motivation: Existing federated traffic knowledge transfer approaches face critical challenges including potential privacy leakage, cross-city data distribution discrepancies, and low data quality, which hinder practical application in real-world traffic prediction scenarios.

Method: FedTT includes three key innovations: 1) traffic view imputation for missing data completion, 2) traffic domain adapter for uniform data transformation to address distribution discrepancies, and 3) traffic secret aggregation protocol for secure data aggregation to protect privacy.

Result: Extensive experiments on 4 real-world datasets demonstrate that FedTT outperforms 14 state-of-the-art baselines in traffic prediction tasks.

Conclusion: FedTT provides an effective privacy-aware and efficient federated learning solution for cross-city traffic knowledge transfer that addresses key practical challenges in traffic prediction applications.

Abstract: Traffic prediction aims to forecast future traffic conditions using historical traffic data, serving a crucial role in urban computing and transportation management. While transfer learning and federated learning have been employed to address the scarcity of traffic data by transferring traffic knowledge from data-rich to data-scarce cities without traffic data exchange, existing approaches in Federated Traffic Knowledge Transfer (FTT) still face several critical challenges such as potential privacy leakage, cross-city data distribution discrepancies, and low data quality, hindering their practical application in real-world scenarios. To this end, we present FedTT, a novel privacy-aware and efficient federated learning framework for cross-city traffic knowledge transfer. Specifically, our proposed framework includes three key innovations: (i) a traffic view imputation method for missing traffic data completion to enhance data quality, (ii) a traffic domain adapter for uniform traffic data transformation to address data distribution discrepancies, and (iii) a traffic secret aggregation protocol for secure traffic data aggregation to safeguard data privacy. Extensive experiments on 4 real-world datasets demonstrate that the proposed FedTT framework outperforms the 14 state-of-the-art baselines.

[1458] PatternKV: Flattening KV Representation Expands Quantization Headroom

Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Main category: cs.LG

TL;DR: PatternKV: A pattern-aligned residual quantization scheme for KV cache compression in LLMs that mines representative pattern vectors online, aligns KV vectors to patterns, and quantizes only residuals to enable efficient 2-bit KV quantization with minimal accuracy loss.

Details

Motivation: KV cache in autoregressive LLMs has become the dominant memory and bandwidth bottleneck during inference, especially with long contexts and test-time scaling. While KV quantization helps reduce cache cost, accuracy drops sharply because the native KV distribution lacks flatness and maintains a wide quantization range. Prior outlier isolation methods cap error but fail to flatten the overall distribution, leaving performance fragile under low-bit settings.

Method: PatternKV is a pattern-aligned residual quantization scheme that: 1) Mines representative pattern vectors online from the KV cache, 2) Aligns each KV vector to its nearest pattern, and 3) Quantizes only the residual difference between the original vector and the pattern. This reshapes the KV distribution by flattening the quantization target and narrowing its range, based on insights that K cache maintains stable context-evolving structure while V cache carries latent semantic regularities.

Result: PatternKV delivers consistent 2-bit gains across long-context and test-time scaling settings on multiple backbones, with only 0.08% average 4-bit drop relative to FP16. It improves test-time scaling accuracy by 10% on average, raises throughput by 1.5x, and supports 1.25x larger batches.

Conclusion: PatternKV effectively addresses KV cache bottlenecks through pattern-aligned residual quantization, enabling efficient low-bit KV quantization with minimal accuracy loss and significant performance improvements for long-context and test-time scaling scenarios in LLMs.

Abstract: KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable, context-evolving structure, while the V cache carries latent semantic regularities, with both contributing to the organization of vectors into shared patterns. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.5x while supporting 1.25x larger batches.

[1459] An Overview of Low-Rank Structures in the Training and Adaptation of Large Models

Laura Balzano, Tianjiao Ding, Benjamin D. Haeffele, Soo Min Kwon, Qing Qu, Peng Wang, Zhangyang Wang, Can Yaras

Main category: cs.LG

TL;DR: A comprehensive tutorial reviewing how deep networks naturally learn low-rank structures during training, with theoretical perspectives on optimization dynamics and implicit regularization, and practical applications in parameter-efficient fine-tuning and training.

Details

Motivation: The paper addresses the computational challenges of large-scale deep learning by exploring how deep networks inherently develop low-rank structures in weights and representations during training, which can be exploited for more efficient training and deployment.

Method: The paper provides a tutorial review with two complementary theoretical perspectives: 1) analyzing low-rank emergence through optimization dynamics of gradient descent throughout training, and 2) understanding it as a result of implicit regularization effects at convergence. These frameworks are then connected to practical techniques.

Result: The review establishes theoretical foundations that explain the success of practical techniques like Low-Rank Adaptation (LoRA) for fine-tuning, inspires new parameter-efficient low-rank training strategies, and explains the effectiveness of masked training approaches like dropout and masked self-supervised learning.

Conclusion: Exploiting inherent low-rank structures in deep networks provides a promising direction for addressing computational challenges in large-scale deep learning, with both theoretical understanding and practical applications for more efficient training and deployment.

Abstract: The substantial computational demands of modern large-scale deep learning present significant challenges for efficient training and deployment. Recent research has revealed a widespread phenomenon wherein deep networks inherently learn low-rank structures in their weights and representations during training. This tutorial paper provides a comprehensive review of advances in exploiting these low-rank structures, bridging mathematical foundations with practical applications. We present two complementary theoretical perspectives on the emergence of low-rankness: viewing it through the optimization dynamics of gradient descent throughout training, and understanding it as a result of implicit regularization effects at convergence. Practically, these theoretical frameworks provide a foundation for understanding the success of techniques such as Low-Rank Adaptation (LoRA) in fine-tuning, inspire new parameter-efficient low-rank training strategies, and explain the effectiveness of masked training approaches like dropout and masked self-supervised learning.

[1460] Unlocking Graph Structure Learning with Tree-Guided Large Language Models

Zhihan Zhang, Xunkai Li, Lei Zhu, Guang Zeng, Bowen Fan, Yanzhe Wen, Hongchao Qin, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: LLaTA: A novel framework that integrates large language models with graph structure learning for text-attributed graphs using tree-based optimization and training-free LLM integration

Details

Motivation: Traditional graph structure learning methods are not designed for text-attributed graphs, and there's a need to develop new paradigms that can effectively leverage LLMs while addressing optimization challenges and computational efficiency concerns.

Method: Reformulates GSL optimization as a tree optimization framework, shifting from edge prediction to language-aware tree sampling. Uses decoupled, training-free model design principles for LLM integration, focusing on efficient inference rather than fine-tuning. Proposes LLaTA which leverages tree-based LLM in-context learning.

Result: Extensive experiments on 11 datasets show LLaTA achieves state-of-the-art predictive performance across various domains, demonstrating flexibility (works with any backbone), scalability (outperforms other LLM-based GSL methods), and effectiveness.

Conclusion: LLaTA provides an effective solution for integrating LLMs with graph structure learning for text-attributed graphs, addressing both optimization and efficiency challenges through tree-based frameworks and training-free LLM integration.

Abstract: Recently, the emergence of large language models (LLMs) has motivated integrating language descriptions into graphs, forming text-attributed graphs (TAGs) that enhance model encoding capabilities from a data-centric perspective. A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the necessity of developing a new GSL paradigm. Despite clear motivations, it remains challenging: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, considering the massive parameters in LLMs? (2) How can we design an efficient model architecture that enables seamless integration of LLMs for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating improved graph structure. Extensive experiments on 11 datasets demonstrate that LLaTA enjoys flexibility-incorporated with any backbone; scalability-outperforms other LLM-based GSL methods; and effectiveness-achieving SOTA predictive performance across a variety of datasets from different domains.

[1461] Myopic Bayesian Decision Theory for Batch Active Learning with Partial Batch Label Sampling

Kangping Hu, Stephen Mussmann

Main category: cs.LG

TL;DR: Derives Bayesian Decision Theory for active learning, introduces ParBaLS algorithm for efficient batch sampling with EPIG, showing superior performance on neural embeddings.

Details

Motivation: Addresses the proliferation of active learning acquisition functions without clear guidance, and the computational challenges of scaling Bayesian methods to large batch sizes.

Method: Derives Bayesian Decision Theory for myopic active learning, leading to EPIG algorithm, then develops Partial Batch Label Sampling (ParBaLS) to efficiently scale EPIG to batch settings using a particular decision process formulation.

Result: ParBaLS EPIG shows superior performance for fixed budget on several datasets with Bayesian Logistic Regression on Neural Embeddings compared to other methods.

Conclusion: Provides principled Bayesian framework for active learning and practical algorithm (ParBaLS) that effectively scales EPIG to batch settings with better performance.

Abstract: Over the past couple of decades, many active learning acquisition functions have been proposed, leaving practitioners with an unclear choice of which to use. Bayesian Decision Theory (BDT) offers a universal principle to guide decision-making. In this work, we derive BDT for (Bayesian) active learning in the myopic framework, where we imagine we only have one more point to label. This derivation leads to effective algorithms such as Expected Error Reduction (EER), Expected Predictive Information Gain (EPIG), and other algorithms that appear in the literature. A key challenge of such methods is the difficult scaling to large batch sizes, leading to either computational challenges (BatchBALD) or dramatic performance drops (top-$B$ selection). Here, using a particular formulation of the decision process, we derive Partial Batch Label Sampling (ParBaLS) for the EPIG algorithm. We show experimentally for several datasets that ParBaLS EPIG gives superior performance for a fixed budget and Bayesian Logistic Regression on Neural Embeddings. Our code is available at https://github.com/ADDAPT-ML/ParBaLS.

[1462] shapr: Explaining Machine Learning Models with Conditional Shapley Values in R and Python

Martin Jullum, Lars Henry Berge Olsen, Jon Lachmann, Annabelle Redelmeier

Main category: cs.LG

TL;DR: The shapr R package and shaprpy Python library provide Shapley value-based model explanation tools with emphasis on conditional Shapley values for accurate feature dependency modeling, supporting tabular data, time series forecasts, and causal analysis.

Details

Motivation: To create comprehensive Shapley value-based explanation tools that accurately capture feature dependencies, which is crucial for correct model interpretation and typically lacking in existing software.

Method: Implements conditional Shapley value estimation with various approaches for modeling feature dependencies, specialized functionality for time series forecasts, parallelized computations, iterative estimation with convergence detection, and causal Shapley value computation when causal information is available.

Result: Developed shapr R package and shaprpy Python library that provide versatile, user-friendly tools for generating accurate Shapley value-based explanations with extensive visualization capabilities and support for various data types including time series.

Conclusion: The shapr and shaprpy packages enhance model interpretability through comprehensive Shapley value estimation with accurate feature dependency modeling, offering both simplicity for common use cases and flexibility for advanced applications.

Abstract: This paper introduces the shapr R package, a versatile tool for generating Shapley value-based prediction explanations for machine learning and statistical regression models. Moreover, the shaprpy Python library brings the core capabilities of shapr to the Python ecosystem. Shapley values originate from cooperative game theory in the 1950s, but have over the past few years become a widely used method for quantifying how a model’s features/covariates contribute to specific prediction outcomes. The shapr package emphasizes conditional Shapley value estimates, providing a comprehensive range of approaches for accurately capturing feature dependencies – a crucial aspect for correct model explanation, typically lacking in similar software. In addition to regular tabular data, the shapr R package includes specialized functionality for explaining time series forecasts. The package offers a minimal set of user functions with sensible default values for most use cases while providing extensive flexibility for advanced users to fine-tune computations. Additional features include parallelized computations, iterative estimation with convergence detection, and rich visualization tools. shapr also extends its functionality to compute causal and asymmetric Shapley values when causal information is available. Overall, the shapr and shaprpy packages aim to enhance the interpretability of predictive models within a powerful and user-friendly framework.

[1463] AnyBCQ: Hardware Efficient Flexible Binary-Coded Quantization for Multi-Precision LLMs

Gunho Park, Jeongin Bae, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

Main category: cs.LG

TL;DR: AnyBCQ is a hardware-friendly multi-precision quantization method for LLMs that enables dynamic precision selection using binary bit-plane representations, improving efficiency while maintaining accuracy.

Details

Motivation: LLM deployment faces memory and latency bottlenecks, requiring flexible quantization techniques that can balance accuracy and efficiency based on runtime constraints. Existing multi-precision models need hardware-friendly implementations that support direct bit-plane operations for optimal efficiency.

Method: Extends Binary-Coded Quantization (BCQ) to support multi-precision inference by representing weights as binary bit-planes with scale factors. Uses progressive precision expansion that incrementally refines scaling factors while reusing binary codes, and co-designs specialized kernels for dynamic per-request precision selection.

Result: Significantly reduces accuracy drop in low-bit regimes (e.g., 2-bit), remains competitive at higher precision, and achieves up to 3.0x throughput gain over half precision and 1.2x over state-of-the-art multi-precision methods.

Conclusion: AnyBCQ provides a practical foundation for multi-precision LLM deployment by aligning algorithmic flexibility with hardware efficiency, enabling dynamic precision selection across diverse service-level objectives.

Abstract: The deployment of large language models (LLMs) is increasingly constrained by memory and latency bottlenecks, motivating the need for quantization techniques that flexibly balance accuracy and efficiency. Recent work has introduced multi-precision models, which enable inference at multiple precisions within a single model depending on runtime constraints. To support such flexibility, quantized weights are often stored as bit-planes, where hardware efficiency improves when the compute operates directly at the bit-plane level and activates only the precision required by each request. In this work, we present AnyBCQ, a hardware-friendly multi-precision extension of Binary-Coded Quantization (BCQ) that supports direct bit-plane operations. By representing weights as binary bit-planes with corresponding scale factors, AnyBCQ enables bit-plane-level computation and maps naturally to accelerator-friendly, bit-parallel arithmetic. Our progressive precision expansion mechanism incrementally refines scaling factors while reusing previously assigned binary codes, yielding monotonic improvements in accuracy as additional bits are enabled. We further co-design a specialized kernel that exploits the BCQ structure to support dynamic per-request precision selection with negligible overhead. Experiments on recent LLMs demonstrate that AnyBCQ significantly narrows the accuracy drop in the low-bit regime (e.g. 2-bit), remains competitive at higher precision, and achieves throughput gains of up to 3.0x over half precision and 1.2x over state-of-the-art multi-precision methods. By aligning algorithmic flexibility with hardware efficiency, AnyBCQ provides a practical foundation for multi-precision LLM deployment across diverse service-level objectives.

[1464] ASIL: Augmented Structural Information Learning for Deep Graph Clustering in Hyperbolic Space

Li Sun, Zhenhao Huang, Yujie Wang, Hongbo Lv, Chunyang Liu, Hao Peng, Philip S. Yu

Main category: cs.LG

TL;DR: ASIL: A differentiable structural information framework for deep graph clustering without requiring predefined cluster numbers, handling imbalanced graphs via hyperbolic partitioning trees and contrastive learning.

Details

Motivation: Existing deep graph clustering methods require predefined cluster numbers K and struggle with imbalanced graphs. Structural information theory is rarely used in deep clustering, and its classic discrete definition neglects node attributes while having prohibitive complexity.

Method: Proposes ASIL (Augmented Structural Information Learning) framework: 1) Differentiable structural information framework generalizing discrete formalism to continuous realm, 2) LSEnet hyperbolic model to learn neural partitioning tree in Lorentz model, 3) Tree contrastive learning with structural entropy bound, 4) Integration of hyperbolic partitioning tree construction and contrastive learning with linear complexity.

Result: ASIL outperforms 20 strong baselines by an average of +12.42% in NMI on Citeseer dataset. Achieves effective debiased graph clustering in linear complexity with provable improvement in graph conductance.

Conclusion: ASIL provides an efficient framework for deep graph clustering without requiring predefined cluster numbers, handling imbalanced graphs through differentiable structural information theory and hyperbolic representations.

Abstract: Graph clustering is a longstanding topic in machine learning. Recently, deep methods have achieved results but still require predefined cluster numbers K and struggle with imbalanced graphs. We study deep graph clustering without K considering realistic imbalance through structural information theory. In the literature, structural information is rarely used in deep clustering, and its classic discrete definition neglects node attributes while exhibiting prohibitive complexity. In this paper, we establish a differentiable structural information framework, generalizing the discrete formalism to the continuous realm. We design a hyperbolic model (LSEnet) to learn the neural partitioning tree in the Lorentz model. Theoretically, we demonstrate its capability in clustering without K and identifying minority clusters. Second, we refine hyperbolic representations to enhance graph semantics. Since tree contrastive learning is non-trivial and costs quadratic complexity, we advance our theory by discovering that structural entropy bounds the tree contrastive loss. Finally, we approach graph clustering through a novel augmented structural information learning (ASIL), which offers an efficient objective to integrate hyperbolic partitioning tree construction and contrastive learning. With a provable improvement in graph conductance, ASIL achieves effective debiased graph clustering in linear complexity. Extensive experiments show ASIL outperforms 20 strong baselines by an average of +12.42% in NMI on the Citeseer dataset.

[1465] Message Passing on the Edge: Towards Scalable and Expressive GNNs

Pablo Barceló, Fabian Jogl, Alexander Kozachinskiy, Matthias Lanzinger, Stefan Neumann, Cristóbal Rojas

Main category: cs.LG

TL;DR: EB-GNN is an edge-based graph neural network architecture that performs message passing on edges and triangles, offering stronger theoretical expressivity than traditional vertex-based approaches while maintaining near-linear computational efficiency.

Details

Motivation: Most GNNs propagate information through vertices, but this work explores edge-based message passing to achieve greater expressivity while maintaining computational efficiency, inspired by classic triangle-counting algorithms.

Method: Introduces EB-1WL (edge-based color-refinement test) and EB-GNN architecture that passes messages along edges and triangles, with theoretical foundations in first-order logic and homomorphism counting.

Result: EB-1WL is significantly more expressive than traditional 1WL, EB-GNN has strongest theoretical expressivity among edge-based GNNs, maintains near-linear time/memory usage, and outperforms simple MPNNs while competing with state-of-the-art GNNs at lower cost.

Conclusion: EB-GNN provides a highly efficient general-purpose GNN architecture with strong theoretical expressivity guarantees and practical computational advantages over both simple and specialized GNN approaches.

Abstract: Graph neural networks (GNNs) are widely used in graph learning and most architectures propagate information by passing messages between vertices. In this work, we shift our attention to GNNs that perform message passing on edges and introduce EB-1WL, an edge-based color-refinement test, and a corresponding architecture, EB-GNN. Our EB-GNN architecture is inspired by the classic triangle-counting algorithm of Chiba and Nishizeki and passes messages along edges and triangles. Our contributions are as follows: (1) Theoretically, we show that EB-1WL is significantly more expressive than 1WL. We provide a complete logical characterization of EB-1WL in first-order logic, along with distinguishability results via homomorphism counting. To the best of our knowledge, EB-GNN has the strongest theoretical expressivity guarantees among edge-based message-passing GNNs in the literature. (2) Unlike many GNN architectures that are more expressive than 1WL, we prove that EB-1WL and EB-GNN admit near-linear time and memory usage on practical graph learning workloads. (3) We show in experiments that EB-GNN is a highly efficient general-purpose architecture: it substantially outperforms simple MPNNs and remains competitive with task-specialized state-of-the-art GNNs at substantially lower computational cost.

[1466] TSCAN: Context-Aware Uplift Modeling via Two-Stage Training for Online Merchant Business Diagnosis

Hangtao Zhang, Zhe Li, Kairui Zhang

Main category: cs.LG

TL;DR: TSCAN: A two-stage context-aware uplift modeling approach for ITE estimation that addresses sample selection bias through CAN-U (with regularization) and CAN-D (without regularization) models with context-aware attention.

Details

Motivation: Traditional ITE estimation methods suffer from sample selection bias and use regularization techniques (IPM, re-weighting, propensity scores) that cause information loss. Existing methods also fail to fully utilize contextual features that affect treatment effects across different external contexts.

Method: Two-stage approach: 1) CAN-U model with IPM and propensity score regularization generates complete dataset with counterfactual uplift labels; 2) CAN-D model with isotonic output layer directly models uplift effects without regularization, adaptively correcting CAN-U errors while reinforcing factual samples. Context-Aware Attention Layer manages interactions between treatment, merchant, and contextual features throughout both stages.

Result: Extensive experiments on two real-world datasets validate TSCAN’s effectiveness. Deployment on one of China’s largest online food ordering platforms demonstrates practical utility and impact for real-world merchant diagnosis.

Conclusion: TSCAN successfully addresses limitations of traditional ITE estimation methods by combining two-stage training with context-aware attention, eliminating reliance on problematic regularizations while better modeling varying treatment effects across different contexts.

Abstract: A primary challenge in ITE estimation is sample selection bias. Traditional approaches utilize treatment regularization techniques such as the Integral Probability Metrics (IPM), re-weighting, and propensity score modeling to mitigate this bias. However, these regularizations may introduce undesirable information loss and limit the performance of the model. Furthermore, treatment effects vary across different external contexts, and the existing methods are insufficient in fully interacting with and utilizing these contextual features. To address these issues, we propose a Context-Aware uplift model based on the Two-Stage training approach (TSCAN), comprising CAN-U and CAN-D sub-models. In the first stage, we train an uplift model, called CAN-U, which includes the treatment regularizations of IPM and propensity score prediction, to generate a complete dataset with counterfactual uplift labels. In the second stage, we train a model named CAN-D, which utilizes an isotonic output layer to directly model uplift effects, thereby eliminating the reliance on the regularization components. CAN-D adaptively corrects the errors estimated by CAN-U through reinforcing the factual samples, while avoiding the negative impacts associated with the aforementioned regularizations. Additionally, we introduce a Context-Aware Attention Layer throughout the two-stage process to manage the interactions between treatment, merchant, and contextual features, thereby modeling the varying treatment effect in different contexts. We conduct extensive experiments on two real-world datasets to validate the effectiveness of TSCAN. Ultimately, the deployment of our model for real-world merchant diagnosis on one of China’s largest online food ordering platforms validates its practical utility and impact.

[1467] Context-Selective State Space Models: Feedback is All You Need

Riccardo Zattra, Giacomo Baggio, Umberto Casti, Augusto Ferrante, Francesco Ticozzi

Main category: cs.LG

TL;DR: COFFEE introduces a novel time-varying state space model with state feedback for context-dependent selectivity, enabling efficient long-range dependency capture with parallel implementation.

Details

Motivation: Transformers have quadratic complexity and struggle with long-range dependencies. State space models offer an alternative, but need improvements in context-dependent selectivity and efficiency.

Method: COFFEE is a time-varying SSM that incorporates state feedback to enable context-dependent selectivity while maintaining parallel implementation capability. The model regulates its dynamics based on internal state context.

Result: Achieves near-perfect accuracy on induction head task with 100x fewer parameters/training sequences than S6 (Mamba’s SSM). On MNIST, reaches 97% accuracy with only 3585 parameters, outperforming S6.

Conclusion: State feedback is a key mechanism for building scalable and efficient sequence models, enabling better long-range dependency capture with fewer parameters.

Abstract: Transformers, powered by the attention mechanism, are the backbone of most foundation models, yet they suffer from quadratic complexity and difficulties in dealing with long-range dependencies in the input sequence. Recent work has shown that state space models (SSMs) provide a promising alternative. In this paper, we introduce the COFFEE (COntext From FEEdback) model, a novel time-varying SSM that incorporates state feedback to enable context-dependent selectivity, while still allowing for parallel implementation. This idea allows the model to regulate its dynamics based on the context described by the internal state, which embodies a compact representation of the input history. State feedback allows COFFEE to improve its ability to capture long-range dependencies: on the induction head task, it achieves near-perfect accuracy with two orders of magnitude fewer parameters and training sequences compared to S6 (the SSM of Mamba). On MNIST, COFFEE largely outperforms S6 within the same architecture, reaching 97% accuracy with only 3585 parameters. These results showcase the role of state feedback as a key mechanism for building scalable and efficient sequence models.

[1468] Bubble2Heat: Optical to Thermal Inference in Pool Boiling Using Physics-encoded Generative AI

Qianxi Fu, Youngjoon Suh, Xiaojing Zhang, Sanghyeon Chang, Yoonjin Won

Main category: cs.LG

TL;DR: Deep learning framework generates high-resolution temperature fields from experimental pool boiling data using conditional GANs trained on simulation data.

Details

Motivation: Quantitative characterization of multiphase heat transfer is limited by challenges in measuring temperature fields in chaotic, rapidly evolving flow regimes. Computational methods have resolution but struggle to replicate complex experimental conditions.

Method: Conditional generative adversarial network trained on simulation data, with preprocessing pipeline aligning simulation data with experimental measurements using image processing and pretrained CNN segmentation. Data augmentation enhances physical plausibility.

Result: Framework successfully generates temperature field data at simulation resolution from segmented high-speed recordings and pointwise thermocouple readings available in canonical pool boiling experiments.

Conclusion: Deep generative models can bridge the gap between observable multiphase phenomena and underlying thermal transport, offering powerful approach to augment and interpret experimental measurements in complex two-phase systems.

Abstract: Phase change process plays a critical role in thermal management systems, yet quantitative characterization of multiphase heat transfer remains limited by the challenges of measuring temperature fields in chaotic, rapidly evolving flow regimes. While computational methods offer temperature data at a high spatiotemporal resolution in ideal cases, replicating complex experimental conditions remains prohibitively difficult. In this paper, we present a deep learning framework that can generate temperature field data at simulation resolution from segmented high-speed recordings and pointwise thermocouple readings which are typically available in a canonical pool boiling experimental configuration without requiring advanced techniques. This framework leverages a conditional generative adversarial network trained only on simulation data. To ensure direct applicability of the model to experimental data, our framework also introduces a preprocessing pipeline that aligns high resolution simulation data with experimental measurements through both conventional image processing and image segmentation with pretrained convolutional neural network. We further show that standard data augmentation strategies are effective in enhancing the physical plausibility of the inference when precise physical constraints are not applicable. Our results highlight the potential of deep generative models to bridge the gap between observable multiphase phenomena and underlying thermal transport, offering a powerful approach to augment and interpret experimental measurements in complex two-phase systems.

[1469] Model-agnostic Selective Labeling with Provable Statistical Guarantees

Huipeng Huang, Wenbo Liao, Huajun Xi, Hao Zeng, Mengchen Zhao, Hongxin Wei

Main category: cs.LG

TL;DR: Conformal Labeling is a method that uses conformal prediction to identify which AI-generated labels can be provably trusted by controlling the false discovery rate, ensuring a predefined fraction of AI-assigned labels is correct.

Details

Motivation: AI models offer cost-effective labeling but produce unreliable labels with errors. Existing selective labeling methods lack theoretical guarantees on AI label quality, leading to unacceptably high error rates in AI-labeled subsets.

Method: Construct conformal p-values for each test instance by comparing AI models’ predicted confidence to those of calibration instances mislabeled by AI models. Select test instances whose p-values are below a data-dependent threshold to certify AI predictions as trustworthy.

Result: The method achieves tight false discovery rate (FDR) control with high power across various tasks including image and text labeling, and LLM QA, providing theoretical guarantees that FDR is controlled below nominal levels.

Conclusion: Conformal Labeling provides a principled approach to identify trustworthy AI-generated labels with theoretical guarantees, addressing the reliability problem in AI-assisted data labeling.

Abstract: Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and human labels the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce \textbf{Conformal Labeling}, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models’ predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models’ predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.

[1470] WaterDrum: Watermarking for Data-centric Unlearning Metric

Xinyang Lu, Xinyuan Niu, Gregory Kang Ruey Lau, Bui Thi Cam Nhung, Rachael Hwee Ling Sim, John Russell Himawan, Fanyu Wen, Chuan-Sheng Foo, See-Kiong Ng, Bryan Kian Hsiang Low

Main category: cs.LG

TL;DR: WaterDrum is a data-centric unlearning metric for LLMs that uses text watermarking to evaluate unlearning effectiveness, addressing limitations of utility-centric metrics when forget/retain sets have semantic overlap.

Details

Motivation: Existing LLM unlearning metrics fail when forget and retain sets have semantically similar content or when retraining from scratch is impractical. Need for more accurate evaluation of unlearning effectiveness in realistic scenarios.

Method: Uses robust text watermarking to create a data-centric unlearning metric. Introduces new benchmark datasets with varying levels of data similarity for rigorous evaluation of unlearning algorithms.

Result: Developed WaterDrum metric and released benchmark datasets on HuggingFace. Provides more accurate evaluation of unlearning effectiveness compared to utility-centric metrics.

Conclusion: WaterDrum offers a practical data-centric approach to evaluate LLM unlearning, especially useful when forget/retain sets overlap semantically, addressing limitations of existing evaluation methods.

Abstract: Large language model (LLM) unlearning is critical in real-world applications where it is necessary to efficiently remove the influence of private, copyrighted, or harmful data from some users. Existing utility-centric unlearning metrics (based on model utility) may fail to accurately evaluate the extent of unlearning in realistic settings such as when the forget and retain sets have semantically similar content and/or retraining the model from scratch on the retain set is impractical. This paper presents the first data-centric unlearning metric for LLMs called WaterDrum that exploits robust text watermarking to overcome these limitations. We introduce new benchmark datasets (with different levels of data similarity) for LLM unlearning that can be used to rigorously evaluate unlearning algorithms via WaterDrum. Our code is available at https://github.com/lululu008/WaterDrum and our new benchmark datasets are released at https://huggingface.co/datasets/Glow-AI/WaterDrum-Ax.

[1471] LinearizeLLM: An Agent-Based Framework for LLM-Driven Exact Linear Reformulation of Nonlinear Optimization Problems

Paul-Niklas Ken Kandora, Simon Caspar Zeller, Aaron Jeremias Elsing, Elena Kuss, Steffen Rebennack

Main category: cs.LG

TL;DR: LinearizeLLM is an agent-based LLM framework that automatically converts nonlinear optimization problems into solver-ready linear formulations using pattern detection and specialized reformulation techniques.

Details

Motivation: Current methods for reformulating nonlinear optimization problems into linear forms are manual and require domain expertise, creating a barrier to practical applications. There's a need for automated approaches that can handle complex nonlinear patterns.

Method: Uses an agent-based LLM framework where specialized agents first detect nonlinearity patterns (e.g., bilinear products), then apply pattern-aware reformulation techniques to select the most suitable linearization method for each detected pattern.

Result: Achieved 73% mean end-to-end overall success rate across nonlinearity depths, which is 8.3x higher than a one-shot LLM baseline and 4.3x higher than Pyomo. Tested on 40 instances including 27 from ComplexOR with injected linearizable operators and 13 automatically generated instances with deeply nested nonlinearities.

Conclusion: Pattern-specialized agents can effectively automate the linearization process, supporting natural-language-based modeling of nonlinear optimization problems and reducing the need for manual expertise.

Abstract: Reformulating nonlinear optimization problems into solver-ready linear optimization problems is often necessary for practical applications, but the process is often manual and requires domain expertise. We propose LinearizeLLM, an agent-based LLM framework that produces solver-ready linear reformulations of nonlinear optimization problems. Agents first detect the nonlinearity pattern (e.g., bilinear products) and apply nonlinearity pattern-aware reformulation techniques, selecting the most suitable linearization technique. We benchmark on 40 instances: 27 derived from ComplexOR by injecting exactly-linearizable operators, and 13 automatically generated instances with deeply nested nonlinearities. LinearizeLLM achieves 73% mean end-to-end overall success (OSR) across nonlinearity depths (8.3x higher than a one-shot LLM baseline; 4.3x higher than Pyomo). The results suggest that a set of pattern-specialized agents can automate linearization, supporting natural-language-based modeling of nonlinear optimization.

[1472] Scaling Gaussian Process Regression with Full Derivative Observations

Daniel Huang

Main category: cs.LG

TL;DR: DSoftKI extends SoftKI for scalable Gaussian Processes with derivative observations using local temperature vectors to encode directional sensitivity, enabling efficient kernel approximation without kernel derivatives.

Details

Motivation: Existing Gaussian Process methods struggle with scalability when incorporating derivative observations, especially in high-dimensional settings like molecular force field prediction. There's a need for methods that can efficiently handle full derivative information while maintaining accuracy.

Method: Extends SoftKI’s softmax interpolation approach by replacing global temperature vectors with local temperature vectors at each interpolation point. This allows encoding local directional sensitivity and constructing scalable approximate kernels including first and second-order derivatives through interpolation, eliminating the need for kernel derivatives.

Result: DSoftKI demonstrates accuracy and scalability on synthetic benchmarks, toy n-body physics simulations, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). It scales to larger datasets with full derivative observations than previously possible.

Conclusion: DSoftKI provides an effective scalable Gaussian Process method for handling derivative observations, particularly valuable for high-dimensional scientific applications like molecular modeling where derivative information is crucial.

Abstract: We present a scalable Gaussian Process (GP) method called DSoftKI that can fit and predict full derivative observations. It extends SoftKI, a method that approximates a kernel via softmax interpolation, to the setting with derivatives. DSoftKI enhances SoftKI’s interpolation scheme by replacing its global temperature vector with local temperature vectors associated with each interpolation point. This modification allows the model to encode local directional sensitivity, enabling the construction of a scalable approximate kernel, including its first and second-order derivatives, through interpolation. Moreover, the interpolation scheme eliminates the need for kernel derivatives, facilitating extensions such as Deep Kernel Learning (DKL). We evaluate DSoftKI on synthetic benchmarks, a toy n-body physics simulation, standard regression datasets with synthetic gradients, and high-dimensional molecular force field prediction (100-1000 dimensions). Our results demonstrate that DSoftKI is accurate and scales to larger datasets with full derivative observations than previously possible.

[1473] Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

Zijun Chen, Shengbo Wang, Nian Si

Main category: cs.LG

TL;DR: Distributionally robust average-reward RL algorithms with near-optimal sample complexity for stable long-term performance in practical applications.

Details

Motivation: Practical applications like robotics, operations research, and healthcare require stable long-term performance, motivating the study of distributionally robust average-reward reinforcement learning to handle uncertainty in transition dynamics.

Method: Two algorithms: 1) Reduction to DR discounted MDP, 2) Anchored DR Average-Reward MDP with anchoring state to stabilize controlled transition kernels within uncertainty sets. Both assume nominal MDP is uniformly ergodic.

Result: Achieved sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating optimal policy and robust average reward under KL and $f_k$-divergence uncertainty sets. First finite-sample convergence guarantee for DR average-reward RL.

Conclusion: Proposed algorithms provide theoretical guarantees for distributionally robust average-reward RL with practical applications requiring stable long-term performance, validated through numerical experiments.

Abstract: Motivated by practical applications where stable long-term performance is critical-such as robotics, operations research, and healthcare-we study the problem of distributionally robust (DR) average-reward reinforcement learning. We propose two algorithms that achieve near-optimal sample complexity. The first reduces the problem to a DR discounted Markov decision process (MDP), while the second, Anchored DR Average-Reward MDP, introduces an anchoring state to stabilize the controlled transition kernels within the uncertainty set. Assuming the nominal MDP is uniformly ergodic, we prove that both algorithms attain a sample complexity of $\widetilde{O}\left(|\mathbf{S}||\mathbf{A}| t_{\mathrm{mix}}^2\varepsilon^{-2}\right)$ for estimating the optimal policy as well as the robust average reward under KL and $f_k$-divergence-based uncertainty sets, provided the uncertainty radius is sufficiently small. Here, $\varepsilon$ is the target accuracy, $|\mathbf{S}|$ and $|\mathbf{A}|$ denote the sizes of the state and action spaces, and $t_{\mathrm{mix}}$ is the mixing time of the nominal MDP. This represents the first finite-sample convergence guarantee for DR average-reward reinforcement learning. We further validate the convergence rates of our algorithms through numerical experiments.

[1474] On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating

Huy Nguyen, Thong T. Doan, Quang Pham, Nghi D. Q. Bui, Nhat Ho, Alessandro Rinaldo

Main category: cs.LG

TL;DR: Theoretical analysis of DeepSeekMoE’s shared expert strategy and normalized sigmoid gating, showing improved sample efficiency and convergence in expert estimation tasks.

Details

Motivation: Despite DeepSeekMoE's success in large language models, there's limited theoretical understanding of its two key features: shared expert strategy and normalized sigmoid gating mechanism.

Method: Comprehensive theoretical study from statistical perspective with convergence analysis of expert estimation task, plus empirical validation on synthetic and real-world datasets for vision/language modeling tasks.

Result: Theoretical analysis shows gains in sample efficiency for both features; empirical experiments verify findings; extensive analysis of router behaviors including saturation, change rate, and expert utilization.

Conclusion: Provides theoretical justification for DeepSeekMoE design choices and insights for expert/gating structure design in mixture of experts architectures.

Abstract: Mixture of experts (MoE) methods are a key component in most large language model architectures, including the recent series of DeepSeek models. Compared to other MoE implementations, DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and of the normalized sigmoid gating mechanism. Despite the prominent role of DeepSeekMoE in the success of the DeepSeek series of models, there have been only a few attempts to justify theoretically the value of the shared expert strategy, while its normalized sigmoid gating has remained unexplored. To bridge this gap, we undertake a comprehensive theoretical study of these two features of DeepSeekMoE from a statistical perspective. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating, offering useful insights into the design of expert and gating structures. To verify empirically our theoretical findings, we carry out several experiments on both synthetic data and real-world datasets for (vision) language modeling tasks. Finally, we conduct an extensive empirical analysis of the router behaviors, ranging from router saturation, router change rate, to expert utilization.

[1475] BOLT-GAN: Bayes-Error-Motivated Objective for Stable GAN Training

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero

Main category: cs.LG

TL;DR: BOLT-GAN introduces a novel GAN training framework using Bayes optimal learning threshold (BOLT) loss for stable training, improving image generation metrics across benchmarks.

Details

Motivation: The paper aims to address stability issues in GAN training by proposing a novel framework that links the training objective to a min-max Bayes error criterion, seeking more stable and effective training.

Method: The discriminator is trained using BOLT loss under a standard 1-Lipschitz constraint, which guides the generator to maximize the Bayes error of the discrimination task. The framework represents a class of metrics on probability measures controlled by a 1-Lipschitz discriminator minimizing an integral probability metric upper-bounded by Wasserstein-1 distance.

Result: Across four standard image-generation benchmarks, BOLT-GAN improves FID and precision/recall over benchmark GAN frameworks under identical architectures and training budgets.

Conclusion: The experimental findings confirm the advantage of linking GAN training objective to a min-max Bayes error criterion, providing a more stable training framework for generative models.

Abstract: We introduce BOLT-GAN, a novel framework for stable GAN training using the Bayes optimal learning threshold (BOLT). The discriminator is trained via the BOLT loss under a standard 1-Lipschitz constraint. This guides the generator to maximize the Bayes error of the discrimination task. We show that the training objective in this case represents a class of metrics on probability measures controlled by a 1-Lipschitz discriminator minimizing an integral probability metric that is upper-bounded by Wasserstein-1 distance. Across four standard image-generation benchmarks, BOLT-GAN improves FID and precision/recall over benchmark GAN frameworks under identical architectures and training budgets. Our experimental findings further confirm the advantage of linking the GAN training objective to a min-max Bayes error criterion.

[1476] Context-Free Synthetic Data Mitigates Forgetting

Parikshit Bansal, Sujay Sanghavi

Main category: cs.LG

TL;DR: Fine-tuning language models causes forgetting; using context-free generations to estimate KL divergence helps preserve original model performance without access to training data.

Details

Motivation: Fine-tuning language models often leads to catastrophic forgetting of previously learned capabilities. The paper addresses this problem in scenarios where only model weights are available, not the original training data or recipe.

Method: Proposes using context-free generation to create synthetic data for approximate unbiased estimation of KL divergence between original and fine-tuned models. Augments fine-tuning datasets with these context-free generations to mitigate forgetting. Compares effectiveness against contextual synthetic data and pretraining data subsets.

Result: Context-free generations effectively mitigate forgetting in two settings: preserving zero-shot performance of pretrained-only models (OLMo-1B) and preserving reasoning performance of thinking models (R1-Distill-Llama-8B). Contextual synthetic data and pretraining data subsets are less effective.

Conclusion: Context-free generation provides an effective method for mitigating catastrophic forgetting during fine-tuning when original training data is unavailable, outperforming other synthetic data approaches.

Abstract: Fine-tuning a language model often results in a degradation of its existing performance on other tasks, due to a shift in the model parameters; this phenomenon is often referred to as (catastrophic) forgetting. We are interested in mitigating this, in settings where we only have access to the model weights but no access to its training data/recipe. A natural approach is to penalize the KL divergence between the original model and the new one. Our main realization is that a simple process - which we term context-free generation - allows for an approximate unbiased estimation of this KL divergence. We show that augmenting a fine-tuning dataset with context-free generations mitigates forgetting, in two settings: (a) preserving the zero-shot performance of pretrained-only models, and (b) preserving the reasoning performance of thinking models. We show that contextual synthetic data, and even a portion of the pretraining data, are less effective. We also investigate the effect of choices like generation temperature, data ratios etc. We present our results for OLMo-1B for pretrained-only setting and R1-Distill-Llama-8B for the reasoning setting.

[1477] Reward-Aware Proto-Representations in Reinforcement Learning

Hon Tik Tse, Siddarth Chandrasekar, Marlos C. Machado

Main category: cs.LG

TL;DR: The paper introduces the Default Representation (DR) as a reward-aware alternative to the Successor Representation (SR) in reinforcement learning, providing theoretical foundations and empirical benefits for tasks like reward shaping, option discovery, exploration, and transfer learning.

Details

Motivation: While the Successor Representation (SR) has been useful in RL for addressing challenges like exploration and credit assignment, it is reward-agnostic. The authors propose the Default Representation (DR) as a similar representation that incorporates reward dynamics, aiming to provide reward-aware behavior and better performance in various RL settings.

Method: The authors develop theoretical foundations for DR in tabular cases by deriving dynamic programming and temporal-difference learning methods, characterizing the basis for DR’s vector space, and extending DR to function approximation through default features. Empirically, they analyze DR’s benefits in settings where SR has been applied, including reward shaping, option discovery, exploration, and transfer learning.

Result: Results show that compared to SR, DR produces qualitatively different reward-aware behavior and quantitatively better performance in several settings. The DR demonstrates advantages in reward-aware tasks due to its incorporation of reward dynamics.

Conclusion: The Default Representation provides a theoretically grounded, reward-aware alternative to the Successor Representation that offers improved performance in various reinforcement learning applications, particularly in tasks requiring reward-sensitive behavior.

Abstract: In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.

[1478] Reaction Prediction via Interaction Modeling of Symmetric Difference Shingle Sets

Runhan Shi, Letian Chen, Gufeng Yu, Yang Yang

Main category: cs.LG

TL;DR: ReaDISH is a novel chemical reaction prediction model that addresses permutation sensitivity and inadequate substructural interaction modeling through symmetric difference shingle encoding and geometry-structure interaction attention mechanisms.

Details

Motivation: Existing machine learning models for chemical reaction prediction suffer from sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity, leading to inconsistent predictions and poor generalization to real-world scenarios.

Method: Two key innovations: (1) symmetric difference shingle encoding that extends differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings to capture structural changes while eliminating order sensitivity; (2) geometry-structure interaction attention mechanism that models intra- and inter-molecular interactions at the shingle level.

Result: Extensive experiments show ReaDISH improves reaction prediction performance across diverse benchmarks and demonstrates enhanced robustness with an average improvement of 8.76% on R² under permutation perturbations.

Conclusion: ReaDISH effectively addresses critical limitations in chemical reaction prediction by learning permutation-invariant representations while incorporating interaction-aware features, leading to more robust and accurate predictions.

Abstract: Chemical reaction prediction remains a fundamental challenge in organic chemistry, where existing machine learning models face two critical limitations: sensitivity to input permutations (molecule/atom orderings) and inadequate modeling of substructural interactions governing reactivity. These shortcomings lead to inconsistent predictions and poor generalization to real-world scenarios. To address these challenges, we propose ReaDISH, a novel reaction prediction model that learns permutation-invariant representations while incorporating interaction-aware features. It introduces two innovations: (1) symmetric difference shingle encoding, which extends the differential reaction fingerprint (DRFP) by representing shingles as continuous high-dimensional embeddings, capturing structural changes while eliminating order sensitivity; and (2) geometry-structure interaction attention, a mechanism that models intra- and inter-molecular interactions at the shingle level. Extensive experiments demonstrate that ReaDISH improves reaction prediction performance across diverse benchmarks. It shows enhanced robustness with an average improvement of 8.76% on R$^2$ under permutation perturbations.

[1479] Can Test-time Computation Mitigate Reproduction Bias in Neural Symbolic Regression?

Shun Sato, Issei Sato

Main category: cs.LG

TL;DR: Analysis of neural symbolic regression (NSR) reveals two key limitations: token-by-token generation prevents numerical consistency validation, and reproduction bias restricts search space by copying training expressions.

Details

Motivation: To understand why neural symbolic regression methods using Transformers pre-trained on synthetic data often perform poorly, especially with many input variables, and to analyze their fundamental limitations.

Method: Theoretical and empirical analysis of NSR approaches, examining token-by-token generation limitations and reproduction bias, plus investigation of test-time strategies to mitigate bias.

Result: Found that Transformers cannot compositionally generate tokens while validating numerical consistency, and that reproduction bias causes most generated expressions to be copied from training data. Test-time strategies with additional information can mitigate reproduction bias.

Conclusion: NSR has fundamental limitations in token generation and suffers from reproduction bias, but test-time strategies can help. These findings provide guidance for designing more robust symbolic regression methods.

Abstract: Mathematical expressions play a central role in scientific discovery. Symbolic regression aims to automatically discover such expressions from given numerical data. Recently, Neural symbolic regression (NSR) methods that involve Transformers pre-trained on synthetic datasets have gained attention for their fast inference, but they often perform poorly, especially with many input variables. In this study, we analyze NSR from both theoretical and empirical perspectives and show that (1) ordinary token-by-token generation is ill-suited for NSR, as Transformers cannot compositionally generate tokens while validating numerical consistency, and (2) the search space of NSR methods is greatly restricted due to reproduction bias, where the majority of generated expressions are merely copied from the training data. We further examine whether tailored test-time strategies can reduce reproduction bias and show that providing additional information at test time effectively mitigates it. These findings contribute to a deeper understanding of the limitation of NSR approaches and provide guidance for designing more robust and generalizable methods. Code is available at https://github.com/Shun-0922/Mem-Bias-NSR .

[1480] Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning

Jiashun Liu, Zihao Wu, Johan Obando-Ceron, Pablo Samuel Castro, Aaron Courville, Ling Pan

Main category: cs.LG

TL;DR: GraMa is a gradient-based metric for measuring neuron learning capacity in deep RL agents, addressing limitations of activation-based dormant neuron detection in complex architectures.

Details

Motivation: Existing methods for detecting dormant neurons in RL agents (using activation statistics like tau-dormant neuron ratio) lose statistical power in complex architectures, and the authors argue that maintaining a neuron's learning capacity (ability to adapt via gradients) is more critical than preserving expressive ability.

Method: Shift from activation-based to gradient-based statistics, introducing GraMa (Gradient Magnitude Neural Activity Metric) - a lightweight, architecture-agnostic metric that quantifies neuron-level learning capacity by analyzing gradient magnitudes rather than activations.

Result: GraMa effectively reveals persistent neuron inactivity across diverse architectures (residual networks, diffusion models, varied activation functions). Resetting neurons guided by GraMa (ReGraMa) consistently improves learning performance across multiple deep RL algorithms and benchmarks (MuJoCo, DeepMind Control Suite).

Conclusion: Gradient-based metrics like GraMa are more effective than activation-based approaches for quantifying neuron learning capacity in complex RL architectures, and guided neuron resetting using these metrics improves agent performance.

Abstract: Deep reinforcement learning (RL) agents frequently suffer from neuronal activity loss, which impairs their ability to adapt to new data and learn continually. A common method to quantify and address this issue is the tau-dormant neuron ratio, which uses activation statistics to measure the expressive ability of neurons. While effective for simple MLP-based agents, this approach loses statistical power in more complex architectures. To address this, we argue that in advanced RL agents, maintaining a neuron’s learning capacity, its ability to adapt via gradient updates, is more critical than preserving its expressive ability. Based on this insight, we shift the statistical objective from activations to gradients, and introduce GraMa (Gradient Magnitude Neural Activity Metric), a lightweight, architecture-agnostic metric for quantifying neuron-level learning capacity. We show that GraMa effectively reveals persistent neuron inactivity across diverse architectures, including residual networks, diffusion models, and agents with varied activation functions. Moreover, resetting neurons guided by GraMa (ReGraMa) consistently improves learning performance across multiple deep RL algorithms and benchmarks, such as MuJoCo and the DeepMind Control Suite.

[1481] Global Feature Enhancing and Fusion Framework for Strain Gauge Time Series Classification

Xu Zhang, Peng Wang, Chen Wang, Zhe Xu, Xiaohua Nie, Wei Wang

Main category: cs.LG

TL;DR: A hypergraph-based framework for learning and fusing global features to improve time series classification of Strain Gauge Status data, addressing limitations of CNNs that only capture local features.

Details

Motivation: Current CNN-based time series classification methods only extract local features, which is insufficient when local subsequences between different time series are very similar (e.g., in aircraft wing strain gauge data). There's a need to capture global features for more comprehensive time series representation.

Method: Proposes a hypergraph-based global feature learning and fusion framework that: (1) constructs global features through feature engineering, and (2) learns high-order relationships between local features to capture global features. The framework learns and fuses global features for semantic consistency to enhance SGS time series representation.

Result: The method shows better generalization for unseen data in SGS recognition when validated on industrial SGS and public UCR datasets. The approach improves recognition accuracy compared to methods that only use local features.

Conclusion: Global feature learning and fusion through hypergraphs enhances time series representation and improves classification accuracy, especially for complex industrial data like Strain Gauge Status where local features alone are insufficient.

Abstract: Strain Gauge Status (SGS) time series recognition is crucial in the field of intelligent manufacturing based on the Internet of Things, as accurate identification helps timely detection of failed mechanical components, avoiding accidents. The loading and unloading sequences generated by strain gauges can be identified through time series classification (TSC) algorithms. Recently, deep learning models, e.g., convolutional neural networks (CNNs) have shown remarkable success in the TSC task, as they can extract discriminative local features from the subsequences to identify the time series. However, we observe that only the local features may not be sufficient for expressing the time series, especially when the local sub-sequences between different time series are very similar, e.g., SGS data of aircraft wings in static strength experiments. Nevertheless, CNNs suffer from the limitation in extracting global features due to the nature of convolution operations. For extracting global features to more comprehensively represent the SGS time series, we propose two insights: (i) Constructing global features through feature engineering. (ii) Learning high-order relationships between local features to capture global features. To realize and utilize them, we propose a hypergraph-based global feature learning and fusion framework, which learns and fuses global features for semantic consistency to enhance the representation of SGS time series, thereby improving recognition accuracy. Our method designs are validated on industrial SGS and public UCR datasets, showing better generalization for unseen data in SGS recognition. The code is available at the link https://github.com/Meteor-Stars/GFEF.

[1482] Are Your Generated Instances Truly Useful? GenBench-MILP: A Benchmark Suite for MILP Instance Generation

Yidong Luo, Chenguang Wang, Dong Li, Tianshu Yu

Main category: cs.LG

TL;DR: GenBench-MILP is a benchmark suite for evaluating MILP instance generators across four dimensions: mathematical validity, structural similarity, computational hardness, and downstream utility, with novel analysis of solver-internal features.

Details

Motivation: Current evaluation of machine learning-generated MILP instances relies on superficial metrics that fail to capture true computational complexity, creating a need for comprehensive, standardized evaluation.

Method: Introduces GenBench-MILP benchmark suite with four evaluation dimensions and novel analysis of solver-internal features like root node gaps, heuristic success rates, and cut plane usage to capture solver dynamics.

Result: Experiments show instances with high structural similarity can have drastically different solver interactions and difficulty levels, revealing limitations of current evaluation methods.

Conclusion: GenBench-MILP provides a multifaceted evaluation toolkit for rigorous comparison and development of high-fidelity MILP instance generators.

Abstract: The proliferation of machine learning-based methods for Mixed-Integer Linear Programming (MILP) instance generation has surged, driven by the need for diverse training datasets. However, a critical question remains: Are these generated instances truly useful and realistic? Current evaluation protocols often rely on superficial structural metrics or simple solvability checks, which frequently fail to capture the true computational complexity of real-world problems. To bridge this gap, we introduce GenBench-MILP, a comprehensive benchmark suite designed for the standardized and objective evaluation of MILP generators. Our framework assesses instance quality across four key dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream tasks. A distinctive innovation of GenBench-MILP is the analysis of solver-internal features – including root node gaps, heuristic success rates, and cut plane usage. By treating the solver’s dynamic behavior as an expert assessment, we reveal nuanced computational discrepancies that static graph features miss. Our experiments on instance generative models demonstrate that instances with high structural similarity scores can still exhibit drastically divergent solver interactions and difficulty levels. By providing this multifaceted evaluation toolkit, GenBench-MILP aims to facilitate rigorous comparisons and guide the development of high-fidelity instance generators.

[1483] EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

Kai Yang, Xin Xu, Yangkun Chen, Weijie Liu, Jiafei Lyu, Zichuan Lin, Deheng Ye, Saiyong Yang

Main category: cs.LG

TL;DR: EntroPIC: A method using proportional-integral control to stabilize entropy in RL training of large language models by dynamically adjusting loss coefficients for positive/negative samples.

Details

Motivation: Existing RL methods struggle to maintain appropriate entropy levels during long-term LLM training, as positive and negative samples affect entropy differently across training steps, leading to unstable exploration and potential convergence to sub-optimal solutions.

Method: Proposes Entropy stabilization via Proportional-Integral Control (EntroPIC) that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients to stabilize entropy throughout training.

Result: Experimental results show EntroPIC successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs, with comprehensive theoretical analysis for both on-policy and off-policy learning settings.

Conclusion: EntroPIC effectively controls entropy in large-scale LLM training, ensuring efficient exploration and steady progress by stabilizing entropy through adaptive adjustment of sample influences.

Abstract: Long-term training of large language models (LLMs) requires maintaining stable exploration to prevent the model from collapsing into sub-optimal behaviors. Entropy is crucial in this context, as it controls exploration and helps avoid premature convergence to sub-optimal solutions. However, existing reinforcement learning methods struggle to maintain an appropriate level of entropy, as the training process involves a mix of positive and negative samples, each affecting entropy in different ways across steps. To address this, we propose Entropy stabilization via Proportional-Integral Control (EntroPIC), a novel method that adaptively adjusts the influence of positive and negative samples by dynamically tuning their loss coefficients. This approach stabilizes entropy throughout training, ensuring efficient exploration and steady progress. We provide a comprehensive theoretical analysis for both on-policy and off-policy learning settings, demonstrating that EntroPIC is effective at controlling entropy in large-scale LLM training. Experimental results show that our method successfully maintains desired entropy levels, enabling stable and optimal RL training for LLMs.

[1484] Bayes optimal learning of attention-indexed models

Fabrizio Boncoraglio, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová

Main category: cs.LG

TL;DR: AIM is a theoretical framework for analyzing learning in deep attention layers using statistical mechanics and random matrix theory to derive generalization error predictions and phase transitions.

Details

Motivation: To develop a tractable theoretical framework for understanding learning in self-attention layers, which are key components of modern transformer architectures, by capturing how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings.

Method: Proposes the attention-indexed model (AIM) inspired by multi-index models, allowing full-width key and query matrices. Uses tools from statistical mechanics and random matrix theory to derive closed-form predictions for Bayes-optimal generalization error and identifies phase transitions. Also proposes a matching approximate message passing algorithm.

Result: Derives closed-form predictions for Bayes-optimal generalization error, identifies sharp phase transitions as functions of sample complexity, model width, and sequence length, and shows that gradient descent can reach optimal performance.

Conclusion: AIM provides a solvable theoretical playground for understanding learning in self-attention layers, offering insights into generalization behavior and phase transitions in transformer architectures.

Abstract: We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, that are key components of modern architectures.

[1485] Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs

Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li

Main category: cs.LG

TL;DR: Low-rank compression of LLMs maintains performance but has mixed trustworthiness effects: improves privacy and robustness, degrades fairness and ethical reasoning.

Details

Motivation: While low-rank compression effectively reduces LLM size for resource-constrained deployment, its impact on trustworthiness aspects (privacy, robustness, fairness, ethics) remains unexplored, creating a critical research gap.

Method: Comprehensive evaluation of multiple LLMs compressed with diverse low-rank algorithms, assessing trustworthiness across privacy, adversarial robustness, fairness, and ethical alignment. Includes analysis of model scale and fine-tuning effects, plus gradient-based attribution to identify layers contributing to robustness.

Result: Low-rank compression preserves/improves training data privacy but weakens PII protection; adversarial robustness is preserved/enhanced; ethical reasoning degrades in zero-shot but recovers with few-shot prompting; fairness declines under compression.

Conclusion: Low-rank compression has complex trustworthiness trade-offs: while beneficial for privacy and robustness, it harms fairness and ethical reasoning, requiring careful consideration in deployment strategies.

Abstract: Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Model compression addresses this challenge, with low-rank factorization emerging as a particularly effective method for reducing size, memory, and computation while maintaining accuracy. However, while these compressed models boast of benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, fairness, and ethical alignment. We evaluate multiple LLMs of different sizes and variants compressed with diverse low-rank algorithms, revealing key insights: (1) low-rank compression preserves or improves training data privacy but weakens PII protection during conversation; (2) adversarial robustness is generally preserved and often enhanced, even under deep compression; (3) ethical reasoning degrades in zero-shot settings but partially recovers with few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness, as both are important in low-rank methods. To guide trustworthy compression strategies, we end our paper with a gradient-based attribution analysis to identify which layers in LLMs contribute most to adversarial robustness.

[1486] Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization

Zhe Li, Bicheng Ying, Zidong Liu, Chaosheng Dong, Haibo Yang

Main category: cs.LG

TL;DR: HiSo: Hessian-informed zeroth-order federated optimization for LLM fine-tuning that accelerates convergence using diagonal Hessian approximations while maintaining scalar-only communication.

Details

Motivation: Existing zeroth-order federated learning methods overlook curvature information despite its benefits for convergence acceleration, creating a gap between theoretical worst-case bounds and practical performance.

Method: Proposes HiSo, a Hessian-informed ZO federated optimization method that leverages global diagonal Hessian approximations to accelerate convergence while preserving scalar-only communication without transmitting second-order information.

Result: HiSo achieves 1-5× speedup in communication rounds over state-of-the-art ZO-FL baselines across diverse LLM fine-tuning benchmarks, with theoretical analysis showing accelerated convergence independent of Lipschitz constant and model dimension.

Conclusion: Hessian information acts as an effective accelerator in federated ZO optimization, providing both theoretical justification and empirical evidence for faster convergence than worst-case bounds.

Abstract: Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning of large language models (LLMs) due to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for convergence acceleration. To address this, we propose HiSo, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication without transmitting any second-order information. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate that is independent of the Lipschitz constant $L$ and model dimension $d$ under some Hessian approximation assumptions, offering a plausible explanation for the observed phenomenon of ZO convergence being much faster than its worst-case $\mathscr{O}(d)$-bound. Empirically, across diverse LLM fine-tuning benchmarks, HiSo delivers a 1$\sim$5$\times$ speedup in communication rounds over existing state-of-the-art ZO-FL baselines. This superior convergence not only cuts communication costs but also provides strong empirical evidence that Hessian information acts as an effective accelerator in federated ZO optimization settings. Our source code is provided at https://github.com/ZidongLiu/DeComFL.

[1487] Reusing Trajectories in Policy Gradients Enables Fast Convergence

Alessandro Montenegro, Federico Mansutti, Marco Mussi, Matteo Papini, Alberto Maria Metelli

Main category: cs.LG

TL;DR: RT-PG algorithm reuses past off-policy trajectories to accelerate policy gradient convergence, achieving best-known sample complexity of Õ(ε⁻¹) when reusing all past data.

Details

Motivation: Policy gradient methods are sample-inefficient due to reliance on fresh on-policy data, requiring O(ε⁻²) trajectories. While gradient reuse has been studied, trajectory reuse remains theoretically unexplored despite its intuitive potential for efficiency gains.

Method: Proposes RT-PG (Reusing Trajectories - Policy Gradient) algorithm that uses power mean-corrected multiple importance weighting estimator to combine on-policy and off-policy data from the most recent ω iterations.

Result: RT-PG achieves sample complexity of Õ(ε⁻²ω⁻¹), leading to Õ(ε⁻¹) when reusing all available past trajectories - the best known rate for PG methods. Empirical validation shows effectiveness against baselines.

Conclusion: Reusing past off-policy trajectories can significantly accelerate policy gradient convergence, with RT-PG providing theoretical and empirical evidence for this approach achieving state-of-the-art sample complexity.

Abstract: Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $O(ε^{-2})$ trajectories to reach an $ε$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $O(ε^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $ω$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{O}(ε^{-2}ω^{-1})$. When reusing all available past trajectories, this leads to a rate of $\widetilde{O}(ε^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.

[1488] TRACE: Grounding Time Series in Context for Multimodal Embedding and Retrieval

Jialin Chen, Ziyu Zhao, Gaukhar Nurbek, Aosong Feng, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying

Main category: cs.LG

TL;DR: TRACE is a multimodal retriever that aligns time-series data with textual context, enabling cross-modal retrieval between text and time-series for improved downstream tasks and foundation model development.

Details

Motivation: Time-series data is ubiquitous in domains like weather, healthcare, and energy, but existing retrieval methods lack semantic grounding, struggle with heterogeneous modality alignment, and have limited capacity for multi-channel signals. There's a growing need for effective cross-modal retrieval to support downstream tasks and time-series foundation models.

Method: TRACE grounds time-series embeddings in aligned textual context, enables fine-grained channel-level alignment, employs hard negative mining for semantically meaningful retrieval, and supports flexible cross-modal retrieval modes (Text-to-Timeseries and Timeseries-to-Text). It serves as both a retrieval engine and a standalone encoder with lightweight task-specific tuning.

Result: TRACE achieves state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains demonstrate its dual utility as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.

Conclusion: TRACE addresses the underexplored area of time-series retrieval by providing a generic multimodal retriever that effectively links linguistic descriptions with complex temporal patterns, improving predictive accuracy and interpretability while maintaining strong cross-modal alignment.

Abstract: The ubiquity of dynamic data in domains such as weather, healthcare, and energy underscores a growing need for effective interpretation and retrieval of time-series data. These data are inherently tied to domain-specific contexts, such as clinical notes or weather narratives, making cross-modal retrieval essential not only for downstream tasks but also for developing robust time-series foundation models by retrieval-augmented generation (RAG). Despite the increasing demand, time-series retrieval remains largely underexplored. Existing methods often lack semantic grounding, struggle to align heterogeneous modalities, and have limited capacity for handling multi-channel signals. To address this gap, we propose TRACE, a generic multimodal retriever that grounds time-series embeddings in aligned textual context. TRACE enables fine-grained channel-level alignment and employs hard negative mining to facilitate semantically meaningful retrieval. It supports flexible cross-modal retrieval modes, including Text-to-Timeseries and Timeseries-to-Text, effectively linking linguistic descriptions with complex temporal patterns. By retrieving semantically relevant pairs, TRACE enriches downstream models with informative context, leading to improved predictive accuracy and interpretability. Beyond a static retrieval engine, TRACE also serves as a powerful standalone encoder, with lightweight task-specific tuning that refines context-aware representations while maintaining strong cross-modal alignment. These representations achieve state-of-the-art performance on downstream forecasting and classification tasks. Extensive experiments across multiple domains highlight its dual utility, as both an effective encoder for downstream applications and a general-purpose retriever to enhance time-series models.

[1489] Dense Associative Memory with Epanechnikov Energy

Benjamin Hoover, Zhaoyang Shi, Krishnakumar Balasubramanian, Dmitry Krotov, Parikshit Ram

Main category: cs.LG

TL;DR: A novel log-sum-ReLU energy function for Dense Associative Memory networks enables exact memory retrieval with exponential capacity and introduces emergent local minima while preserving perfect pattern recovery.

Details

Motivation: To develop a more effective energy function for Dense Associative Memory networks that overcomes limitations of existing approaches like log-sum-exponential, enabling better memory capacity and introducing emergent local minima for potential generative applications.

Method: Proposes log-sum-ReLU (LSR) energy function inspired by optimal kernel density estimation and Epanechnikov kernel, which enables exact memory retrieval with exponential capacity without requiring exponential separation functions.

Result: LSR has significantly more local minima (memories) with comparable log-likelihood to LSE-based models, and analysis reveals emergent memories on image datasets show creativity and novelty.

Conclusion: LSR shows potential for both large-scale memory storage and generative tasks, with emergent local minima representing a previously unseen characteristic in Dense Associative Memory literature.

Abstract: We propose a novel energy function for Dense Associative Memory (DenseAM) networks, the log-sum-ReLU (LSR), inspired by optimal kernel density estimation. Unlike the common log-sum-exponential (LSE) function, LSR is based on the Epanechnikov kernel and enables exact memory retrieval with exponential capacity without requiring exponential separation functions. Moreover, it introduces abundant additional \emph{emergent} local minima while preserving perfect pattern recovery – a characteristic previously unseen in DenseAM literature. Empirical results show that LSR energy has significantly more local minima (memories) that have comparable log-likelihood to LSE-based models. Analysis of LSR’s emergent memories on image datasets reveals a degree of creativity and novelty, hinting at this method’s potential for both large-scale memory storage and generative tasks.

Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta

Main category: cs.LG

TL;DR: A novel pruning method for LLMs that uses efficient 1-swap optimization to minimize per-layer pruning error with equal sparsity per row, improving performance over existing methods.

Details

Motivation: Traditional pruning methods for LLMs are suboptimal - magnitude pruning doesn't work well on Transformers, and full retraining is too expensive. Existing approaches use approximations/heuristics for the combinatorial mask selection problem, which becomes infeasible at LLM scale.

Method: Decouple pruning rows by enforcing equal sparsity per row, derive optimal 1-swaps (exchanging one kept and one pruned weight) computable efficiently via Gram matrix. Propose simple 1-swap algorithm that warmstarts from any pruning mask, runs efficiently on GPUs at LLM scale, and is hyperparameter-free.

Result: Reduces per-layer pruning error by up to 60% over Wanda (state-of-the-art), consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.

Conclusion: The mask selection problem for LLM pruning can be made drastically more tractable through row-wise equal sparsity constraints and efficient 1-swap optimization, providing significant improvements over existing methods without requiring expensive retraining.

Abstract: The resource requirements of neural networks can be significantly reduced through pruning - the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often prohibitive and classical approaches such as magnitude pruning are suboptimal on Transformers. State-of-the-art methods hence solve a layer-wise mask selection problem: finding a pruning mask that minimizes per-layer pruning error on a small set of calibration data. Exactly solving this problem is computationally infeasible due to its combinatorial nature and the size of the search space, and existing approaches rely on approximations or heuristics. We demonstrate that the mask selection problem can be made drastically more tractable at LLM scale. To that end, we decouple the rows by enforcing equal sparsity levels per row. This allows us to derive optimal 1-swaps (exchanging one kept and one pruned weight) computable efficiently via the Gram matrix. We propose a simple 1-swap algorithm that warmstarts from any pruning mask, runs efficiently on GPUs at LLM scale, and is essentially hyperparameter-free. Our approach reduces per-layer pruning error by up to 60% over Wanda (Sun et al., 2024) and consistently improves perplexity and zero-shot accuracy across state-of-the-art GPT architectures.

[1491] Hybrid Meta-learners for Estimating Heterogeneous Treatment Effects

Zhongyuan Liang, Lars van der Laan, Ahmed Alaa

Main category: cs.LG

TL;DR: Proposes Hybrid Learner (H-learner) for CATE estimation that interpolates between direct and indirect regularization approaches to combine their strengths.

Details

Motivation: Existing meta-learners for CATE estimation have complementary strengths: indirect learners work well when potential outcome functions are simple, while direct learners excel when CATE is simpler than individual outcomes. Neither consistently outperforms the other across all scenarios.

Method: Introduces H-learner that learns intermediate functions whose difference approximates CATE without requiring accurate individual approximations of potential outcomes. This allows suboptimal fits to potential outcomes to improve bias-variance tradeoff for CATE estimation.

Result: H-learner consistently operates at the Pareto frontier, effectively combining strengths of both direct and indirect meta-learners on semi-synthetic and real-world benchmark datasets.

Conclusion: The hybrid regularization approach provides a flexible framework that adapts to dataset characteristics, outperforming existing methods by balancing between direct and indirect regularization strategies.

Abstract: Estimating conditional average treatment effects (CATE) from observational data involves modeling decisions that differ from supervised learning, particularly concerning how to regularize model complexity. Previous approaches can be grouped into two primary “meta-learner” paradigms that impose distinct inductive biases. Indirect meta-learners first fit and regularize separate potential outcome (PO) models and then estimate CATE by taking their difference, whereas direct meta-learners construct and directly regularize estimators for the CATE function itself. Neither approach consistently outperforms the other across all scenarios: indirect learners perform well when the PO functions are simple, while direct learners outperform when the CATE is simpler than individual PO functions. In this paper, we introduce the Hybrid Learner (H-learner), a novel regularization strategy that interpolates between the direct and indirect regularizations depending on the dataset at hand. The H-learner achieves this by learning intermediate functions whose difference closely approximates the CATE without necessarily requiring accurate individual approximations of the POs themselves. We demonstrate that intentionally allowing suboptimal fits to the POs improves the bias-variance tradeoff in estimating CATE. Experiments conducted on semi-synthetic and real-world benchmark datasets illustrate that the H-learner consistently operates at the Pareto frontier, effectively combining the strengths of both direct and indirect meta-learners.

[1492] WebSTAR: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, Xia Song

Main category: cs.LG

TL;DR: WebSTAR: A scalable data synthesis pipeline for computer use agents that transforms noisy rollouts into reliable supervision via step-level filtering and reasoning augmentation, creating 13.3K trajectories and 100K graded steps for training multimodal models that achieve state-of-the-art performance.

Details

Motivation: Computer use agents (CUAs) are difficult to train due to high GUI interaction costs and scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations which limit scalability, while synthetic data from strong CUAs contains too much noise for effective imitation learning.

Method: Introduces a scalable data synthesis pipeline with step-level filtering that evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Uses this to create WebSTAR dataset from OpenAI’s computer-use-preview model, then trains Qwen-2.5-VL-Instruct models. Also creates WebSCORE dataset of graded step-level actions and trains StepRM, a 7B multimodal process reward model distilled from o4-mini.

Result: The 7B model trained on WebSTAR surpasses state-of-the-art open-source CUA model UI-TARS-1.5-7B by more than 15% on WebVoyager with only supervised finetuning. StepRM matches o4-mini’s grading quality while being far more efficient to deploy at scale.

Conclusion: Step-level filtering is established as a key principle for scalable CUA training. The paper provides practical tools including two new datasets (WebSTAR, WebSCORE) and a lightweight process reward model (StepRM) to advance robust and efficient computer use agents.

Abstract: Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions consisting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI’s computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal process reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight process reward model (StepRM) as practical tools to advance robust and efficient CUAs.

[1493] Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

Roger Creus Castanyer, Johan Obando-Ceron, Lu Li, Pierre-Luc Bacon, Glen Berseth, Aaron Courville, Pablo Samuel Castro

Main category: cs.LG

TL;DR: Deep RL scaling challenges stem from non-stationarity combined with gradient pathologies from suboptimal architectures; simple interventions stabilize gradient flow for robust performance at scale.

Details

Motivation: Scaling deep reinforcement learning networks often degrades performance, but the root causes remain poorly understood. Existing solutions are complex and don't highlight underlying causes.

Method: Conduct empirical analyses to identify that non-stationarity combined with gradient pathologies from suboptimal architectures cause scaling issues. Propose simple interventions to stabilize gradient flow that are compatible with established algorithms.

Result: The interventions enable robust performance across a range of network depths and widths, resulting in strong performance even at large scales, validated on various agents and environment suites.

Conclusion: The combination of non-stationarity and gradient pathologies underlies RL scaling challenges, and simple architectural interventions can stabilize gradient flow for effective large-scale performance.

Abstract: Scaling deep reinforcement learning networks is challenging and often results in degraded performance, yet the root causes of this failure mode remain poorly understood. Several recent works have proposed mechanisms to address this, but they are often complex and fail to highlight the causes underlying this difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of non-stationarity with gradient pathologies, due to suboptimal architectural choices, underlie the challenges of scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance across a range of network depths and widths. Our interventions are simple to implement and compatible with well-established algorithms, and result in an effective mechanism that enables strong performance even at large scales. We validate our findings on a variety of agents and suites of environments.

[1494] PIS: A Generalized Physical Inversion Solver for Arbitrary Sparse Observations via Set Conditioned Flow Matching

Weijie Yang, Xun Zhang, Simin Jiang, Yubao Zhou

Main category: cs.LG

TL;DR: PIS is a framework using Set-Conditioned Flow Matching with Cosine-Annealed Sparsity Curriculum for stable physical parameter inversion from sparse, irregular sensor data with orders-of-magnitude speedup.

Details

Motivation: Traditional methods for estimating high-dimensional physical parameters from PDE-constrained, sparse measurements face accuracy and efficiency bottlenecks, especially with real-world sensor placement constraints.

Method: Physical Inversion Solver (PIS) combines Set-Conditioned Flow Matching with Cosine-Annealed Sparsity Curriculum (CASC) to enable stable inversion from arbitrary off-grid sensors with minimal guidance, using straight-path transport for efficiency.

Result: PIS achieves 88.7% error reduction under extreme sparsity (<1%), offers orders-of-magnitude speedup with instantaneous inference (50 NFEs), and provides robust uncertainty quantification for optimal sensor placement across subsurface characterization, wave-based characterization, and structural health monitoring.

Conclusion: PIS provides a unified framework for efficient and accurate physical parameter inversion from sparse measurements with practical applications in various domains requiring sensor-based monitoring.

Abstract: The estimation of high-dimensional physical parameters constrained by partial differential equations (PDEs) from limited and indirect measurements is a highly ill-posed problem. Traditional methods face significant accuracy and efficiency bottlenecks, particularly when observations are sparse, irregularly sampled, and constrained by real-world sensor placement. We propose the Physical Inversion Solver (PIS), a unified framework that couples Set-Conditioned Flow Matching with a Cosine-Annealed Sparsity Curriculum (CASC) to enable stable inversion from arbitrary, off-grid sensors even under minimal guidance. By leveraging straight-path transport, PIS achieves instantaneous inference (50 NFEs), offering orders-of-magnitude speedup over iterative baselines. Extensive experiments demonstrate that PIS reduces error by up to 88.7% under extreme sparsity (<1%) across subsurface characterization, wave-based characterization, and structural health monitoring, while providing robust uncertainty quantification for optimal sensor placement.

[1495] Cooperative Sheaf Neural Networks

André Ribeiro, Ana Luiza Tenório, Juan Belieni, Amauri H. Souza, Diego Mesquita

Main category: cs.LG

TL;DR: Sheaf diffusion on directed graphs enables cooperative message passing, overcoming limitations of existing sheaf methods that lack directionality and cooperative behavior.

Details

Motivation: Existing sheaf diffusion methods fail to achieve cooperative behavior (where nodes independently choose whether to propagate/gather information) due to lack of message directionality, limiting their flexibility in graph representation learning.

Method: Introduces cellular sheaves over directed graphs with in- and out-degree Laplacians, then proposes Cooperative Sheaf Neural Networks (CSNNs) that leverage this construction for cooperative message passing.

Result: CSNNs show overall better performance compared to prior sheaf diffusion and cooperative graph neural networks, with theoretical analysis showing they allow nodes to selectively attend to arbitrarily far nodes while ignoring others.

Conclusion: Sheaf diffusion can be extended to directed graphs to achieve cooperative behavior, addressing limitations of existing methods and potentially mitigating oversquashing in graph neural networks.

Abstract: Sheaf diffusion has recently emerged as a promising design pattern for graph representation learning due to its inherent ability to handle heterophilic data and avoid oversmoothing. Meanwhile, cooperative message passing has also been proposed as a way to enhance the flexibility of information diffusion by allowing nodes to independently choose whether to propagate/gather information from/to neighbors. A natural question ensues: is sheaf diffusion capable of exhibiting this cooperative behavior? Here, we provide a negative answer to this question. In particular, we show that existing sheaf diffusion methods fail to achieve cooperative behavior due to the lack of message directionality. To circumvent this limitation, we introduce the notion of cellular sheaves over directed graphs and characterize their in- and out-degree Laplacians. We leverage our construction to propose Cooperative Sheaf Neural Networks (CSNNs). Theoretically, we characterize the receptive field of CSNN and show it allows nodes to selectively attend (listen) to arbitrarily far nodes while ignoring all others in their path, potentially mitigating oversquashing. Our experiments show that CSNN presents overall better performance compared to prior art on sheaf diffusion as well as cooperative graph neural networks.

[1496] Performative Policy Gradient: Optimality in Performative Reinforcement Learning

Debabrota Basu, Udvas Das, Brahim Driss, Uddalak Mukherjee

Main category: cs.LG

TL;DR: PePG is the first policy gradient algorithm for performative RL that converges to performatively optimal policies, accounting for distribution shifts caused by the algorithm itself.

Details

Motivation: Standard RL methods ignore that deployed algorithms influence their environments, causing distribution shifts. While performative aspects have been studied in supervised learning, RL counterparts remain under-explored, especially for achieving performative optimality rather than just stability.

Method: Proves performative counterparts of performance difference lemma and policy gradient theorem, then introduces Performative Policy Gradient (PePG) algorithm. Analyzes convergence under softmax parametrization with and without entropy regularization.

Result: PePG converges to performatively optimal policies (policies that remain optimal under self-induced distribution shifts), outperforming existing performative RL algorithms that only achieve stability.

Conclusion: PePG is the first policy gradient algorithm designed for performative RL that achieves performative optimality, significantly advancing beyond prior stability-focused approaches.

Abstract: Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that the standard reinforcement learning (RL) methods ignore. While designing optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove the performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, and also with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts induced by themselves. Thus, PePG significantly extends the prior works in Performative RL that achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validate that PePG outperforms the existing performative RL algorithms aiming for stability.

[1497] Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Haotian Ye, Huayu Chen, Yisong Yue, Stefano Ermon

Main category: cs.LG

TL;DR: A novel offline preference optimization method for discrete diffusion models that decomposes trajectory alignment into stepwise objectives by matching per-step posteriors, enabling efficient optimization compatible with arbitrary reward functions.

Details

Motivation: Discrete diffusion models show promise for sequence data but need improvement through alignment with rewards, similar to RL in language models. Current methods apply rewards only on final outputs and backpropagate through entire denoising, which is inefficient.

Method: Proposes an offline preference optimization framework that decomposes trajectory alignment into stepwise objectives by matching per-step posterior distributions. This enables efficient diffusion optimization compatible with arbitrary reward functions and yields equivalent optimal solution under additive factorization of trajectory reward.

Result: Experiments across DNA sequence design, protein inverse folding, and language modeling show superiority over RL-based baselines: up to 12% improvement in predicted activity on DNA sequences, and GSM8K score improvement from 78.6 to 81.2 on LLaDA-8B-Instruct.

Conclusion: The proposed stepwise alignment method efficiently optimizes discrete diffusion models for various sequence domains, outperforming RL-based approaches while being compatible with arbitrary reward functions.

Abstract: Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose an offline preference optimization method to approach trajectory alignment for discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire denoising process, we decompose the problem into a set of stepwise alignment objectives by matching the per-step posterior. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, yields an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 81.2 on LLaDA-8B-Instruct for language modeling.

[1498] Resolving Extreme Data Scarcity by Explicit Physics Integration: An Application to Groundwater Heat Transport

Julia Pelzer, Corné Verburg, Alexander Heinlein, Miriam Schulte

Main category: cs.LG

TL;DR: LGCNN combines lightweight numerical modeling for global transport with convolutional neural networks for local processes to solve advection-diffusion problems in data-scarce settings, demonstrated on geothermal heat pump modeling.

Details

Motivation: Real-world flow applications in scientific/engineering domains face challenges with large spatial domains, high resolution requirements, and material heterogeneities. Classical simulations are computationally expensive, while ML-based surrogates need large datasets that are often unavailable in practice.

Method: Decomposes advection-diffusion problems into multiscale processes of locally and globally dominated components. Proposes Local-Global Convolutional Neural Network (LGCNN) that combines: 1) lightweight numerical model for global transport, and 2) two convolutional neural networks for local processes.

Result: LGCNN generalizes to arbitrarily larger domains even when trained on fewer than five simulations. Successfully transferred to real subsurface parameter maps from the Munich region, Germany for city-scale geothermal heat pump interaction modeling.

Conclusion: The proposed LGCNN approach effectively addresses data-scarce settings for complex flow problems by separating local and global components, enabling efficient surrogate modeling with minimal training data.

Abstract: Real-world flow applications in complex scientific and engineering domains, such as geosciences, challenge classical simulation methods due to large spatial domains, high spatio-temporal resolution requirements, and potentially strong material heterogeneities that lead to ill-conditioning and long runtimes. While machine learning-based surrogate models can reduce computational cost, they typically rely on large training datasets that are often unavailable in practice. To address data-scarce settings, we revisit the structure of advection-diffusion problems and decompose them into multiscale processes of locally and globally dominated components, separating spatially localized interactions and long-range effects. We propose a Local-Global Convolutional Neural Network (LGCNN) that combines a lightweight numerical model for global transport with two convolutional neural networks addressing processes of a more local nature. We demonstrate the performance of our method on city-scale geothermal heat pump interaction modeling and show that, even when trained on fewer than five simulations, LGCNN generalizes to arbitrarily larger domains, and can be successfully transferred to real subsurface parameter maps from the Munich region, Germany.

[1499] Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model

Nathan Kallus

Main category: cs.LG

TL;DR: The paper proposes a robust preference alignment method for LLMs that doesn’t assume a known link function between preferences and latent rewards, addressing misspecification bias through semiparametric modeling and convergence guarantees.

Details

Motivation: Traditional preference alignment for LLMs assumes a known link function (e.g., Bradley-Terry) between observed preferences and latent rewards. Misspecification of this link can bias inferred rewards and misalign learned policies. The authors aim to develop methods robust to unknown and unrestricted link functions.

Method: The authors study preference alignment under an unknown link function, showing that realizability of f-divergence-constrained reward maximization induces a semiparametric single-index binary choice model. They develop preference optimization algorithms robust to the unknown link, focusing on policy learning with implicit reward functions rather than estimating structural parameters. They prove convergence guarantees using generic function complexity measures.

Result: The paper develops theoretical convergence guarantees for preference optimization algorithms robust to unknown link functions. Empirical demonstrations on LLM alignment show the effectiveness of their approach. Code is publicly available.

Conclusion: The proposed method provides a robust approach to LLM preference alignment that doesn’t require assuming a specific link function, addressing potential misspecification bias while maintaining theoretical guarantees for policy learning.

Abstract: Aligning large language models (LLMs) to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., a logistic Bradley-Terry link). Misspecification of this link can bias inferred rewards and misalign learned policies. We study preference alignment under an unknown and unrestricted link function. We show that realizability of $f$-divergence-constrained reward maximization in a policy class induces a semiparametric single-index binary choice model, where a scalar policy-dependent index captures all dependence on demonstrations and the remaining preference distribution is unrestricted. Rather than assuming this model has identifiable finite-dimensional structural parameters and estimating them, as in econometrics, we focus on policy learning with the reward function implicit, analyzing error to the optimal policy and allowing for unidentifiable nonparametric indices. We develop preference optimization algorithms robust to the unknown link and prove convergence guarantees in terms of generic function complexity measures. We demonstrate this empirically on LLM alignment. Code is available at https://github.com/causalml/spo/

[1500] SynCoGen: Synthesizable 3D Molecule Generation via Joint Reaction and Coordinate Modeling

Andrei Rekesh, Miruna Cretu, Dmytro Shevchuk, Vignesh Ram Somnath, Pietro Liò, Robert A. Batey, Mike Tyers, Michał Koziarski, Cheng-Hao Liu

Main category: cs.LG

TL;DR: SynCoGen is a framework for synthesizable 3D molecule generation that combines masked graph diffusion and flow matching to sample from joint distributions of molecular building blocks, chemical reactions, and atomic coordinates.

Details

Motivation: Synthesizability is a critical bottleneck in generative molecular design, especially when extending from 2D graphs to 3D geometry-based conditional generation, which remains largely unexplored.

Method: SynCoGen uses a single framework combining simultaneous masked graph diffusion and flow matching. It samples from joint distributions of molecular building blocks, chemical reactions, and atomic coordinates. The model was trained on SynSpace, a dataset family with over 1.2M synthesis-aware building block graphs and 7.5M conformers.

Result: Achieves state-of-the-art performance in unconditional small molecule graph and conformer co-generation. For protein ligand generation, delivers superior performance in molecular linker design and pharmacophore-conditioned generation across diverse targets without relying on scoring functions.

Conclusion: The multimodal non-autoregressive formulation represents a foundation for molecular design applications including analog expansion, lead optimization, and direct de novo design.

Abstract: Synthesizability remains a critical bottleneck in generative molecular design. While recent advances have addressed synthesizability in 2D graphs, extending these constraints to 3D for geometry-based conditional generation remains largely unexplored. In this work, we present SynCoGen (Synthesizable Co-Generation), a single framework that combines simultaneous masked graph diffusion and flow matching for synthesizable 3D molecule generation. SynCoGen samples from the joint distribution of molecular building blocks, chemical reactions, and atomic coordinates. To train the model, we curated SynSpace, a dataset family containing over 1.2M synthesis-aware building block graphs and 7.5M conformers. SynCoGen achieves state-of-the-art performance in unconditional small molecule graph and conformer co-generation. For protein ligand generation in drug discovery, the amortized model delivers superior performance in both molecular linker design and pharmacophore-conditioned generation across diverse targets without relying on any scoring functions. Overall, this multimodal non-autoregressive formulation represents a foundation for a range of molecular design applications, including analog expansion, lead optimization, and direct de novo design.

[1501] EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs

Chama Bensmail

Main category: cs.LG

TL;DR: EvoXplain is a diagnostic framework that measures stability of model explanations across repeated training, revealing that high-accuracy models can have multiple distinct explanatory modes rather than a single coherent explanation.

Details

Motivation: The paper challenges the assumption that high predictive accuracy implies correct and trustworthy explanations, questioning whether models achieving similar accuracy rely on the same internal logic or different competing mechanisms.

Method: EvoXplain treats explanations as samples drawn from training and model selection pipelines without aggregating predictions or constructing ensembles, examining whether these samples form coherent explanations or separate into multiple structured explanatory modes.

Result: On Breast Cancer and COMPAS datasets using Logistic Regression and Random Forests, explanations frequently exhibit clear multimodality even for models assumed to be stable, with distinct explanation modes coexisting at near-identical hyperparameter configurations.

Conclusion: EvoXplain makes explanatory instability visible and quantifiable, reframing interpretability as a property of a model class under repeated instantiation rather than of any single trained model.

Abstract: Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. This assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different and potentially competing mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing the explanation of a single trained model, EvoXplain treats explanations as samples drawn from the training and model selection pipeline itself, without aggregating predictions or constructing ensembles. It examines whether these samples form a single coherent explanation or separate into multiple structured explanatory modes. We evaluate EvoXplain on the Breast Cancer and COMPAS datasets using Logistic Regression and Random Forests. Although all models achieve high predictive accuracy, their explanations frequently exhibit clear multimodality. Even models commonly assumed to be stable, such as Logistic Regression, can give rise to distinct explanation modes under repeated training on the same data split. Crucially, these modes can coexist at near-identical hyperparameter configurations, indicating explanation non-identifiability rather than smooth sensitivity to regularisation strength. EvoXplain does not attempt to select a correct explanation. Instead, it makes explanatory instability visible and quantifiable, revealing when single-instance or averaged explanations obscure the existence of multiple underlying mechanisms. More broadly, EvoXplain reframes interpretability as a property of a model class under repeated instantiation, rather than of any single trained model.

[1502] Is This Predictor More Informative than Another? A Decision-Theoretical Comparison

Yiding Feng, Liuhan Qian, Wei Tang

Main category: cs.LG

TL;DR: The paper introduces the “informativeness gap” framework to compare probabilistic predictors when neither is guaranteed to be well-calibrated, focusing on maximizing usefulness for downstream decision-making tasks.

Details

Motivation: In real-world applications, model providers must choose which predictive model to deploy for downstream decision-makers with diverse payoff objectives, but existing comparison methods assume calibration or don't account for varied decision tasks.

Method: Introduces the informativeness gap between predictors as maximum normalized payoff advantage across all decision-making tasks, provides dual characterization yielding a natural informativeness measure (relaxed earth mover’s distance), and shows sample-efficient estimation in prediction-only access setting.

Result: The framework generalizes existing notions (U-Calibration, Calibration Decision Loss, Blackwell informativeness), satisfies completeness and soundness desiderata, and experiments with LLM-based forecasters show it offers more decision-relevant evaluation than traditional metrics.

Conclusion: The informativeness gap provides a principled framework for comparing miscalibrated predictors and evaluating how calibration post-processing affects downstream decision usefulness, offering a more decision-relevant alternative to traditional evaluation metrics.

Abstract: In many real-world applications, a model provider provides probabilistic forecasts to downstream decision-makers who use them to make decisions under diverse payoff objectives. The provider may have access to multiple predictive models, each potentially miscalibrated, and must choose which model to deploy in order to maximize the usefulness of predictions for downstream decisions. A central challenge arises: how can the provider meaningfully compare two predictors when neither is guaranteed to be well-calibrated, and when the relevant decision tasks may differ across users and contexts? To answer this, our first contribution introduces the notion of the informativeness gap between any two predictors, defined as the maximum normalized payoff advantage one predictor offers over the other across all decision-making tasks. Our framework strictly generalizes several existing notions: it subsumes U-Calibration and Calibration Decision Loss, which compare a miscalibrated predictor to its calibrated counterpart, and it recovers Blackwell informativeness as a special case when both predictors are perfectly calibrated. Our second contribution is a dual characterization of the informativeness gap, which gives rise to a natural informativeness measure that can be viewed as a relaxed variant of the earth mover’s distance between two prediction distributions. We show that this measure satisfies natural desiderata: it is complete and sound, and it can be estimated sample-efficiently in the prediction-only access setting. We complement our theory with experiments on LLM-based forecasters in real-world prediction tasks, showing that the informativeness gap offers a more decision-relevant alternative to traditional metrics, and provides a principled lens for evaluating how ad hoc calibration post-processing affects downstream decision usefulness.

[1503] High-Dimensional Search, Low-Dimensional Solution: Decoupling Optimization from Representation

Yusuf Kalyoncuoglu, Ratmir Miftachov

Main category: cs.LG

TL;DR: Paper shows neural network representations can be compressed 16x via random projections with minimal performance loss, revealing intrinsic low-dimensional solution manifolds that enable efficient model distillation.

Details

Motivation: Current state-of-the-art models use massive parameter widths despite having low intrinsic dimensionality, creating redundancy. The authors investigate whether this redundancy serves optimization search rather than final representation quality.

Method: Decouple solution geometry using data-independent random projections to compress representations of ResNet, ViT, and BERT models. Compare with PCA and learned baselines to validate robustness of solution manifolds.

Result: Representations can be compressed up to 16x with only ~1% performance degradation. Random projections achieve parity with PCA and learned baselines, confirming solution manifolds are intrinsically robust.

Conclusion: Establishes foundation for Subspace-Native Distillation paradigm where student models target intrinsic low-dimensional manifolds directly, enabling “Train Big, Deploy Small” vision by bypassing high-dimensional optimization bottlenecks.

Abstract: State-of-the-art models rely on massive widths despite exhibiting low Intrinsic Dimension (ID). We posit that this redundancy serves the non-convex optimization search rather than the final representation. We validate this hypothesis by decoupling the solution geometry via data-independent random projections, demonstrating that ResNet, ViT, and BERT representations can be compressed by up to 16x with negligible performance degradation of around 1%. Notably, these oblivious projections achieve parity with PCA and learned baselines, confirming the solution manifold is intrinsically robust. These findings establish the foundation for Subspace-Native Distillation: a paradigm where student models target this intrinsic manifold directly, bypassing the high-dimensional optimization bottleneck to realize the vision of “Train Big, Deploy Small”

[1504] Spectral Bellman Method: Unifying Representation and Exploration in RL

Ofir Nabati, Bo Dai, Shie Mannor, Guy Tennenholtz

Main category: cs.LG

TL;DR: Spectral Bellman Method: A novel representation learning framework for RL that aligns features with Bellman dynamics through spectral analysis of value function covariance.

Details

Motivation: Existing representation learning methods in RL are often derived from model-learning perspectives, which misaligns them with the actual RL task. There's a need for representation learning that directly aligns with the fundamental structure of Bellman updates.

Method: Introduces Spectral Bellman Method based on Inherent Bellman Error (IBE) condition. Uses spectral relationship between Bellman operator transformations of value function distributions and feature covariance structure. Learns state-action features that capture Bellman-aligned covariance through simple algorithm modifications.

Result: Learned representations enable structured exploration by aligning feature covariance with Bellman dynamics, improving performance in hard-exploration and long-horizon tasks. Framework extends to multi-step Bellman operators.

Conclusion: Provides theoretically-grounded representation learning directly suited for value-based RL, offering principled path toward more powerful and structurally sound representations.

Abstract: Representation learning is critical to the empirical and theoretical success of reinforcement learning. However, many existing methods are induced from model-learning aspects, misaligning them with the RL task in hand. This work introduces the Spectral Bellman Method, a novel framework derived from the Inherent Bellman Error (IBE) condition. It aligns representation learning with the fundamental structure of Bellman updates across a \textit{space} of possible value functions, making it directly suited for value-based RL. Our key insight is a fundamental spectral relationship: under the zero-IBE condition, the transformation of a \textit{distribution} of value functions by the Bellman operator is intrinsically linked to the feature covariance structure. This connection yields a new, theoretically-grounded objective for learning state-action features that capture this Bellman-aligned covariance, requiring only a simple modification to existing algorithms. We demonstrate that our learned representations enable structured exploration by aligning feature covariance with Bellman dynamics, improving performance in hard-exploration and long-horizon tasks. Our framework naturally extends to multi-step Bellman operators, offering a principled path toward learning more powerful and structurally sound representations for value-based RL.

[1505] Exploring the Limitations of kNN Noisy Feature Detection and Recovery for Self-Driving Labs

Qiuyu Shi, Kangming Li, Yao Fehlis, Runze Zhang, Daniel Persaud, Robert Black, Jason Hattrick-Simpers

Main category: cs.LG

TL;DR: Automated workflow for detecting and correcting noisy features in materials discovery datasets, with systematic study of factors affecting detection and recovery performance.

Details

Motivation: Self-driving laboratories accelerate materials discovery but suffer from input parameter errors that corrupt features and compromise modeling. Need automated methods to detect and correct noisy features to improve data quality and experimental precision.

Method: Develops automated workflow to: 1) detect noisy features, 2) determine which sample-feature pairings can be corrected, 3) recover correct feature values. Uses systematic study examining dataset size, noise intensity, noise type, and feature value distribution effects on detectability and recoverability across DFT and SDL datasets.

Result: High-intensity noise and large training datasets improve detection and correction. Low-intensity noise reduces performance but can be compensated by larger clean datasets. Continuous/dispersed feature distributions show greater recoverability than discrete/narrow distributions. Provides benchmark for kNN imputation in materials datasets.

Conclusion: Demonstrates model-agnostic framework for rational data recovery under noise, limited data, and varying feature distributions. Aims to enhance data quality and experimental precision in automated materials discovery.

Abstract: Self-driving laboratories (SDLs) have shown promise to accelerate materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, noise type, and feature value distribution affect both the detectability and recoverability of noisy features on both Density Functional Theory (DFT) and SDL datasets. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery but can be compensated for by larger clean training data sets. Detection and correction results vary between features, with continuous and dispersed feature distributions showing greater recoverability compared to features with discrete or narrow distributions. This systematic study not only demonstrates a model agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation in materials datasets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.

[1506] Sparse identification of nonlinear dynamics with library optimization mechanism: Recursive long-term prediction perspective

Ansei Yonezawa, Heisei Yonezawa, Shuichi Yahagi, Itsuro Kajiwara, Shinya Kijimoto, Hikaru Taniuchi, Kentaro Murakami

Main category: cs.LG

TL;DR: SINDy-LOM combines sparse regression with library optimization to discover dynamical system equations, using parameterized basis functions and recursive long-term prediction accuracy for better model reliability.

Details

Motivation: Traditional SINDy requires manual library design (set of candidate basis functions), which is challenging for many dynamical systems. The paper aims to automate library design and improve model reliability beyond one-step-ahead prediction.

Method: Proposes SINDy-LOM with two-layer optimization: inner layer performs sparse regression to find linear combinations of basis functions, outer layer optimizes parameterized basis functions using recursive long-term prediction accuracy as objective.

Result: SINDy-LOM produces parsimonious closed-form models with good interpretability, reduces user burden in library design, and improves reliability through recursive long-term prediction perspective compared to traditional SINDy.

Conclusion: The library optimization mechanism successfully automates library design and improves model reliability, making SINDy more practical for discovering governing equations of dynamical systems from data.

Abstract: The sparse identification of nonlinear dynamics (SINDy) approach can discover the governing equations of dynamical systems based on measurement data, where the dynamical model is identified as the sparse linear combination of the given basis functions. A major challenge in SINDy is the design of a library, which is a set of candidate basis functions, as the appropriate library is not trivial for many dynamical systems. To overcome this difficulty, this study proposes SINDy with library optimization mechanism (SINDy-LOM), which is a combination of the sparse regression technique and the novel learning strategy of the library. In the proposed approach, the basis functions are parametrized. The SINDy-LOM approach involves a two-layer optimization architecture: the inner-layer, in which the data-driven model is extracted as the sparse linear combination of the candidate basis functions, and the outer-layer, in which the basis functions are optimized from the viewpoint of the recursive long-term (RLT) prediction accuracy; thus, the library design is reformulated as the optimization of the parametrized basis functions. The dynamical model obtained by SINDy-LOM has good interpretability and usability, as this approach yields a parsimonious closed-form model. The library optimization mechanism significantly reduces user burden. The RLT perspective improves the reliability of the resulting model compared with the traditional SINDy approach that can only ensure the one-step-ahead prediction accuracy. The effectiveness of the proposed approach is verified through numerical experiments.

[1507] SCAR: State-Space Compression for Scalable AI-Based Network Management of Vehicular Services

Ioan-Sorin Comsa, Purav Shah, Karthik Vaidhyanathan, Deepak Gangadharan, Christof Imhof, Per Bergamin, Aryan Kaushik, Gabriel-Miro Muntean, Ramona Trestian

Main category: cs.LG

TL;DR: SCAR is an edge-assisted framework using ML-based compression (clustering and RBF networks) to reduce CQI state dimensionality for scalable RL-based network management in vehicular systems, improving feasibility and fairness.

Details

Motivation: Traditional network management struggles with high-volume, rapidly varying CQI data in dynamic vehicular environments, requiring scalable AI solutions for connected vehicular services.

Method: Proposes SCAR framework with ML-based compression (clustering and RBF networks) to reduce CQI state dimensionality, then uses compressed states to train RL policies for network management with fairness objectives.

Result: SCAR increases time in feasible management regions by 14%, reduces unfair service allocation by 15%, and SAST-based clustering reduces compression distortion by 10% compared to RL baselines on uncompressed states.

Conclusion: SCAR enables scalable and fair AI-assisted network management in dynamic vehicular systems through effective state-space compression and RL-based policy optimization.

Abstract: The increasing demand for connected vehicular services poses significant challenges for AI-based network and service management due to the high volume and rapid variability of network state information. Traditional management and control mechanisms struggle to scale when processing fine-grained metrics such as Channel Quality Indicators (CQIs) in dynamic vehicular environments. To address this challenge, we propose SCAR (State-Space Compression for AI-Based Network Management), an edge-assisted framework that improves scalability and fairness in vehicular services through network state abstraction. SCAR employs machine-learning (ML)-based compression techniques, including clustering and radial basis function (RBF) networks, to reduce the dimensionality of CQI-derived state information while preserving essential features relevant to management decisions. The resulting compressed states are used to train reinforcement learning (RL)-based management policies that aim to maximize network efficiency while satisfying service-level fairness objectives defined by the NGMN. Simulation results show that SCAR increases the time spent in feasible management regions by 14% and reduces unfair service allocation time by 15% compared to reinforcement learning baselines operating on uncompressed state information. Furthermore, simulated annealing with stochastic tunneling (SAST)-based clustering reduces state compression distortion by 10%, confirming the effectiveness of the proposed approach. These results demonstrate that SCAR enables scalable and fair AI-assisted network and service management in dynamic vehicular systems.

[1508] Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems

Jihyun Lim, Junhyuk Jo, Chanhyeok Ko, Young Min Go, Jimin Hwa, Sunwoo Lee

Main category: cs.LG

TL;DR: Local SGD with intentional bias in data sampling and model aggregation enables efficient parallel training on heterogeneous systems (CPUs + GPUs), achieving up to 32x speedup over synchronous SGD with comparable accuracy.

Details

Motivation: Traditional parallel training methods assume homogeneous computing resources, but heterogeneous systems (mix of CPUs and GPUs) are common. Synchronous data-parallel SGD suffers from synchronization overhead under heterogeneous workloads, forcing practitioners to use only the fastest devices.

Method: Uses local SGD with intentionally introduced bias in data sampling and model aggregation to harmonize slower CPUs with faster GPUs. The method carefully controls bias to balance computational efficiency and model accuracy.

Result: Achieves up to 32x faster training than synchronous SGD with 2 CPUs and 8 GPUs on ResNet20/CIFAR-10, while maintaining nearly identical accuracy. The method demonstrates practical acceleration while achieving comparable or even higher accuracy than synchronous SGD under the same epoch budget.

Conclusion: Intentional bias in local SGD enables flexible utilization of diverse compute resources for deep learning, providing a practical solution for training on heterogeneous systems without sacrificing accuracy.

Abstract: Most parallel neural network training methods assume homogeneous computing resources. For example, synchronous data-parallel SGD suffers from significant synchronization overhead under heterogeneous workloads, often forcing practitioners to rely only on the fastest devices (e.g., GPUs). In this work, we study local SGD for efficient parallel training on heterogeneous systems. We show that intentionally introducing bias in data sampling and model aggregation can effectively harmonize slower CPUs with faster GPUs. Our extensive empirical results demonstrate that a carefully controlled bias significantly accelerates local SGD while achieving comparable or even higher accuracy than synchronous SGD under the same epoch budget. For instance, our method trains ResNet20 on CIFAR-10 with 2 CPUs and 8 GPUs up to 32x faster than synchronous SGD, with nearly identical accuracy. These results provide practical insights into how to flexibly utilize diverse compute resources for deep learning.

[1509] Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization

Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus

Main category: cs.LG

TL;DR: Novel graph classification approach using tabularized graph data via modified Weisfeiler-Leman algorithms, achieving performance comparable to graph transformers while being 40-60x faster without GPU requirements.

Details

Motivation: To develop efficient graph classification methods that avoid the computational overhead and hyperparameter sensitivity of graph neural networks and transformers, while maintaining competitive predictive performance.

Method: Tabularizes graph data using new variants of the Weisfeiler-Leman algorithm with modified logical frameworks, then applies standard tabular data methods for classification.

Result: Achieves better predictive performance than graph neural networks and matches graph transformers on 14 benchmark datasets, while being 40-60x faster and requiring no GPU or extensive hyperparameter tuning.

Conclusion: The approach provides an efficient, hardware-agnostic alternative to deep learning methods for graph classification with competitive performance and minimal computational requirements.

Abstract: We present a novel approach for graph classification based on tabularizing graph data via new variants of the Weisfeiler-Leman algorithm and then applying methods for tabular data. We investigate a comprehensive class of versions of the Weisfeiler-Leman algorithm obtained by modifying the underlying logical framework and establish a precise theoretical characterization of their expressive power. We then test selected versions on 14 benchmark datasets that span a range of application domains. The experiments demonstrate that our approach generally achieves better predictive performance than graph neural networks and matches that of graph transformers, while being 40-60x faster and requiring neither a GPU nor extensive hyperparameter tuning.

[1510] FLARE: Fast Low-rank Attention Routing Engine

Vedant Puri, Aditya Joglekar, Sri Datta Ganesh Bandreddi, Kevin Ferguson, Yu-hsuan Chen, Yongjie Jessica Zhang, Levent Burak Kara

Main category: cs.LG

TL;DR: FLARE introduces a low-rank attention routing mechanism using latent tokens to reduce quadratic complexity, enabling efficient processing of long sequences while maintaining performance.

Details

Motivation: The quadratic complexity of self-attention in transformers limits scalability on long sequences, creating a need for efficient attention mechanisms that can handle large-scale data like million-point meshes.

Method: FLARE uses a token-mixing operator that routes information through a small set of latent tokens, creating low-rank attention via encode-decode factorization with only two standard SDPA calls. It assigns disjoint latent slices to each attention head for head-specific low-rank pathways.

Result: FLARE scales to one-million-point unstructured meshes on a single GPU, achieves SOTA accuracy on PDE surrogate benchmarks, outperforms general-purpose efficient-attention methods on Long Range Arena, and includes a large-scale additive manufacturing benchmark dataset.

Conclusion: FLARE provides an efficient attention mechanism that maintains performance while enabling transformer scalability on extremely long sequences through low-rank routing and compatibility with fused attention kernels.

Abstract: The quadratic complexity of self-attention limits the scalability of transformers on long sequences. We introduce Fast Low-rank Attention Routing Engine (FLARE), a token-mixing operator that realizes low-rank attention by routing information through a small set of latent tokens. Each layer induces an input-input token mixing matrix of rank at most $M$ via a minimal encode-decode factorization implemented using only two standard scaled dot-product attention (SDPA) calls. Because the dominant ${O}(NM)$ computation is expressed purely in terms of standard SDPA, FLARE is compatible with fused attention kernels and avoids materializing $M\times N$ projection matrices. FLARE further assigns disjoint latent slices to each attention head, yielding a mixture of head-specific low-rank pathways. Empirically, FLARE scales to one-million-point unstructured meshes on a single GPU, achieves state-of-the-art accuracy on PDE surrogate benchmarks, and outperforms general-purpose efficient-attention methods on the Long Range Arena suite. We additionally release a large-scale additive manufacturing benchmark dataset. Our code is available at https://github.com/vpuri3/FLARE.py.

[1511] Clinical Data Goes MEDS? Let’s OWL make sense of it

Alberto Marfoglia, Jong Ho Jhee, Adrien Coulet

Main category: cs.LG

TL;DR: MEDS-OWL: An OWL ontology that bridges the Medical Event Data Standard (MEDS) with Semantic Web technologies, enabling FAIR-aligned RDF representation of clinical event data.

Details

Motivation: Healthcare data lacks standardized, semantically explicit representations, limiting interoperability and reproducibility. MEDS addresses this but lacks Semantic Web integration.

Method: Developed MEDS-OWL ontology with formal concepts/relations for MEDS datasets as RDF graphs, plus meds2rdf Python library for conversion. Evaluated on synthetic clinical cohort and MIMIC-IV subset with SHACL validation.

Result: First release includes 13 classes, 10 object properties, 20 data properties, 24 axioms. Enables FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data.

Conclusion: Bridges MEDS with Semantic Web, providing reusable semantic layer for event-based clinical data and foundation for graph-based analytics.

Abstract: The application of machine learning on healthcare data is often hindered by the lack of standardized and semantically explicit representation, leading to limited interoperability and reproducibility across datasets and experiments. The Medical Event Data Standard (MEDS) addresses these issues by introducing a minimal, event-centric data model designed for reproducible machine-learning workflows from health data. However, MEDS is defined as a data-format specification and does not natively provide integration with the Semantic Web ecosystem. In this article, we introduce MEDS-OWL, a lightweight OWL ontology that provides formal concepts and relations to represent MEDS datasets as RDF graphs. Additionally, we implemented meds2rdf, a Python conversion library that transforms MEDS events into RDF graphs, ensuring conformance with the ontology. We evaluate the proposed approach on two datasets: a synthetic clinical cohort describing care pathways for ruptured intracranial aneurysms, and a real-world subset of MIMIC-IV. To assess semantic consistency, we performed a SHACL validation against the resulting knowledge graphs. The first release of MEDS-OWL comprises 13 classes, 10 object properties, 20 data properties, and 24 OWL axioms. Combined with meds2rdf, it enables data transformation into FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data. By bridging MEDS with the Semantic Web, this work contributes a reusable semantic layer for event-based clinical data and establishes a robust foundation for subsequent graph-based analytics.

[1512] Constructing 3D Rotational Invariance and Equivariance with Symmetric Tensor Networks

Meng Zhang, Chao Wang, Hao Zhang, Shaojun Dong, Lixin He

Main category: cs.LG

TL;DR: Systematic framework for constructing continuous rotationally invariant and equivariant functions using symmetric tensor networks for geometric deep learning

Details

Motivation: Symmetry-aware architectures are fundamental to geometric deep learning, but there's a need for systematic approaches to construct continuous rotationally invariant and equivariant functions that can handle various tensor representations

Method: Uses symmetric tensor networks to construct invariant maps, obtains equivariant maps via differentiation, supports both Cartesian and spherical tensors of different ranks/types, and derives general continuous equivariant maps from vector inputs to tensor outputs

Result: Provides a unified framework that clarifies how common equivariant primitives in geometric graph neural networks arise within the construction, enabling systematic generation of symmetry-aware functions

Conclusion: The tensor network approach offers a systematic method for building continuous rotationally invariant and equivariant functions in geometric deep learning, connecting to existing equivariant architectures

Abstract: Symmetry-aware architectures are central to geometric deep learning. We present a systematic approach for constructing continuous rotationally invariant and equivariant functions using symmetric tensor networks. The proposed framework supports inputs and outputs given as a tuple of Cartesian tensors of different rank as well as spherical tensors of different type. We introduce tensor network generators for invariant maps and obtain equivariant maps via differentiation. Specifically, we derive general continuous equivariant maps from vector inputs to Cartesian or spherical tensor output. Finally, we clarify how common equivariant primitives in geometric graph neural networks arise within our construction.

[1513] Longitudinal Progression Prediction of Alzheimer’s Disease with Tabular Foundation Model

Yilang Ding, Jiawen Ren, Jiaying Lu, Gloria Hyunjung Kwak, Armin Iraji, Shengpu Tang, Alex Fedorov

Main category: cs.LG

TL;DR: L2C-TabPFN method uses longitudinal-to-cross-sectional transformation with tabular foundation model to predict Alzheimer’s disease outcomes from multimodal clinical data, achieving state-of-the-art results for ventricular volume prediction.

Details

Motivation: Alzheimer's disease prediction is challenging due to multifactorial etiology and complex multimodal clinical data. Accurate forecasting of clinically relevant biomarkers is essential for monitoring disease progression.

Method: Integrates longitudinal-to-cross-sectional (L2C) transformation with pre-trained Tabular Foundation Model (TabPFN) using TADPOLE dataset. Converts sequential patient records into fixed-length feature vectors for predicting diagnosis, cognitive scores, and ventricular volume.

Result: Competitive performance on diagnostic and cognitive outcomes, with state-of-the-art results in ventricular volume prediction - a key imaging biomarker reflecting neurodegeneration in Alzheimer’s disease.

Conclusion: Highlights potential of tabular foundational models for advancing longitudinal prediction of clinically relevant imaging markers in Alzheimer’s disease.

Abstract: Alzheimer’s disease is a progressive neurodegenerative disorder that remains challenging to predict due to its multifactorial etiology and the complexity of multimodal clinical data. Accurate forecasting of clinically relevant biomarkers, including diagnostic and quantitative measures, is essential for effective monitoring of disease progression. This work introduces L2C-TabPFN, a method that integrates a longitudinal-to-cross-sectional (L2C) transformation with a pre-trained Tabular Foundation Model (TabPFN) to predict Alzheimer’s disease outcomes using the TADPOLE dataset. L2C-TabPFN converts sequential patient records into fixed-length feature vectors, enabling robust prediction of diagnosis, cognitive scores, and ventricular volume. Experimental results demonstrate that, while L2C-TabPFN achieves competitive performance on diagnostic and cognitive outcomes, it provides state-of-the-art results in ventricular volume prediction. This key imaging biomarker reflects neurodegeneration and progression in Alzheimer’s disease. These findings highlight the potential of tabular foundational models for advancing longitudinal prediction of clinically relevant imaging markers in Alzheimer’s disease.

[1514] SimMerge: Learning to Select Merge Operators from Similarity Signals

Oliver Bolton, Aakanksha, Arash Ahmadian, Sara Hooker, Marzieh Fadaee, Beyza Ermis

Main category: cs.LG

TL;DR: SimMerge: A predictive method for selecting high-performing model merges using task-agnostic similarity signals, enabling efficient merge operator, order, and model subset selection without iterative evaluation.

Details

Motivation: Model merging is powerful for LLM development but scaling is challenging due to expensive merge-and-evaluate searches required to determine optimal merge operators, model subsets, and merge orders.

Method: SimMerge uses inexpensive, task-agnostic similarity signals between models extracted from a small set of unlabeled probes. It predicts performance of candidate two-way merges using functional and structural features, enabling selection without iterative evaluation.

Result: SimMerge consistently outperforms best fixed merge operators across 7B-parameter LLMs, generalizes to multi-way merges and 111B-parameter LLMs without retraining, and supports online addition of new tasks and operators via bandit variant.

Conclusion: Learning how to merge enables scalable model composition when checkpoint catalogs are large and evaluation budgets are limited, making model merging more practical for LLM development.

Abstract: Model merging combines multiple models into a single model with aggregated capabilities, making it a powerful tool for large language model (LLM) development. However, scaling model merging is challenging: performance depends on the choice of merge operator, model subset, and merge order, often requiring expensive merge-and-evaluate searches. In this work, we introduce SimMerge, a predictive merge-selection method that identifies high-performing merges using inexpensive, task-agnostic similarity signals between models. Given a small set of unlabeled probes, SimMerge extracts functional and structural features to predict the performance of candidate two-way merges, enabling merge operator, order and model subset selection without iterative evaluation. We show that SimMerge consistently outperforms the best fixed merge operator across 7B-parameter LLMs and generalizes to multi-way merges and 111B-parameter LLMs without retraining. We further introduce a bandit variant that supports adding new tasks and operators online. Our results suggest that learning how to merge enables scalable model composition when checkpoint catalogs are large and evaluation budgets are limited.

[1515] FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning

Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan

Main category: cs.LG

TL;DR: FastTTS is a serving system that enables efficient Test-Time Scaling for edge LLM agents by optimizing memory usage and execution scheduling to match cloud-model accuracy with low latency on resource-constrained devices.

Details

Motivation: Edge deployment of LLM agents is needed for privacy, offline use, and responsive interaction, but memory constraints limit deployment to smaller LLMs with weaker reasoning capabilities. Test-Time Scaling can enhance reasoning but introduces heavy performance overhead on resource-constrained edge devices.

Method: FastTTS introduces three techniques: 1) Speculative Beam Extension to mitigate system stragglers from irregular reasoning paths, 2) Asymmetric Multi-Model Memory Allocation to balance memory between token generation and reasoning-step verification, and 3) Dynamic Prefix-Aware Scheduling to maximize KV-cache reuse across search paths.

Result: FastTTS achieves 2.2x higher goodput and reduces latency by 38%-68% compared to vLLM baseline, enabling edge LLMs on consumer GPUs to match cloud-model accuracy and latency.

Conclusion: FastTTS enables practical deployment of agentic AI on edge devices by making Test-Time Scaling efficient and low-latency, democratizing access to advanced reasoning capabilities.

Abstract: Recent advances in reasoning Large Language Models (LLMs) are driving the emergence of agentic AI systems. Edge deployment of LLM agents near end users is increasingly necessary to protect data privacy, enable offline use, and provide responsive interaction with local context. However, strict memory constraints on edge devices limit deployment to smaller LLMs, whose reasoning capabilities are much weaker than those of large cloud models, hindering practical deployment of edge agentic AI. Test-Time Scaling (TTS) offers a promising solution by allocating more compute during inference to enhance the reasoning capability of edge LLMs. However, current TTS methods introduce heavy hardware performance overhead on resource-constrained devices, making them impractical for real applications. To address this challenge, we present FastTTS, a serving system that enables fast and efficient TTS for memory-constrained LLM reasoning. After analyzing common patterns across various TTS methods and identifying their performance bottlenecks, we introduce three novel techniques: i) Speculative Beam Extension, which mitigates system stragglers caused by irregular reasoning paths, ii) Asymmetric Multi-Model Memory Allocation, which dynamically balances memory usage between token generation and reasoning-step verification, and iii) Dynamic Prefix-Aware Scheduling, which optimizes reasoning execution to maximize KV-cache reuse across search paths. FastTTS offers a plug-and-play third-party library on top of vLLM, enabling edge LLMs on a single consumer GPU to match cloud-model accuracy and cloud-measured latency. Comprehensive evaluation shows that FastTTS achieves an average 2.2x higher goodput and reduces latency by 38%–68% compared to the vLLM baseline; it pushes the boundaries of low-latency TTS on memory-constrained edge devices and highlights the potential for democratizing agentic AI.

[1516] A Learnable Wavelet Transformer for Long-Short Equity Trading and Risk-Adjusted Return Optimization

Shuozhe Li, Du Cheng, Leqi Liu

Main category: cs.LG

TL;DR: WaveLSFormer: A learnable wavelet-based Transformer for intraday trading that combines multi-scale decomposition with return-oriented decision learning for financial time series.

Details

Motivation: Intraday trading from financial time series is challenging due to noise, non-stationarity, and cross-sectional asset dependence. Existing methods struggle with these complexities.

Method: Uses learnable wavelet front-end for multi-scale decomposition with spectral regularizers, low-guided high-frequency injection module for multi-scale fusion, and Transformer backbone for decision learning with risk-aware portfolio optimization.

Result: Outperforms MLP, LSTM and Transformer baselines across six industry groups, achieving cumulative return of 0.607 ± 0.045 and Sharpe ratio of 2.157 ± 0.166.

Conclusion: WaveLSFormer effectively handles financial time series complexities through joint multi-scale decomposition and decision learning, improving both profitability and risk-adjusted returns.

Abstract: Learning profitable intraday trading policies from financial time series is challenging due to heavy noise, non-stationarity, and strong cross-sectional dependence among related assets. We propose \emph{WaveLSFormer}, a learnable wavelet-based long-short Transformer that jointly performs multi-scale decomposition and return-oriented decision learning. Specifically, a learnable wavelet front-end generates low-/high-frequency components via an end-to-end trained filter bank, guided by spectral regularizers that encourage stable and well-separated frequency bands. To fuse multi-scale information, we introduce a low-guided high-frequency injection (LGHI) module that refines low-frequency representations with high-frequency cues while controlling training stability. The model outputs a portfolio of long/short positions that is rescaled to satisfy a fixed risk budget and is optimized directly with a trading objective and risk-aware regularization. Extensive experiments on five years of hourly data across six industry groups, evaluated over ten random seeds, demonstrate that WaveLSFormer consistently outperforms MLP, LSTM and Transformer backbones, with and without fixed discrete wavelet front-ends. On average in all industries, WaveLSFormer achieves a cumulative overall strategy return of $0.607 \pm 0.045$ and a Sharpe ratio of $2.157 \pm 0.166$, substantially improving both profitability and risk-adjusted returns over the strongest baselines.

[1517] Finance-Grounded Optimization For Algorithmic Trading

Kasymkhan Khubiev, Mikhail Semenov, Irina Podlipnova, Dinara Khubieva

Main category: cs.LG

TL;DR: Proposes financially-grounded loss functions (Sharpe ratio, PnL, Maximum Drawdown) and turnover regularization for deep learning in finance, outperforming traditional MSE for trading tasks.

Details

Motivation: Deep learning has limitations in finance where interpretability and domain-specific metrics matter. Traditional loss functions like MSE don't align with financial performance metrics used by specialists.

Method: Introduces loss functions derived from financial metrics (Sharpe ratio, Profit-and-Loss, Maximum Drawdown) and proposes turnover regularization to constrain position turnover within limits.

Result: The proposed financially-grounded loss functions with turnover regularization outperform traditional mean squared error loss for return prediction tasks when evaluated using algorithmic trading metrics.

Conclusion: Financially-grounded metrics enhance predictive performance in trading strategies and portfolio optimization, making deep learning more suitable for financial applications.

Abstract: Deep Learning is evolving fast and integrates into various domains. Finance is a challenging field for deep learning, especially in the case of interpretable artificial intelligence (AI). Although classical approaches perform very well with natural language processing, computer vision, and forecasting, they are not perfect for the financial world, in which specialists use different metrics to evaluate model performance. We first introduce financially grounded loss functions derived from key quantitative finance metrics, including the Sharpe ratio, Profit-and-Loss (PnL), and Maximum Draw down. Additionally, we propose turnover regularization, a method that inherently constrains the turnover of generated positions within predefined limits. Our findings demonstrate that the proposed loss functions, in conjunction with turnover regularization, outperform the traditional mean squared error loss for return prediction tasks when evaluated using algorithmic trading metrics. The study shows that financially grounded metrics enhance predictive performance in trading strategies and portfolio optimization.

[1518] GeoDynamics: A Geometric State-Space Neural Network for Understanding Brain Dynamics on Riemannian Manifolds

Tingting Dan, Jiaqi Ding, Guorong Wu

Main category: cs.LG

TL;DR: GeoDynamics is a geometric state-space neural network that models brain connectivity dynamics on Riemannian manifolds, learning latent state trajectories from symmetric positive definite matrices for neuroscience and action recognition applications.

Details

Motivation: Current state-space models for brain dynamics treat neural data in Euclidean space or impose oversimplified network priors, failing to capture the true geometric nature of functional connectivity matrices which reside on Riemannian manifolds. There's a need for models that respect the intrinsic geometry of brain connectivity data.

Method: GeoDynamics embeds functional connectivity matrices (SPD matrices) into a manifold-aware recurrent framework that operates directly on the high-dimensional SPD manifold. It learns smooth, geometry-respecting transitions to track latent brain-state trajectories, capturing how coordinated networks evolve over time.

Result: The model successfully reveals task-driven state changes and identifies early markers of neurological disorders including Alzheimer’s disease, Parkinson’s disease, and autism. It also demonstrates scalability and robustness on human action recognition benchmarks (UTKinect, Florence, HDM05), showing effectiveness across diverse domains.

Conclusion: GeoDynamics provides a principled geometric framework for modeling brain dynamics that respects the intrinsic manifold structure of functional connectivity data, offering insights into both normal cognitive processes and pathological conditions while being applicable to broader spatiotemporal modeling tasks.

Abstract: State-space models (SSMs) have become a cornerstone for unraveling brain dynamics, revealing how latent neural states evolve over time and give rise to observed signals. By combining the flexibility of deep learning with the principled dynamical structure of SSMs, recent studies have achieved powerful fits to functional neuroimaging data. However, most existing approaches still view the brain as a set of loosely connected regions or impose oversimplified network priors, falling short of a truly holistic and self-organized dynamical system perspective. Brain functional connectivity (FC) at each time point naturally forms a symmetric positive definite (SPD) matrix, which resides on a curved Riemannian manifold rather than in Euclidean space. Capturing the trajectories of these SPD matrices is key to understanding how coordinated networks support cognition and behavior. To this end, we introduce GeoDynamics, a geometric state-space neural network that tracks latent brain-state trajectories directly on the high-dimensional SPD manifold. GeoDynamics embeds each connectivity matrix into a manifold-aware recurrent framework, learning smooth and geometry-respecting transitions that reveal task-driven state changes and early markers of Alzheimer’s disease, Parkinson’s disease, and autism. Beyond neuroscience, we validate GeoDynamics on human action recognition benchmarks (UTKinect, Florence, HDM05), demonstrating its scalability and robustness in modeling complex spatiotemporal dynamics across diverse domains.

[1519] Small Vectors, Big Effects: A Mechanistic Study of RL-Induced Reasoning via Steering Vectors

Viacheslav Sinii, Nikita Balagansky, Gleb Gerasimov, Daniil Laptev, Yaroslav Aksenov, Vadim Kurochkin, Alexey Gorbatovski, Boris Shaposhnikov, Daniil Gavrilov

Main category: cs.LG

TL;DR: Lightweight steering vectors trained with RL can explain most of fine-tuning performance gains in LLMs while preserving interpretability, revealing how reasoning training reshapes internal computations through specific layer mechanisms.

Details

Motivation: The paper aims to understand how reasoning training reshapes LLMs' internal computations, which remains unclear despite the effectiveness of fine-tuning for reasoning tasks.

Method: Researchers insert lightweight steering vectors into the base model’s residual stream and train them with a reinforcement-learning objective, then analyze how these vectors affect different layers’ computations while preserving interpretability.

Result: The steering vectors explain a large portion of full fine-tuning performance increase. Last-layer vectors act like token-substitution bias on first generated tokens, penultimate-layer vectors operate through MLP and unembedding to up-weight process words, and vectors transfer to other models from the same family.

Conclusion: The results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.

Abstract: The mechanisms by which reasoning training reshapes LLMs’ internal computations remain unclear. We study lightweight steering vectors inserted into the base model’s residual stream and trained with a reinforcement-learning objective. These vectors explain a large portion of full fine-tuning performance increase while preserving the interpretability of small, additive interventions. We find that (i) the last-layer steering vector acts like a token-substitution bias concentrated on the first generated token, consistently boosting tokens such as “To” and “Step”; (ii) the penultimate-layer vector leaves attention patterns largely intact and instead operates through the MLP and unembedding, preferentially up-weighting process words and structure symbols; and (iii) the steering vectors transfer to other models from the same family. Taken together, these results deepen understanding of how trained steering vectors shape computation and should inform future work in activation engineering and the study of reasoning models.

[1520] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten

Main category: cs.LG

TL;DR: Compute as Teacher (CaT) framework uses inference compute as supervision for RL training by generating parallel rollouts, aggregating them into pseudo-reference answers, and deriving rewards through self-proposed rubrics, enabling learning without ground truth labels in non-verifiable domains.

Details

Motivation: The paper addresses the challenge of obtaining learning signals in post-training when there is no ground truth available, particularly in non-verifiable domains like healthcare guidance where no programmatic checker exists. Traditional RL requires human labels or verifiable rewards, which are scarce or unavailable in many real-world applications.

Method: CaT framework has two components: (1) Reference estimation that aggregates parallel rollouts into a pseudo-reference answer (using synthesis or other aggregators), and (2) Reward derivation that converts pseudo-references into RL rewards using self-proposed rubrics - binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge.

Result: On HealthBench, CaT-trained models match or exceed inference-time aggregation quality while using 9x less test-time compute. CaT competes with learning from expert physician annotations, yielding up to +30% relative improvement over initial policy. The framework also works with verifiable rewards, matching best baselines on MATH-500 in test-time RL.

Conclusion: Compute as Teacher demonstrates that inference compute itself can serve as supervision for RL training without human labels, enabling learning in non-verifiable domains. The framework is versatile across both verifiable and non-verifiable domains and reduces test-time compute requirements while maintaining or improving performance.

Abstract: Where do learning signals come from when there is no ground truth in post-training? We show that inference compute itself can serve as supervision. By generating parallel rollouts and converting them into reference estimates, models can learn without human labels-critically, even in non-verifiable domains like healthcare guidance where no programmatic checker exists. We call this framework Compute as Teacher (CaT) and it turns inference-time compute from parallel rollouts into supervision for RL training. The framework has two components: (1) reference estimation which aggregates rollouts into a pseudo-reference answer, and (2) reward derivation which converts that pseudo-reference into RL rewards. For (1), we explore a simple method we call synthesis, but the framework admits any aggregator. For (2), we introduce self-proposed rubrics for non-verifiable domains. These are binary, auditable criteria generated from the pseudo-reference and scored by an LLM judge. On HealthBench, models trained with CaT match or exceed inference-time aggregation quality while using 9x less test-time compute. Here, CaT also competes with learning from expert physician annotations, yielding up to +30% relative improvement over the initial policy. The framework extends naturally to verifiable rewards, matching the best existing baselines on MATH-500 in test-time RL and demonstrating ‘drop-in’ versatility across both types of domains.

[1521] Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Yuval Ran-Milo, Yotam Alexander, Shahar Mendel, Nadav Cohen

Main category: cs.LG

TL;DR: Transformers trained with RL on sparse rewards can spontaneously develop Chain-of-Thought reasoning, and this paper analyzes the mechanism through policy gradient dynamics on a synthetic graph traversal task.

Details

Motivation: To understand the mechanism by which sparse rewards drive policy gradient to discover systematic reasoning steps (Chain-of-Thought) in Transformers, which remains poorly understood despite empirical observations.

Method: Analyze policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task requiring intermediate reasoning. Use theoretical analysis to prove convergence to structured algorithms and identify distributional requirements. Validate with experiments on synthetic data and real-world language models on mathematical reasoning tasks.

Result: Proved that policy gradient drives Transformers to converge to structured, interpretable algorithms for iterative graph traversal despite training only on final-answer correctness. Identified critical role of “simple examples” (instances requiring fewer reasoning steps) in the training distribution for enabling generalizable reasoning strategies.

Conclusion: The emergence of Chain-of-Thought reasoning in RL-trained Transformers depends crucially on training distributions containing sufficient simple examples, which enable learning of generalizable reasoning strategies that extrapolate to more complex problems.

Abstract: Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of “simple examples”: instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate our theoretical results through experiments on synthetic data and with real-world language models on mathematical reasoning tasks, validating that our theoretical findings carry over to practical settings.

[1522] DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables

Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Xingjian Wu, Bin Yang, Jilin Hu

Main category: cs.LG

TL;DR: DAG: Dual correlation network for time series forecasting with exogenous variables using temporal and channel correlation modules

Details

Motivation: Existing time series forecasting methods with exogenous variables fail to leverage future exogenous variables and don't fully account for correlations between endogenous and exogenous variables, limiting predictive accuracy.

Method: Proposes DAG with Temporal Correlation Module and Channel Correlation Module, each containing correlation discovery and correlation injection submodules to capture relationships between historical/future exogenous variables and historical endogenous variables.

Result: Not specified in abstract, but method claims to better leverage exogenous variables, especially future ones, for improved forecasting accuracy.

Conclusion: DAG addresses limitations of existing TSF-X methods by effectively utilizing both historical and future exogenous variables through dual correlation modeling.

Abstract: Time series forecasting is essential in various domains. Compared to relying solely on endogenous variables (i.e., target variables), considering exogenous variables (i.e., covariates) provides additional predictive information and often leads to more accurate predictions. However, existing methods for time series forecasting with exogenous variables (TSF-X) have the following shortcomings: 1) they do not leverage future exogenous variables, 2) they fail to fully account for the correlation between endogenous and exogenous variables. In this study, to better leverage exogenous variables, especially future exogenous variables, we propose DAG, which utilizes Dual correlAtion network along both the temporal and channel dimensions for time series forecasting with exoGenous variables. Specifically, we propose two core components: the Temporal Correlation Module and the Channel Correlation Module. Both modules consist of a correlation discovery submodule and a correlation injection submodule. The former is designed to capture the correlation effects of historical exogenous variables on future exogenous variables and on historical endogenous variables, respectively. The latter injects the discovered correlation relationships into the processes of forecasting future endogenous variables based on historical endogenous variables and future exogenous variables.

[1523] Why Inference in Large Models Becomes Decomposable After Training

Jidong Jin

Main category: cs.LG

TL;DR: Post-training inference optimization by identifying and removing statistically unsupported parameter dependencies through structural annealing, enabling parallel inference without modifying model functionality.

Details

Motivation: Current inference in large AI models uses dense parameter matrices, leading to unsustainable scaling of inference cost and system complexity. The authors argue this isn't due to insufficient model capacity but from treating inference systems as monolithic operators while ignoring internal structures formed during learning.

Method: The authors show gradient updates in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from initialization. They introduce a post-training statistical criterion and structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures.

Result: The approach enables structured, parallel inference without modifying model functionality or interfaces, establishing a post-training, model-agnostic structural view of inference systems.

Conclusion: Large models have inherent structural decomposability that can be exploited for efficient inference through post-training analysis and optimization, addressing the unsustainable scaling of inference costs.

Abstract: Inference in large-scale AI models is typically performed on dense parameter matrices, leading to inference cost and system complexity that scale unsustainably with model size. This limitation does not arise from insufficient model capacity, but from treating post-training inference systems as monolithic operators while ignoring internal structures formed during learning. We show that gradient update events in large models are highly localized and selective, leaving many parameter dependencies statistically indistinguishable from their initialization distribution after training. As a result, post-training inference systems are structurally non-uniform and inherently decomposable. Based on this observation, we introduce a post-training statistical criterion and a structural annealing procedure that removes unsupported dependencies and reveals stable, independent substructures. This work establishes a post-training, model-agnostic structural view of inference systems and enables structured, parallel inference without modifying model functionality or interfaces.

[1524] The Multi-Query Paradox in Zeroth-Order Optimization

Wei Lin, Qingyu Song, Hong Xu

Main category: cs.LG

TL;DR: ZO optimization query allocation analysis reveals single-query optimal for averaging but multi-query optimal for projection alignment method.

Details

Motivation: ZO optimization faces a fundamental trade-off: under fixed query budget, queries per iteration vs total iterations are inversely proportional. How to best allocate queries is an under-explored question.

Method: Analyze two aggregation methods: simple averaging (ZO-Avg) and new Projection Alignment method (ZO-Align) derived from local surrogate minimization. Derive convergence rates making query dependence explicit across convex, non-convex, and stochastic settings.

Result: Stark dichotomy: ZO-Avg is always query-inefficient with more than one query per iteration (single-query optimal). ZO-Align generally performs better with more queries per iteration (full-subspace estimation optimal). Multi-query problem reduces to choice between two classic algorithms dictated by aggregation method.

Conclusion: The work systematically resolves query allocation problem in ZO optimization, showing optimal strategy depends on aggregation method, not intermediate query size. Theoretical findings validated by extensive experiments.

Abstract: Zeroth-order (ZO) optimization provides a powerful framework for problems where explicit gradients are unavailable and have to be approximated using only queries to function value. The prevalent single-query approach is simple, but suffers from high estimation variance, motivating a multi-query paradigm to improve estimation accuracy. This, however, creates a critical trade-off: under a fixed budget of queries (i.e. cost), queries per iteration and the total number of optimization iterations are inversely proportional to one another. How to best allocate this budget is a fundamental, under-explored question. This work systematically resolves this query allocation problem. We analyze two aggregation methods: the de facto simple averaging (ZO-Avg), and a new Projection Alignment method (ZO-Align) we derive from local surrogate minimization. By deriving convergence rates for both methods that make the dependence on the number of queries explicit across strongly convex, convex, non-convex, and stochastic settings, we uncover a stark dichotomy: For ZO-Avg, we prove that using more than one query per iteration is always query-inefficient, rendering the single-query approach optimal. On the contrary, ZO-Align generally performs better with more queries per iteration, resulting in a full-subspace estimation as the optimal approach. Thus, our work clarifies that the multi-query problem boils down to a choice not about an intermediate query size, but between two classic algorithms, a choice dictated entirely by the aggregation method used. These theoretical findings are also consistently validated by extensive experiments.

[1525] A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, Michael Shvartsman

Main category: cs.LG

TL;DR: Critical sharpness is a computationally efficient curvature measure that captures Hessian sharpness phenomena like progressive sharpening and Edge of Stability, enabling analysis of training dynamics for large language models up to 7B parameters.

Details

Motivation: Direct measurement of Hessian sharpness is computationally prohibitive for Large Language Models, limiting analysis of training dynamics. There's a need for scalable curvature measures to understand phenomena like progressive sharpening and Edge of Stability in large-scale training.

Method: Introduces critical sharpness (λ_c), a computationally efficient measure requiring fewer than 10 forward passes given the update direction Δθ. Also introduces relative critical sharpness (λ_c^{1→2}) to quantify curvature of one loss landscape while optimizing another, enabling analysis of transitions between training phases.

Result: First demonstration of Hessian sharpness phenomena at scale (up to 7B parameters) spanning both pre-training and mid-training of OLMo-2 models. Critical sharpness captures well-documented phenomena including progressive sharpening and Edge of Stability.

Conclusion: Critical sharpness provides a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. Scalable curvature measures can provide actionable insights for large-scale training.

Abstract: Understanding the curvature evolution of the loss landscape is fundamental to analyzing the training dynamics of neural networks. The most commonly studied measure, Hessian sharpness ($λ_{\max}^H$) – the largest eigenvalue of the loss Hessian – determines local training stability and interacts with the learning rate throughout training. Despite its significance in analyzing training dynamics, direct measurement of Hessian sharpness remains prohibitive for Large Language Models (LLMs) due to high computational cost. We analyze $\textit{critical sharpness}$ ($λ_c$), a computationally efficient measure requiring fewer than $10$ forward passes given the update direction $Δ\mathbfθ$. Critically, this measure captures well-documented Hessian sharpness phenomena, including progressive sharpening and Edge of Stability. Using this measure, we provide the first demonstration of these sharpness phenomena at scale, up to $7$B parameters, spanning both pre-training and mid-training of OLMo-2 models. We further introduce $\textit{relative critical sharpness}$ ($λ_c^{1\to 2}$), which quantifies the curvature of one loss landscape while optimizing another, to analyze the transition from pre-training to fine-tuning and guide data mixing strategies. Critical sharpness provides practitioners with a practical tool for diagnosing curvature dynamics and informing data composition choices at scale. More broadly, our work shows that scalable curvature measures can provide actionable insights for large-scale training.

[1526] Auto-bidding under Return-on-Spend Constraints with Uncertainty Quantification

Jiale Han, Chun Gan, Chengcheng Zhang, Jie He, Zhangang Lin, Ching Law, Xiaowu Dai

Main category: cs.LG

TL;DR: A conformal prediction-based method for auto-bidding systems that handles unknown ad impression values, providing uncertainty quantification and performance guarantees for budget and Return-on-Spend constrained bidding without requiring true value knowledge.

Details

Motivation: Existing auto-bidding systems often assume known ad impression values (like conversion rates), but in reality these values are unknown. There's a need for methods that can handle this uncertainty while maintaining performance guarantees for budget and Return-on-Spend constraints.

Method: Uses conformal prediction to quantify uncertainty of ad impression values based on ML predictions from historical bidding data with contextual features. Introduces an adjusted value estimator derived from ML predictions and prediction intervals, then applies this to enhance existing auto-bidding algorithms with budget and RoS constraints.

Result: Theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on simulated and real-world industrial datasets show improved performance while maintaining computational efficiency.

Conclusion: The proposed method effectively handles unknown ad values in auto-bidding systems using conformal prediction, providing performance guarantees without requiring true value knowledge, and is compatible with existing industry ML systems.

Abstract: Auto-bidding systems are widely used in advertising to automatically determine bid values under constraints such as total budget and Return-on-Spend (RoS) targets. Existing works often assume that the value of an ad impression, such as the conversion rate, is known. This paper considers the more realistic scenario where the true value is unknown. We propose a novel method that uses conformal prediction to quantify the uncertainty of these values based on machine learning methods trained on historical bidding data with contextual features, without assuming the data are i.i.d. This approach is compatible with current industry systems that use machine learning to predict values. Building on prediction intervals, we introduce an adjusted value estimator derived from machine learning predictions, and show that it provides performance guarantees without requiring knowledge of the true value. We apply this method to enhance existing auto-bidding algorithms with budget and RoS constraints, and establish theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on both simulated and real-world industrial datasets demonstrate that our approach improves performance while maintaining computational efficiency.

[1527] From Fuzzy to Exact: The Halo Architecture for Infinite-Depth Reasoning via Rational Arithmetic

Hansheng Ren

Main category: cs.LG

TL;DR: The paper argues that current AI’s use of approximate floating-point arithmetic fundamentally limits reasoning capabilities, proposing exact rational arithmetic hardware (Halo Architecture) as essential for AGI.

Details

Motivation: Current deep learning prioritizes computational throughput over numerical precision, creating a trade-off that conflicts with requirements for general intelligence. The authors argue that high-order causal reasoning - essential for AGI - requires arbitrary-precision, logically consistent arithmetic, and that prevalent LLM failures (logical hallucinations, incoherence) stem from limitations of IEEE 754 floating-point arithmetic where approximation errors compound catastrophically.

Method: Proposes the Halo Architecture which transitions computational foundation from approximate reals to exact rationals. This is realized through a custom Exact Inference Unit (EIU) featuring asynchronous MIMD reduction and dual-modular redundancy to resolve performance and reliability bottlenecks of exact computation at scale.

Result: In rigorous simulations, 600B-parameter BF16 models fail in chaotic systems within steps, while Halo sustains perfect numerical fidelity indefinitely. The architecture demonstrates that exact arithmetic can maintain computational integrity where approximate methods fail.

Conclusion: Exact arithmetic is non-negotiable for advancing reasoning-capable AGI. The work provides a co-designed hardware-software path toward verifiable, exascale-ready AI systems that can support high-order causal reasoning without numerical degradation.

Abstract: The pursuit of scale in deep learning has entrenched a trade-off: computational throughput is prioritized at the expense of numerical precision. We argue this compromise is fundamentally at odds with the requirements of general intelligence. We propose the \textbf{Exactness Hypothesis}: high-order causal reasoning – a cornerstone of AGI – demands a substrate supporting \textbf{arbitrary-precision, logically consistent arithmetic}. We trace prevalent LLM failures, such as logical hallucinations and incoherence, to the inherent limitations of IEEE 754 floating-point arithmetic, where approximation errors compound catastrophically in deep functions. As a solution, we present the \textbf{Halo Architecture}, which transitions the computational foundation from approximate reals ($\mathbb{R}$) to exact rationals ($\mathbb{Q}$). Halo is realized through a custom \textbf{Exact Inference Unit (EIU)}, whose design – featuring asynchronous MIMD reduction and dual-modular redundancy – resolves the performance and reliability bottlenecks of exact computation at scale. In rigorous simulations, 600B-parameter BF16 models fail in chaotic systems within steps, while Halo sustains \textbf{perfect numerical fidelity} indefinitely. Our work posits exact arithmetic as non-negotiable for advancing reasoning-capable AGI and provides a co-designed hardware-software path toward verifiable, exascale-ready AI systems.

[1528] Unrolled Graph Neural Networks for Constrained Optimization

Samar Hadou, Alejandro Ribeiro

Main category: cs.LG

TL;DR: Unrolling dual ascent algorithm dynamics in coupled GNNs for constrained optimization, with primal and dual networks interacting to find Lagrangian saddle points.

Details

Motivation: To solve constrained optimization problems using neural networks that can generalize to out-of-distribution problems while maintaining theoretical guarantees from optimization algorithms.

Method: Unroll dual ascent algorithm dynamics in two coupled GNNs - primal GNN finds stationary points for given dual multipliers, dual GNN refines estimates. Impose descent/ascent constraints to mirror DA algorithm dynamics, with joint training alternating between primal and dual updates.

Result: Approach yields near-optimal near-feasible solutions and demonstrates good generalization to out-of-distribution problems in numerical experiments.

Conclusion: Coupled GNNs unrolling dual ascent algorithm dynamics provide an effective framework for solving constrained optimization problems with generalization capabilities.

Abstract: In this paper, we unroll the dynamics of the dual ascent (DA) algorithm in two coupled graph neural networks (GNNs) to solve constrained optimization problems. The two networks interact with each other at the layer level to find a saddle point of the Lagrangian. The primal GNN finds a stationary point for a given dual multiplier, while the dual network iteratively refines its estimates to reach an optimal solution. We force the primal and dual networks to mirror the dynamics of the DA algorithm by imposing descent and ascent constraints. We propose a joint training scheme that alternates between updating the primal and dual networks. Our numerical experiments demonstrate that our approach yields near-optimal near-feasible solutions and generalizes well to out-of-distribution (OOD) problems.

[1529] Toward Learning POMDPs Beyond Full-Rank Actions and State Observability

Seiji Shaw, Travis Manderson, Chad Kessens, Nicholas Roy

Main category: cs.LG

TL;DR: Paper presents method to learn POMDP parameters (transition/observation matrices) from action-observation sequences using spectral approaches and tensor decompositions, enabling autonomous agents to reason about systems with hidden states.

Details

Motivation: Enable autonomous agents to learn and reason about systems with hidden states (like furniture with hidden locking mechanisms) by learning POMDP parameters from action-observation sequences, addressing limitations of existing spectral methods that can't estimate transition/observation likelihoods needed for downstream reasoning.

Method: Combines Predictive State Representations (PSRs) with tensor decompositions to learn observation and transition matrices up to a similarity transform. Method learns matrices up to a partition of states where states in same partition have identical observation distributions for actions with full-rank transition matrices.

Result: Partition-level transition models learned with sufficient data match performance of PSRs when used with standard sampling-based POMDP solvers. Explicit observation and transition likelihoods can be leveraged to specify planner behavior after model learning.

Conclusion: Proposed method successfully learns POMDP parameters from action-observation sequences, enabling agents to reason about hidden-state systems while providing explicit transition/observation likelihoods that existing spectral methods lack.

Abstract: We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as furniture with hidden locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP’s actions and observation spaces, but not its state space, transitions, or observation models. These properties must be constructed from action-observation sequences. Spectral approaches to learning models of partially observable domains, such as learning Predictive State Representations (PSRs), are known to directly estimate the number of hidden states. These methods cannot, however, yield direct estimates of transition and observation likelihoods, which are important for many downstream reasoning tasks. Other approaches leverage tensor decompositions to estimate transition and observation likelihoods but often assume full state observability and full-rank transition matrices for all actions. To relax these assumptions, we study how PSRs learn transition and observation matrices up to a similarity transform, which may be estimated via tensor methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states in a single partition have the same observation distributions corresponding to actions whose transition matrices are full-rank. Our experiments suggest that these partition-level transition models learned by our method, with a sufficient amount of data, meets the performance of PSRs as models to be used by standard sampling-based POMDP solvers. Furthermore, the explicit observation and transition likelihoods can be leveraged to specify planner behavior after the model has been learned.

[1530] Stability of In-Context Learning: A Spectral Coverage Perspective

Tongxi Wang, Zhuoyang Xia

Main category: cs.LG

TL;DR: The paper proposes a spectral-coverage proxy method to determine optimal prompt length for in-context learning by analyzing demonstration representation statistics, providing verifiable guidance for ICL prompt design.

Details

Motivation: In-context learning reliability varies with demonstration count, but measuring distributional stability directly is expensive at scale, making prompt-length selection largely heuristic.

Method: Uses a spectral-coverage proxy based on the lower tail of the spectrum of a regularized empirical second-moment matrix of demonstration representations. Derives non-asymptotic sample-size requirements under sub-Gaussian assumptions and provides a two-stage estimator for conservative prompt-length recommendations.

Result: The method consistently upper-bounds empirical accuracy knee-points in large-scale experiments. Direct resampling-based stability measurements validate the stability interpretation. A calibration step tightens conservatism while preserving ordering.

Conclusion: Provides practical and verifiable guidance for ICL prompt design through a computable sufficient condition for distributional stability, addressing the prompt-length selection problem with theoretical guarantees and empirical validation.

Abstract: In-context learning (ICL) is a pivotal capability for the practical deployment of large-scale language models, yet its reliability can vary substantially with the number of demonstrations provided in the prompt. A central obstacle is that the target notion, \emph{distributional stability under demonstration resampling}, is expensive to measure directly at scale, making prompt-length selection largely heuristic. We therefore study a \emph{computable sufficient condition} based on a spectral-coverage proxy: the lower tail of the spectrum of a regularized empirical second-moment matrix formed from demonstration representations. Under sub-Gaussian representation assumptions, we derive a non-asymptotic sample-size requirement (a lower bound on $K$) that guarantees this proxy event with prescribed failure probability, yielding a conservative prompt-length recommendation produced by an observable two-stage estimator. In large-scale experiments, the resulting estimates consistently upper-bound empirical accuracy knee-points, which we treat only as a practical surrogate for the prompt-length transition rather than a definition of stability. On a smaller held-out subset, direct resampling-based distributional stability measurements further validate the intended stability interpretation. Finally, a validation-only calibration step tightens the conservatism (typically to about $1.03$–$1.20\times$) while preserving conservative ordering, providing practical and verifiable guidance for ICL prompt design.

[1531] FloydNet: A Learning Paradigm for Global Relational Reasoning

Jingcheng Yu, Mingliang Zeng, Qiwei Ye

Main category: cs.LG

TL;DR: FloydNet introduces a dynamic programming-based architecture for graph reasoning that replaces local message passing with global state refinement, achieving state-of-the-art performance on algorithmic benchmarks and graph reasoning tasks.

Details

Motivation: Current Graph Neural Networks (GNNs) are limited by their local message-passing mechanism, which creates a bottleneck for global, holistic reasoning. The authors argue that dynamic programming offers a more powerful paradigm for complex multi-step reasoning on graphs.

Method: FloydNet maintains a global all-pairs relationship tensor and learns a generalized dynamic programming operator to progressively refine it. This enables the model to develop a task-specific relational calculus for capturing long-range dependencies, achieving 3-WL (2-FWL) expressive power.

Result: FloydNet achieves near-perfect scores (>99%) on CLRS-30 algorithmic benchmark, finds exact optimal solutions for TSP at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on BREC benchmark.

Conclusion: Learned dynamic programming-style refinement provides a powerful and practical alternative to message passing for high-level graph reasoning, establishing a new paradigm for complex reasoning tasks.

Abstract: Developing models capable of complex, multi-step reasoning is a central goal in artificial intelligence. While representing problems as graphs is a powerful approach, Graph Neural Networks (GNNs) are fundamentally constrained by their message-passing mechanism, which imposes a local bottleneck that limits global, holistic reasoning. We argue that dynamic programming (DP), which solves problems by iteratively refining a global state, offers a more powerful and suitable learning paradigm. We introduce FloydNet, a new architecture that embodies this principle. In contrast to local message passing, FloydNet maintains a global, all-pairs relationship tensor and learns a generalized DP operator to progressively refine it. This enables the model to develop a task-specific relational calculus, providing a principled framework for capturing long-range dependencies. Theoretically, we prove that FloydNet achieves 3-WL (2-FWL) expressive power, and its generalized form aligns with the k-FWL hierarchy. FloydNet demonstrates state-of-the-art performance across challenging domains: it achieves near-perfect scores (often >99%) on the CLRS-30 algorithmic benchmark, finds exact optimal solutions for the general Traveling Salesman Problem (TSP) at rates significantly exceeding strong heuristics, and empirically matches the 3-WL test on the BREC benchmark. Our results establish this learned, DP-style refinement as a powerful and practical alternative to message passing for high-level graph reasoning.

[1532] Evidence for Limited Metacognition in LLMs

Christopher Ackerman

Main category: cs.LG

TL;DR: Researchers develop a novel methodology to quantitatively evaluate metacognitive abilities in LLMs by testing strategic deployment of internal state knowledge, finding frontier models since early 2024 show increasing evidence of metacognition.

Details

Motivation: The paper addresses the growing public concern about LLM self-awareness and sentience, noting that while these concepts have major safety and policy implications, the science for measuring them remains underdeveloped. The authors aim to move beyond self-reports and establish quantitative methods for evaluating metacognitive abilities in LLMs.

Method: Inspired by metacognition research in nonhuman animals, the methodology eschews model self-reports and instead tests strategic deployment of internal state knowledge. Using two experimental paradigms, researchers examine whether models can assess and utilize their own confidence in answering questions correctly, and whether they can anticipate and appropriately use information about what answers they would give. The approach includes analysis of token probabilities to identify upstream internal signals that could support metacognition.

Result: Frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically: 1) ability to assess and utilize confidence in factual and reasoning question performance, and 2) ability to anticipate future answers and use that information appropriately. Token probability analysis suggests presence of upstream internal signals that could provide basis for metacognition. However, these abilities are limited in resolution, emerge context-dependently, and appear qualitatively different from human metacognition. Intriguing differences across similarly capable models suggest post-training may influence metacognitive development.

Conclusion: The study establishes a novel quantitative methodology for evaluating metacognition in LLMs, demonstrating that frontier models exhibit measurable metacognitive abilities that are improving over time. However, these abilities differ from human metacognition and have limitations. The findings have implications for AI safety and policy discussions around LLM self-awareness.

Abstract: The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.

[1533] A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction

Jinkyu Sung, Myunggeum Jee, Joonseok Lee

Main category: cs.LG

TL;DR: A scalable Gaussian copula-based method for link sign prediction on signed graphs that addresses computational challenges through edge embeddings and reformulated conditional probability distributions.

Details

Motivation: Traditional graph methods assume homophily (similar adjacent nodes), but signed graphs contain negative edges that violate this assumption. Existing methods require auxiliary structures, and naive modeling of edge-edge relations is computationally intractable for moderate-scale graphs.

Method: Extends CopulaGNN by: 1) representing the correlation matrix as a Gramian of edge embeddings to reduce parameters, and 2) reformulating the conditional probability distribution to dramatically reduce inference cost. Uses Gaussian copula to model latent statistical dependency among edges.

Result: The method achieves significantly faster convergence than baselines while maintaining competitive prediction performance with state-of-the-art models. Theoretical verification shows linear convergence, proving scalability.

Conclusion: Proposed approach provides a scalable solution for link sign prediction on signed graphs by addressing computational challenges through parameter reduction and efficient inference, making Gaussian copula modeling practical for moderate-scale graphs.

Abstract: Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, a naive modeling of edge-edge relations is computationally intractable even for a graph with moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.

[1534] Beyond Accuracy and Complexity: The Effective Information Criterion for Structurally Stable Symbolic Regression

Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li

Main category: cs.LG

TL;DR: Proposes Effective Information Criterion (EIC) to quantify formula rationality in symbolic regression by measuring structural stability through noise amplification during recursive calculation, improving discovery of physically plausible equations.

Details

Motivation: Traditional symbolic regression focuses on accuracy and complexity but often produces formulas that are numerically ill-conditioned and physically inexplicable, lacking the structural stability of real physical laws.

Method: Introduces EIC that models formulas as information channels and measures amplification of inherent rounding noise during recursive calculation to distinguish physically plausible structures from pathological ones without ground truth.

Result: EIC reveals structural stability gap between human-derived and SR-discovered equations; improves heuristic search Pareto frontiers; boosts generative model pre-training efficiency 2-4x and generalization R2 by 22.4%; aligns with 70% of human expert preferences.

Conclusion: Structural stability is critical for human-perceived interpretability, and EIC provides effective guidance for discovering physically plausible symbolic expressions in regression tasks.

Abstract: Symbolic regression (SR) traditionally balances accuracy and complexity, implicitly assuming that simpler formulas are structurally more rational. We argue that this assumption is insufficient: existing algorithms often exploit this metric to discover accurate and compact but structurally irrational formulas that are numerically ill-conditioned and physically inexplicable. Inspired by the structural stability of real physical laws, we propose the Effective Information Criterion (EIC) to quantify formula rationality. EIC models formulas as information channels and measures the amplification of inherent rounding noise during recursive calculation, effectively distinguishing physically plausible structures from pathological ones without relying on ground truth. Our analysis reveals a stark structural stability gap between human-derived equations and SR-discovered results. By integrating EIC into SR workflows, we provide explicit structural guidance: for heuristic search, EIC steers algorithms toward stable regions to yield superior Pareto frontiers; for generative models, EIC-based filtering improves pre-training sample efficiency by 2-4 times and boosts generalization R2 by 22.4%. Finally, an extensive study with 108 human experts shows that EIC aligns with human preferences in 70% of cases, validating structural stability as a critical prerequisite for human-perceived interpretability.

[1535] Membership Inference Attacks Against Fine-tuned Diffusion Language Models

Yuetian Chen, Kaiyuan Zhang, Yuntao Du, Edoardo Stoppa, Charles Fleming, Ashish Kundu, Bruno Ribeiro, Ninghui Li

Main category: cs.LG

TL;DR: First systematic investigation of Membership Inference Attacks on Diffusion Language Models, showing DLMs are more vulnerable than autoregressive models due to multiple maskable configurations, with proposed SAMA attack achieving 30% AUC improvement over baselines.

Details

Motivation: Diffusion Language Models (DLMs) are emerging as promising alternatives to autoregressive language models, but their privacy vulnerabilities via Membership Inference Attacks remain critically underexplored. Unlike autoregressive models with fixed prediction patterns, DLMs' multiple maskable configurations create exponentially more attack opportunities that need systematic investigation.

Method: Proposes SAMA (Subset-Aggregated Membership Attack) which addresses sparse signal challenge through robust aggregation. Samples masked subsets across progressive densities and applies sign-based statistics effective despite heavy-tailed noise. Uses inverse-weighted aggregation prioritizing sparse masks’ cleaner signals to transform sparse memorization detection into robust voting mechanism.

Result: Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over best baseline, with up to 8 times improvement at low false positive rates. Reveals significant, previously unknown vulnerabilities in DLMs.

Conclusion: DLMs have significant privacy vulnerabilities via Membership Inference Attacks due to their multiple maskable configurations, necessitating development of tailored privacy defenses for these emerging models.

Abstract: Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models’ single fixed prediction pattern, DLMs’ multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks’ cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8 times improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.

[1536] Aurora: Towards Universal Generative Multimodal Time Series Forecasting

Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo

Main category: cs.LG

TL;DR: Aurora is a multimodal time series foundation model that supports multimodal inputs and zero-shot inference for cross-domain time series forecasting, using modality-guided attention and prototype-guided flow matching.

Details

Motivation: Existing time series forecasting models either lack explicit utilization of multimodal domain knowledge (unimodal foundation models) or don't support zero-shot inference for cross-domain scenarios (end-to-end multimodal supervised models). Cross-domain generalization is crucial because similar historical patterns can lead to different future trends due to domain-specific characteristics.

Method: Aurora uses tokenization, encoding, and distillation to extract multimodal domain knowledge from text and images. It employs Modality-Guided Multi-head Self-Attention to inject this knowledge into temporal representations. For forecasting, it uses a novel Prototype-Guided Flow Matching approach where multimodal representations generate conditions and prototypes for future tokens.

Result: Comprehensive experiments on 5 benchmarks (TimeMMD, TSFM-Bench, ProbTS, TFB, and EPF) demonstrate state-of-the-art performance on both unimodal and multimodal scenarios.

Conclusion: Aurora effectively addresses cross-domain generalization in time series forecasting by leveraging multimodal domain knowledge through a novel architecture that supports zero-shot inference and demonstrates superior performance across diverse benchmarks.

Abstract: Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on Corss-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corrsponding text or image modalities, thus possessing strong Cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on 5 well-recognized benchmarks, including TimeMMD, TSFM-Bench, ProbTS, TFB, and EPF, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.

[1537] The Flood Complex: Large-Scale Persistent Homology on Millions of Points

Florian Graf, Paolo Pellizzoni, Martin Uray, Stefan Huber, Roland Kwitt

Main category: cs.LG

TL;DR: Flood complex: A scalable persistent homology method for large point clouds using Delaunay triangulation of subsets and ball coverage, enabling efficient computation on millions of points.

Details

Motivation: Existing persistent homology methods (Vietoris-Rips, Alpha complex) face computational limitations with large-scale point clouds due to exponential growth of simplices, preventing their use in downstream ML tasks.

Method: Introduces Flood complex that contains simplices from Delaunay triangulation of a small subset, where simplices are fully covered by balls of radius r emanating from the full point cloud (“flooding”).

Result: Enables PH computation up to dimension 2 on millions of 3D points, shows superior object classification performance on real-world/synthetic data compared to other PH methods and neural networks.

Conclusion: Flood complex provides scalable persistent homology computation with theoretical guarantees and GPU parallelization, addressing computational bottlenecks for large point clouds in ML applications.

Abstract: We consider the problem of computing persistent homology (PH) for large-scale Euclidean point cloud data, aimed at downstream machine learning tasks, where the exponential growth of the most widely-used Vietoris-Rips complex imposes serious computational limitations. Although more scalable alternatives such as the Alpha complex or sparse Rips approximations exist, they often still result in a prohibitively large number of simplices. This poses challenges in the complex construction and in the subsequent PH computation, prohibiting their use on large-scale point clouds. To mitigate these issues, we introduce the Flood complex, inspired by the advantages of the Alpha and Witness complex constructions. Informally, at a given filtration value $r\geq 0$, the Flood complex contains all simplices from a Delaunay triangulation of a small subset of the point cloud $X$ that are fully covered by balls of radius $r$ emanating from $X$, a process we call flooding. Our construction allows for efficient PH computation, possesses several desirable theoretical properties, and is amenable to GPU parallelization. Scaling experiments on 3D point cloud data show that we can compute PH of up to dimension 2 on several millions of points. Importantly, when evaluating object classification performance on real-world and synthetic data, we provide evidence that this scaling capability is needed, especially if objects are geometrically or topologically complex, yielding performance superior to other PH-based methods and neural networks for point cloud data. Source code and datasets are available on https://github.com/plus-rkwitt/flooder.

[1538] HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, Yanghua Xiao

Main category: cs.LG

TL;DR: HER is a framework for cognitive-level persona simulation in LLMs that introduces dual-layer thinking (first-person character thinking vs third-person LLM thinking) and uses reasoning-augmented data and human-aligned reward models to improve role-playing capabilities.

Details

Motivation: Current LLM role-playing models effectively capture character tones and knowledge but struggle to simulate the inner thoughts behind behaviors. There's a need for cognitive simulation with high-quality reasoning traces and reliable reward signals aligned with human preferences.

Method: Proposes HER framework with dual-layer thinking, curates reasoning-augmented role-playing data via reverse engineering, constructs human-aligned principles and reward models, and trains models based on Qwen3-32B using supervised and reinforcement learning.

Result: HER models significantly outperform Qwen3-32B baseline, achieving 30.26% improvement on CoSER benchmark and 14.97% gain on Minimax Role-Play Bench, demonstrating effectiveness of the approach.

Conclusion: HER provides a unified framework for cognitive-level persona simulation that addresses limitations in reasoning traces and reward signals, enabling more sophisticated LLM role-playing with inner thought simulation.

Abstract: LLM role-playing, i.e., using LLMs to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a challenge. Towards cognitive simulation in LLM role-play, previous efforts mainly suffer from two deficiencies: data with high-quality reasoning traces, and reliable reward signals aligned with human preferences. In this paper, we propose HER, a unified framework for cognitive-level persona simulation. HER introduces dual-layer thinking, which distinguishes characters’ first-person thinking from LLMs’ third-person thinking. To bridge these gaps, we curate reasoning-augmented role-playing data via reverse engineering and construct human-aligned principles and reward models. Leveraging these resources, we train HER models based on Qwen3-32B via supervised and reinforcement learning. Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26 improvement on the CoSER benchmark and a 14.97 gain on the Minimax Role-Play Bench. Our datasets, principles, and models will be released to facilitate future research.

[1539] Monotonic Transformation Invariant Multi-task Learning

Surya Murthy, Kushagra Gupta, Mustafa O. Karabag, David Fridovich-Keil, Ufuk Topcu

Main category: cs.LG

TL;DR: DiBS-MTL: A multi-task learning method based on cooperative bargaining theory that is invariant to task loss scaling, preventing task domination and improving performance on poorly-scaled tasks.

Details

Motivation: Traditional MTL methods suffer from task domination when task losses are arbitrarily scaled, causing certain tasks to dominate training and degrade overall performance. Current methods are sensitive to loss scaling and require heuristics for weighting.

Method: Adapts the Direction-based Bargaining Solution (DiBS) from cooperative bargaining theory to MTL. DiBS yields Pareto stationary solutions immune to task domination due to invariance to monotonic nonaffine task loss transformations. The paper proposes DiBS-MTL, a computationally efficient adaptation of DiBS for MTL settings.

Result: Proves convergence of DiBS to Pareto stationary points for nonconvex losses. Empirically shows DiBS-MTL is competitive with leading MTL methods on standard benchmarks and drastically outperforms baselines on tasks with poorly-scaled losses.

Conclusion: DiBS-MTL provides a theoretically grounded approach to MTL that is invariant to task loss scaling, addressing a fundamental challenge in multi-task learning and offering practical advantages over existing methods.

Abstract: Multi-task learning (MTL) algorithms typically rely on schemes that combine different task losses or their gradients through weighted averaging. These methods aim to find Pareto stationary points by using heuristics that require access to task loss values, gradients, or both. In doing so, a central challenge arises because task losses can be arbitrarily scaled relative to one another, causing certain tasks to dominate training and degrade overall performance. A recent advance in cooperative bargaining theory, the Direction-based Bargaining Solution (DiBS), yields Pareto stationary solutions immune to task domination because of its invariance to monotonic nonaffine task loss transformations. However, the convergence behavior of DiBS in nonconvex MTL settings is currently not understood. To this end, we prove that under standard assumptions, a subsequence of DiBS iterates converges to a Pareto stationary point when task losses are nonconvex, and propose DiBS-MTL, an adaptation of DiBS to the MTL setting which is more computationally efficient that prior bargaining-inspired MTL approaches. Finally, we empirically show that DiBS-MTL is competitive with leading MTL methods on standard benchmarks, and it drastically outperforms state-of-the-art baselines in multiple examples with poorly-scaled task losses, highlighting the importance of invariance to nonaffine monotonic transformations of the loss landscape. Code available at https://github.com/suryakmurthy/dibs-mtl

[1540] Latent Adversarial Regularization for Offline Preference Optimization

Enyi Jiang, Yibo Jacky Zhang, Yinglun Xu, Andreas Haupt, Nancy Amato, Sanmi Koyejo

Main category: cs.LG

TL;DR: GANPO introduces latent-space regularization for language model preference optimization using adversarial training to align internal representations between policy and reference models, improving robustness under distributional shift.

Details

Motivation: Traditional preference optimization for language models uses token-level regularization, but token-space similarity doesn't guarantee semantic or behavioral similarity. This limitation motivates using latent-space regularization to better capture meaningful similarities.

Method: GANPO uses adversarial training inspired by GANs to minimize divergence between internal representations of policy and reference models. It integrates this latent-space regularization into existing offline preference optimization objectives as a regularizer.

Result: Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.

Conclusion: Latent-space regularization through adversarial training offers a promising alternative to token-level regularization for language model preference optimization, providing better robustness to distributional shifts while preserving performance.

Abstract: Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.

[1541] Pretraining Scaling Laws for Generative Evaluations of Language Models

Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo

Main category: cs.LG

TL;DR: The paper proposes and evaluates three different pretraining scaling laws for predicting pass-at-k performance on generative evaluations, analyzing their stability and predictive capabilities.

Details

Motivation: While neural scaling laws are well-established for pretraining losses and discriminative benchmarks, generative benchmarks like mathematical problem-solving or software engineering remain under-explored. The paper aims to develop scaling laws specifically for generative evaluations to help forecast model performance.

Method: The authors propose three different pretraining scaling laws: (1) based on pretraining compute, (2) based on model parameters and pretraining tokens, and (3) based on log likelihoods of gold reference solutions. They evaluate these scaling laws for fitting pass-at-k on generative evaluations and for predicting performance of expensive models using cheaper models.

Result: Key findings: (1) Generative evaluations introduce hyperparameters (like k) that modulate scaling behavior and predictability; (2) The gold reference likelihood law shows unique stability across ~5 orders of magnitude, while other laws stabilize for only ~1.5-2.5 orders; (3) All three scaling laws perform comparably in prediction, with slight variations based on k value; (4) Theoretical connection established showing compute scaling law emerges as compute-optimal envelope of parameters-and-tokens law.

Conclusion: The framework provides insights and methodologies for forecasting generative performance, accelerating progress toward models that can reason, solve, and create. The study reveals important differences in scaling behavior between generative and discriminative evaluations.

Abstract: Neural scaling laws have driven the field’s ever-expanding exponential growth in parameters, data and compute. While scaling behaviors for pretraining losses and discriminative benchmarks are well established, generative benchmarks such as mathematical problem-solving or software engineering remain under-explored. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using cheaper models. Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions. First, we demonstrate that generative evaluations introduce new hyperparameters (in our setting, $k$) that act as a control lever for scaling behavior, modulating both the scaling law parameters and the predictability of performance. Second, we identify a stark difference in parameter stability: while the compute and parameters+tokens laws stabilize for only the last $\mathord{\sim}1.5\mathord{-}2.5$ orders of magnitude, the gold reference likelihood law is uniquely stable, converging across $\mathord{\sim}5$ orders. Third, in terms of predictive performance, we find all three scaling laws perform comparably, although the compute law predicts slightly worse for small $k$ and the gold reference law predicts slightly worse for large $k$. Finally, we establish a theoretical connection, proving that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.

[1542] EUGens: Efficient, Unified, and General Dense Layers

Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski

Main category: cs.LG

TL;DR: EUGens are efficient dense layers that generalize standard fully-connected layers using random features and input norm dependence, reducing quadratic to linear complexity while maintaining expressive power.

Details

Motivation: Fully-connected feedforward layers create computation and parameter bottlenecks in neural networks, limiting scalability for real-time applications and resource-constrained environments.

Method: Propose EUGens (Efficient, Unified and General dense layers) that use random features to approximate standard FFLs, incorporate input norm dependence, unify existing efficient FFL extensions, and enable linear-time inference.

Result: EUGens achieve up to 27% faster inference and 30% memory efficiency improvements when integrated into Transformers and MLPs across image classification, language model pre-training, and 3D scene reconstruction tasks.

Conclusion: EUGens offer scalable deployment of large-scale neural networks by reducing computational overhead while preserving expressive power, with potential for real-world applications.

Abstract: Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, \textbf{E}fficient, \textbf{U}nified and \textbf{Gen}eral dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to \textbf{the first} unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EuGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to \textbf{27}%) and memory efficiency (up to \textbf{30}%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.

[1543] A Forensic Analysis of Synthetic Data in RL: Diagnosing and Solving Algorithmic Failures in Model-Based Policy Optimization

Brett Barkley, David Fridovich-Keil

Main category: cs.LG

TL;DR: MBPO’s synthetic data generation can degrade performance in DeepMind Control Suite despite success in OpenAI Gym, revealing environment-specific algorithmic assumptions and failure modes in model-policy coevolution.

Details

Motivation: To understand why Model-Based Policy Optimization (MBPO) underperforms compared to model-free methods in DeepMind Control Suite despite reported success in OpenAI Gym, and to identify the underlying failure modes that prevent policy improvement.

Method: Analyzes MBPO’s performance across seven challenging DMC tasks, identifies two coupled failure modes: scale mismatches between dynamics and reward models causing critic underestimation, and poor target representation inflating model variance. Proposes solutions to address these issues.

Result: After addressing the identified failure modes, MBPO outperforms Soft Actor-Critic in five of seven DMC tasks while preserving strong performance in OpenAI Gym, enabling policy improvement where none was previously possible.

Conclusion: Environment-specific assumptions become implicitly encoded into algorithm design when evaluation is limited. The community should develop taxonomies linking MDP task/environment structure to algorithmic failure modes and clarify how benchmark choices shape algorithm generalization.

Abstract: Synthetic data is a core component of data-efficient Dyna-style model-based reinforcement learning, yet it can also degrade performance. We study when it helps, where it fails, and why, and we show that addressing the resulting failure modes enables policy improvement that was previously unattainable. We focus on Model-Based Policy Optimization (MBPO), which performs actor and critic updates using synthetic action counterfactuals. Despite reports of strong and generalizable sample-efficiency gains in OpenAI Gym, recent work shows that MBPO often underperforms its model-free counterpart, Soft Actor-Critic (SAC), in the DeepMind Control Suite (DMC). Although both suites involve continuous control with proprioceptive robots, this shift leads to sharp performance losses across seven challenging DMC tasks, with MBPO failing in cases where claims of generalization from Gym would imply success. This reveals how environment-specific assumptions can become implicitly encoded into algorithm design when evaluation is limited. We identify two coupled issues behind these failures: scale mismatches between dynamics and reward models that induce critic underestimation and hinder policy improvement during model-policy coevolution, and a poor choice of target representation that inflates model variance and produces error-prone rollouts. Addressing these failure modes enables policy improvement where none was previously possible, allowing MBPO to outperform SAC in five of seven tasks while preserving the strong performance previously reported in OpenAI Gym. Rather than aiming only for incremental average gains, we hope our findings motivate the community to develop taxonomies that tie MDP task- and environment-level structure to algorithmic failure modes, pursue unified solutions where possible, and clarify how benchmark choices ultimately shape the conditions under which algorithms generalize.

[1544] Normality Calibration in Semi-supervised Graph Anomaly Detection

Guolei Zeng, Hezhe Qiao, Guoguo Ai, Jinsong Guo, Guansong Pang

Main category: cs.LG

TL;DR: GraphNC: A graph normality calibration framework for semi-supervised graph anomaly detection that uses both labeled and unlabeled data to calibrate normality in anomaly score and representation spaces.

Details

Motivation: Existing semi-supervised graph anomaly detection methods overfit to labeled normal nodes, leading to high detection errors like false positives. The learned normality is limited to labeled patterns.

Method: Two-component framework: 1) Anomaly Score Distribution Alignment (ScoreDA) aligns anomaly scores with teacher model distribution, 2) Perturbation-based Normality Regularization (NormReg) makes normal node representations more compact via consistency loss on labeled nodes.

Result: The method effectively separates anomaly scores of normal and abnormal classes, making them more separable while regularizing representations to mitigate misleading scores from teacher model.

Conclusion: GraphNC improves semi-supervised graph anomaly detection by calibrating normality across both score and representation spaces using both labeled and unlabeled data.

Abstract: Graph anomaly detection (GAD) has attracted growing interest for its crucial ability to uncover irregular patterns in broad applications. Semi-supervised GAD, which assumes a subset of annotated normal nodes available during training, is among the most widely explored application settings. However, the normality learned by existing semi-supervised GAD methods is limited to the labeled normal nodes, often inclining to overfitting the given patterns. These can lead to high detection errors, such as high false positives. To overcome this limitation, we propose GraphNC , a graph normality calibration framework that leverages both labeled and unlabeled data to calibrate the normality from a teacher model (a pre-trained semi-supervised GAD model) jointly in anomaly score and node representation spaces. GraphNC includes two main components, anomaly score distribution alignment (ScoreDA) and perturbation-based normality regularization (NormReg). ScoreDA optimizes the anomaly scores of our model by aligning them with the score distribution yielded by the teacher model. Due to accurate scores in most of the normal nodes and part of the anomaly nodes in the teacher model, the score alignment effectively pulls the anomaly scores of the normal and abnormal classes toward the two ends, resulting in more separable anomaly scores. Nevertheless, there are inaccurate scores from the teacher model. To mitigate the misleading by these scores, NormReg is designed to regularize the graph normality in the representation space, making the representations of normal nodes more compact by minimizing a perturbation-guided consistency loss solely on the labeled nodes.

[1545] Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference

Yizhi Liu

Main category: cs.LG

TL;DR: The paper analyzes instability in differentiable matching layers using entropy-regularized Optimal Transport, identifies Premature Mode Collapse as a fundamental failure mechanism, and proposes an adaptive scheduling algorithm (Efficient PH-ASC) to stabilize inference by monitoring stability and reducing computational overhead.

Details

Motivation: The motivation is to address the notorious instability in recovering discrete permutations via annealing in differentiable matching layers, which are critical for structural prediction but suffer from premature mode collapse during the annealing process.

Method: The method involves analyzing the non-normal dynamics of the Sinkhorn fixed-point map to reveal a thermodynamic speed limit, then proposing Efficient PH-ASC - an adaptive scheduling algorithm that monitors inference stability and enforces a linear stability law, reducing computational overhead from O(N³) to amortized O(1).

Result: The paper demonstrates that under standard exponential cooling, the shift in target posterior outpaces the contraction rate of the inference operator, forcing trajectories into spurious local basins. The proposed algorithm decouples expensive spectral diagnostics from training and provides a practical solution with available implementation.

Conclusion: The paper concludes that premature mode collapse is a fundamental failure mechanism in differentiable matching layers, and the proposed Efficient PH-ASC algorithm provides an effective adaptive scheduling approach to stabilize inference while maintaining computational efficiency.

Abstract: Differentiable matching layers, often implemented via entropy-regularized Optimal Transport, serve as a critical approximate inference mechanism in structural prediction. However, recovering discrete permutations via annealing $ε\to 0$ is notoriously unstable. We identify a fundamental mechanism for this failure: \textbf{Premature Mode Collapse}. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map, we reveal a theoretical \textbf{thermodynamic speed limit}. Under standard exponential cooling, the shift in the target posterior ($O(1)$) outpaces the contraction rate of the inference operator, which degrades as $O(1/ε)$. This mismatch inevitably forces the inference trajectory into spurious local basins. To address this, we propose \textbf{Efficient PH-ASC}, an adaptive scheduling algorithm that monitors the stability of the inference process. By enforcing a linear stability law, we decouple expensive spectral diagnostics from the training loop, reducing overhead from $O(N^3)$ to amortized $O(1)$. Our implementation and interactive demo are available at https://github.com/xxx0438/torch-sinkhorn-asc and https://huggingface.co/spaces/leon0923/torch-sinkhorn-asc-demo. bounded away from zero in generic training dynamics unless the feature extractor converges unrealistically fast.

[1546] Flatness-Aware Stochastic Gradient Langevin Dynamics

Stefano Bruno, Youngsik Hwang, Jaehyeon An, Sotirios Sabanis, Dong-Young Lim

Main category: cs.LG

TL;DR: fSGLD is a first-order optimization method that biases learning toward flat basins while maintaining computational efficiency, with theoretical convergence guarantees and strong empirical performance across various benchmarks.

Details

Motivation: The paper is motivated by the importance of loss landscape flatness for understanding deep learning behavior and generalization. The authors aim to develop an optimization method that leverages flatness insights while maintaining practical efficiency.

Method: Proposes Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), which biases learning dynamics toward flat basins while retaining SGD/SGLD efficiency. Includes theoretical analysis of convergence to flatness-biased Gibbs distribution with prescribed β-σ coupling.

Result: Empirical evaluation shows strong performance across optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection. The theoretically prescribed β-σ coupling proves effective compared to decoupled choices.

Conclusion: fSGLD provides an effective flatness-aware optimization method with theoretical guarantees and practical benefits for uncertainty estimation and generalization.

Abstract: Flatness of the loss landscape has been widely studied as an important perspective for understanding the behavior and generalization of deep learning algorithms. Motivated by this view, we propose Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), a first-order optimization method that biases learning its dynamics toward flat basins while retaining the computational and memory efficiency of SGD and SGLD. We provide a non-asymptotic theoretical analysis showing that fSGLD converges to a flatness-biased Gibbs distribution under a theoretically prescribed coupling between the noise scale $σ$ and the inverse temperature $β$, together with explicit excess risk guarantees. We empirically evaluate fSGLD across standard optimizer benchmarks, Bayesian image classification, uncertainty quantification, and out-of-distribution detection, demonstrating consistently strong performance and reliable uncertainty estimates. Additional experiments confirm the effectiveness of the theoretically prescribed $β$-$σ$ coupling compared to decoupled choices.

[1547] To See Far, Look Close: Evolutionary Forecasting for Long-term Time Series

Jiaming Ma, Siyuan Mu, Ruilin Tang, Haofeng Ma, Qihe Huang, Zhengyang Zhou, Pengkun Wang, Binwu Wang, Yang Wang

Main category: cs.LG

TL;DR: EF paradigm enables single model to outperform task-specific ensembles in long-term time series forecasting by mitigating gradient conflicts in direct forecasting approaches

Details

Motivation: Direct Forecasting (DF) paradigm for Long-term Time Series Forecasting requires computationally expensive re-training for each target horizon and suffers from optimization pathologies where conflicting gradients from distant futures hinder learning of local dynamics

Method: Proposes Evolutionary Forecasting (EF) as a unified generative framework where DF is a degenerate special case. EF trains models on short horizons and uses evolutionary reasoning to extend predictions, avoiding gradient conflicts in optimization

Result: A single EF model surpasses task-specific DF ensembles across standard benchmarks and shows robust asymptotic stability in extreme extrapolation scenarios

Conclusion: EF enables a paradigm shift from passive Static Mapping to autonomous Evolutionary Reasoning in LTSF, offering more efficient and stable forecasting with a single model

Abstract: The prevailing Direct Forecasting (DF) paradigm dominates Long-term Time Series Forecasting (LTSF) by forcing models to predict the entire future horizon in a single forward pass. While efficient, this rigid coupling of output and evaluation horizons necessitates computationally prohibitive re-training for every target horizon. In this work, we uncover a counter-intuitive optimization anomaly: models trained on short horizons-when coupled with our proposed Evolutionary Forecasting (EF) paradigm-significantly outperform those trained directly on long horizons. We attribute this success to the mitigation of a fundamental optimization pathology inherent in DF, where conflicting gradients from distant futures cripple the learning of local dynamics. We establish EF as a unified generative framework, proving that DF is merely a degenerate special case of EF. Extensive experiments demonstrate that a singular EF model surpasses task-specific DF ensembles across standard benchmarks and exhibits robust asymptotic stability in extreme extrapolation. This work propels a paradigm shift in LTSF: moving from passive Static Mapping to autonomous Evolutionary Reasoning.

[1548] TROLL: Trust Regions improve Reinforcement Learning for Large Language Models

Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, Gerhard Neumann

Main category: cs.LG

TL;DR: TROLL replaces PPO’s clipping mechanism with a principled discrete differentiable trust region projection for token-level KL constraints, improving RL fine-tuning stability and performance for LLMs.

Details

Motivation: PPO's clipping mechanism is a crude approximation of KL-based trust regions that causes unstable updates and suboptimal performance in RL fine-tuning of LLMs. The authors aim to replace this with a more principled approach.

Method: Introduces TROLL (Trust Region Optimization for Large Language models) which replaces PPO’s clip objective with a discrete differentiable trust region projection. The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and effectiveness, providing principled token-level KL constraints.

Result: TROLL consistently outperforms PPO-like clipping across mathematical reasoning and code generation tasks, model families, and advantage-estimation methods in terms of training speed, stability, and final success rates.

Conclusion: The proposed trust region projection serves as a direct replacement for PPO-like clipping during training without altering inference behavior, offering improved RL fine-tuning for LLMs.

Abstract: Reinforcement Learning (RL) with PPO-like clip objectives has become the standard choice for reward-based fine-tuning of large language models (LLMs). Although recent work has explored improved estimators of advantages and normalization, the clipping mechanism itself has remained untouched. Originally introduced as a proxy for principled KL-based trust regions, clipping is a crude approximation that often causes unstable updates and suboptimal performance. We replace the clip objective with a novel discrete differentiable trust region projection, which provides principled token-level KL constraints. The projection operates on a sparse subset of the model’s most important token logits to balance computational cost and projection effectiveness. Our approach, Trust Region Optimization for Large Language models (TROLL), serves as a direct replacement for PPO-like clipping during training and does not alter the model’s inference behavior. Across mathematical reasoning and code generation tasks, model families, as well as advantage-estimation methods, TROLL consistently outperforms PPO-like clipping in terms of training speed, stability, and final success rates.

[1549] TEON: Tensorized Orthonormalization Beyond Layer-Wise Muon for Large Language Model Pre-Training

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Dongyang Li, Yupeng Su, Sijia Liu, Zheng Zhang

Main category: cs.LG

TL;DR: TEON extends Muon optimizer by modeling neural network gradients as structured higher-order tensors for cross-layer orthogonalization, improving training efficiency and convergence over layer-wise approaches.

Details

Motivation: Muon optimizer shows strong performance in LLM pre-training through layer-wise gradient orthogonalization, but this approach treats each layer independently. The authors aim to develop a more principled approach that captures cross-layer gradient relationships through tensor modeling.

Method: TEON models neural network gradients as structured higher-order tensors rather than independent matrices per layer. It extends orthogonalization beyond individual layers, provides improved convergence guarantees, and develops practical implementations with approximate SVD schemes for efficiency.

Result: TEON consistently improves training and validation perplexity across GPT-style (130M-774M) and LLaMA-style (60M-1B) models, showing strong robustness under various approximate SVD schemes and better performance than layer-wise Muon.

Conclusion: TEON provides a principled tensor-based generalization of Muon that captures cross-layer gradient relationships, offering improved convergence guarantees and practical performance gains across different model architectures and scales.

Abstract: The Muon optimizer has demonstrated strong empirical performance in pre-training large language models by performing matrix-level gradient (or momentum) orthogonalization in each layer independently. In this work, we propose TEON, a principled generalization of Muon that extends orthogonalization beyond individual layers by modeling the gradients of a neural network as a structured higher-order tensor. We present TEON’s improved convergence guarantee over layer-wise Muon, and further develop a practical instantiation of TEON based on the theoretical analysis with corresponding ablation. We evaluate our approach on two widely adopted architectures: GPT-style models, ranging from 130M to 774M parameters, and LLaMA-style models, ranging from 60M to 1B parameters. Experimental results show that TEON consistently improves training and validation perplexity across model scales and exhibits strong robustness under various approximate SVD schemes.

[1550] Hypernetwork-Driven Low-Rank Adaptation Across Attention Heads

Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: HyRA introduces hypernetwork-driven low-rank adaptation for multi-head attention, generating joint low-rank matrices across attention heads to improve cross-head information sharing and sample efficiency compared to standard LoRA.

Details

Motivation: Existing LoRA methods fine-tune each attention head independently in multi-head self-attention, overlooking potential interactions and shared structure among heads, leading to redundant feature learning.

Method: Proposes Hypernetwork-Driven Low-rank Adaptation (HyRA) that uses a hypernetwork to generate joint low-rank matrices for all attention heads within a layer, promoting cross-head information sharing and avoiding redundant feature learning.

Result: HyRA consistently outperforms existing PEFT baselines across language and vision benchmarks, with substantial improvements over LoRA in low-data regimes, demonstrating practical sample efficiency.

Conclusion: HyRA provides a more effective parameter-efficient fine-tuning approach for multi-head attention by enabling cross-head information sharing through hypernetwork-driven adaptation, particularly beneficial in data-scarce scenarios.

Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a powerful paradigm for adapting large-scale pre-trained models to downstream tasks with minimal additional parameters. Among PEFT methods, Low-Rank Adaptation (LoRA) stands out for its effectiveness by inserting trainable low-rank matrices into weight updates to enable efficient adaptation. However, when applied to multi-head self-attention, existing LoRA-based methods typically fine-tune each attention head independently, overlooking potential interactions and shared structure among heads. To address this limitation, we propose Hypernetwork-Driven Low-rank Adaptation (HyRA) that employs a hypernetwork to generate joint low-rank matrices for all attention heads within a layer. The shared generator promotes cross-head information sharing, helping low-rank modules avoid the redundant feature learning seen in traditional LoRA methods. Theoretically, our method achieves significantly better sample efficiency compared to standard LoRA. Empirically, we evaluate HyRA on a comprehensive suite of language and vision benchmarks. Our approach consistently outperforms existing parameter-efficient fine-tuning (PEFT) baselines across a wide range of tasks. Notably, in low-data regimes, HyRA achieves substantial improvements over LoRA, underscoring its practical sample efficiency and effectiveness in data-scarce scenarios.

[1551] When LLM Agents Meet Graph Optimization: An Automated Data Quality Improvement Approach

Zhihan Zhang, Xunkai Li, Yilong Zuo, Yanzhe Wen, Zhaoxin Fan, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: LAGA is a multi-agent framework using LLMs and graph agents for comprehensive quality optimization of text-attributed graphs, addressing textual, structural, and label imperfections through automated detection, planning, action, and evaluation cycles.

Details

Motivation: Text-attributed graphs (TAGs) combine structural relationships with textual semantics but suffer from data quality issues (textual, structural, label imperfections) that degrade GNN performance. Existing approaches lack systematic, comprehensive quality optimization across all aspects of TAG data.

Method: LAGA (Large Language and Graph Agent) is a unified multi-agent framework with four coordinated agents: detection (identifies quality issues), planning (creates optimization strategies), action (executes improvements), and evaluation (assesses results). It performs automated, multi-modal optimization of textual, structural, and label aspects in a continuous loop.

Result: Extensive experiments on 5 datasets with 16 baselines across 9 scenarios demonstrate LAGA’s effectiveness, robustness, and scalability in improving TAG quality and enhancing GNN performance under various data imperfections.

Conclusion: Data-centric quality optimization is crucial for reliable TAG analytics, and LAGA provides a comprehensive, automated solution that outperforms existing approaches by systematically addressing all aspects of TAG data quality through coordinated multi-agent optimization.

Abstract: Text-attributed graphs (TAGs) have become a key form of graph-structured data in modern data management and analytics, combining structural relationships with rich textual semantics for diverse applications. However, the effectiveness of analytical models, particularly graph neural networks (GNNs), is highly sensitive to data quality. Our empirical analysis shows that both conventional and LLM-enhanced GNNs degrade notably under textual, structural, and label imperfections, underscoring TAG quality as a key bottleneck for reliable analytics. Existing studies have explored data-level optimization for TAGs, but most focus on specific degradation types and target a single aspect like structure or label, lacking a systematic and comprehensive perspective on data quality improvement. To address this gap, we propose LAGA (Large Language and Graph Agent), a unified multi-agent framework for comprehensive TAG quality optimization. LAGA formulates graph quality control as a data-centric process, integrating detection, planning, action, and evaluation agents into an automated loop. It holistically enhances textual, structural, and label aspects through coordinated multi-modal optimization. Extensive experiments on 5 datasets and 16 baselines across 9 scenarios demonstrate the effectiveness, robustness and scalability of LAGA, confirming the importance of data-centric quality optimization for reliable TAG analytics.

[1552] Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning

Shangzhe Li, Dongruo Zhou, Weitong Zhang

Main category: cs.LG

TL;DR: MB-AIL: Model-based adversarial imitation learning algorithm with horizon-free, second-order sample complexity guarantees for online interaction with offline expert demonstrations

Details

Motivation: Address gaps in understanding benefits of online interaction and impact of stochasticity in adversarial imitation learning, where agents learn from offline expert demonstrations without reward access

Method: Introduces model-based AIL algorithm (MB-AIL) with general function approximations for both expert data and reward-free interactions, establishing horizon-free second-order sample complexity guarantees

Result: MB-AIL achieves minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations, matching lower bounds for expert demonstrations in terms of horizon, precision, and policy variance

Conclusion: Theoretical analysis shows MB-AIL’s optimality, and experiments validate that practical implementation matches or surpasses sample efficiency of existing methods

Abstract: We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $ε$ and the policy variance $σ^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.

[1553] Designing ReLU Generative Networks to Enumerate Trees with a Given Tree Edit Distance

Mamoona Ghafoor, Tatsuya Akutsu

Main category: cs.LG

TL;DR: Theoretical proof that ReLU-based generative networks with size O(n³) and constant depth can generate all trees within a specified tree edit distance from a given tree.

Details

Motivation: Tree generation with specified edit distance has applications in computational biology, structured data analysis, and image processing, but existing generative models lack theoretical guarantees about network size/depth needed for exact tree generation.

Method: Prove existence and construct ReLU-based generative networks that can generate all rooted, ordered, vertex-labeled trees within edit distance d from a given tree T. Networks have size O(n³) and constant depth.

Result: Implemented networks generate all valid trees up to 21 nodes within specified edit distance, while state-of-the-art models GraphRNN and GraphGDP only achieve 35% and 48% validation rates respectively.

Conclusion: Provides theoretical foundation for compact generative models and enables exact, valid tree-structured data generation, opening new directions for structured data synthesis.

Abstract: The generation of trees with a specified tree edit distance has significant applications across various fields, including computational biology, structured data analysis, and image processing. Recently, generative networks have been increasingly employed to synthesize new data that closely resembles the original datasets. However, the appropriate size and depth of generative networks required to generate data with a specified tree edit distance remain unclear. In this paper, we theoretically establish the existence and construction of generative networks capable of producing trees similar to a given tree with respect to the tree edit distance. Specifically, for a given rooted, ordered, and vertex-labeled tree T of size n + 1 with labels from an alphabet Σ, and a non-negative integer d, we prove that all rooted, ordered, and vertex-labeled trees over Σwith tree edit distance at most d from T can be generated using a ReLU-based generative network with size O(n^3 ) and constant depth. The proposed networks were implemented and evaluated for generating trees with up to 21 nodes. Due to their deterministic architecture, the networks successfully generated all valid trees within the specified tree edit distance. In contrast, state-of-the-art graph generative models GraphRNN and GraphGDP, which rely on non-deterministic mechanisms, produced significantly fewer valid trees, achieving validation rates of only up to 35% and 48%, respectively. These findings provide a theoretical foundation towards construction of compact generative models and open new directions for exact and valid tree-structured data generation. An implementation of the proposed networks is available at https://github.com/MGANN-KU/TreeGen_ReLUNetworks.

[1554] Prediction Markets with Intermittent Contributions

Michael Vitali, Pierre Pinson

Main category: cs.LG

TL;DR: A prediction market framework that enables collaborative forecasting while addressing data ownership concerns through performance-based reward allocation and adaptive learning.

Details

Motivation: Increasing data availability and demand for accurate forecasts are constrained by data ownership and competitive interests, creating barriers to collaboration between stakeholders.

Method: Proposes a prediction market design using robust regression models to learn optimal forecast combinations while handling missing submissions, with a payoff allocation mechanism considering both in-sample and out-of-sample performance.

Result: Case studies using simulated and real-world data demonstrate the effectiveness and adaptability of the proposed market design in enabling collaborative forecasting.

Conclusion: The prediction market framework provides a viable solution for collaborative forecasting that addresses data ownership concerns while maintaining desirable economic properties.

Abstract: Although both data availability and the demand for accurate forecasts are increasing, collaboration between stakeholders is often constrained by data ownership and competitive interests. In contrast to recent proposals within cooperative game-theoretical frameworks, we place ourselves in a more general framework, based on prediction markets. There, independent agents trade forecasts of uncertain future events in exchange for rewards. We introduce and analyse a prediction market that (i) accounts for the historical performance of the agents, (ii) adapts to time-varying conditions, while (iii) permitting agents to enter and exit the market at will. The proposed design employs robust regression models to learn the optimal forecasts’ combination whilst handling missing submissions. Moreover, we introduce a pay-off allocation mechanism that considers both in-sample and out-of-sample performance while satisfying several desirable economic properties. Case-studies using simulated and real-world data allow demonstrating the effectiveness and adaptability of the proposed market design.

[1555] Functional Distribution Networks (FDN)

Omer Haq

Main category: cs.LG

TL;DR: Functional Distribution Networks (FDN) address overconfidence in probabilistic regressors under distribution shift by placing input-conditioned distributions over network weights, producing adaptive uncertainty estimates through predictive mixtures trained with Monte Carlo beta-ELBO.

Details

Motivation: Modern probabilistic regressors often remain overconfident under distribution shift, failing to provide reliable uncertainty estimates when inputs deviate from training distribution. There's a need for models that can adapt uncertainty dispersion based on input characteristics.

Method: FDN places input-conditioned distributions over network weights, producing predictive mixtures whose dispersion adapts to the input. They are trained with a Monte Carlo beta-ELBO objective. The approach includes an evaluation protocol separating interpolation from extrapolation and emphasizing OOD sanity checks.

Result: On 1D tasks and small/medium UCI-style regression benchmarks, FDN remains competitive in accuracy with Bayesian, ensemble, dropout, and hypernetwork baselines while providing strongly input-dependent, shift-aware uncertainty and competitive calibration under matched parameter and update budgets.

Conclusion: FDN offers a promising approach for improving uncertainty quantification in regression tasks under distribution shift, with adaptive uncertainty estimation that responds to input characteristics while maintaining competitive predictive performance.

Abstract: Modern probabilistic regressors often remain overconfident under distribution shift. Functional Distribution Networks (FDN) place input-conditioned distributions over network weights, producing predictive mixtures whose dispersion adapts to the input; we train them with a Monte Carlo beta-ELBO objective. We pair FDN with an evaluation protocol that separates interpolation from extrapolation and emphasizes simple OOD sanity checks. On controlled 1D tasks and small/medium UCI-style regression benchmarks, FDN remains competitive in accuracy with strong Bayesian, ensemble, dropout, and hypernetwork baselines, while providing strongly input-dependent, shift-aware uncertainty and competitive calibration under matched parameter and update budgets.

[1556] VeFA: Vector-Based Feature Space Adaptation for Robust Model Fine-Tuning

Peng Wang, Minghao Gu, Qiang Huang

Main category: cs.LG

TL;DR: VeFA is a feature-space fine-tuning method that prevents catastrophic forgetting by performing element-wise feature adaptation, keeping weights within pre-trained model’s column space to avoid intruder dimensions.

Details

Motivation: Address catastrophic forgetting in fine-tuning when downstream data is limited or differs from pre-training distribution. Existing parameter-efficient methods operate in weight space and can create intruder dimensions that overwrite pre-trained knowledge.

Method: Vector-based Feature Adaptation (VeFA) operates in feature space rather than weight space. It performs element-wise adaptation on individual features, ensuring fine-tuned weights remain within the column space of pre-trained weight matrix. Inspired by effect equivalence modeling of downstream lurking variables.

Result: VeFA achieves comparable fine-tuning performance to LoRA while consistently exhibiting stronger robustness across image classification, NLU, and NLG benchmarks. Preserves pre-trained representations and improves generalization under distribution shift.

Conclusion: Feature-space adaptation via VeFA effectively mitigates catastrophic forgetting by avoiding intruder dimensions, preserving pre-trained knowledge, and enhancing model robustness during fine-tuning.

Abstract: Catastrophic forgetting is a well-documented challenge in model fine-tuning, particularly when the downstream domain has limited labeled data or differs substantially from the pre-training distribution. Existing parameter-efficient fine-tuning methods largely operate in the weight space by modifying or augmenting the parameters of the pre-trained model, which can lead to models that are overly specialized to the observed downstream data. Recent studies suggest that one mechanism underlying such forgetting is the introduction of intruder dimensions into the representation space during fine-tuning. To mitigate the risk of overwriting pre-trained knowledge and to enhance robustness, we propose Vector-based Feature Adaptation (VeFA), a new fine-tuning method that operates directly in the feature space, which naturally avoids generating intruder dimensions. VeFA performs element-wise adaptation on individual features, thereby ensuring that the effective fine-tuned weights always remain within the column space of the pre-trained weight matrix. This feature-space adaptation perspective is inspired by the idea of effect equivalence modeling (EEM) of downstream lurking variables that induce distribution shifts, which posits that the influence of unobserved factors can be represented as an equivalent aggregate effect on observed features. By compensating for the effects of downstream lurking variables via a lightweight feature-level transformation, VeFA preserves the pre-trained representations and improves model generalization under distribution shift. We evaluate VeFA against LoRA on image classification, NLU, and NLG benchmarks, considering both standard fine-tuning performance and robustness; across these tasks, VeFA achieves comparable fine-tuning performance while consistently exhibiting stronger robustness.

[1557] ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs

Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

Main category: cs.LG

TL;DR: ELUTQ is an efficient quantization framework with Hierarchical Linear Quantization (HLQ) format that improves low-bit weight quantization accuracy and eliminates dequantization overhead using bit-serial LUT-based GEMM operations.

Details

Motivation: Existing hardware-friendly quantization methods rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings, limiting deployment of large language models on edge devices.

Method: Proposes Hierarchical Linear Quantization (HLQ) format that better captures weight statistics and uses bit-serial LUT-based GEMM operations to eliminate dequantization overhead. Includes optimized quantization pipeline and high-performance kernels for edge deployment.

Result: Achieves performance comparable to QAT methods without retraining, completes quantization of LLaMA 3.1-70B with only 64GB CPU memory and 48GB VRAM, and achieves 1.5x speedup over AWQ on RTX 3090 for 2-bit LLaMA3.1-8B.

Conclusion: ELUTQ provides an efficient quantization framework that improves low-bit accuracy, reduces hardware requirements, and enables efficient deployment of large language models on edge devices with significant speed improvements.

Abstract: Weight quantization effectively reduces memory consumption and enable the deployment of Large Language Models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights and eliminate dequantization overhead using Bit-serial LUT-based GEMM operations. HLQ significantly improves model accuracy under low-bit settings and achieves performance comparable to QAT methods without any retraining of the weights. Moreover, an optimized quantization pipeline is integrated into ELUTQ, enabling it to complete the quantization of LLaMA 3.1-70B using only 64 GB of CPU memory and 48 GB of VRAM, reducing the hardware requirements for large-scale model quantization. To enable efficient deployment on edge devices, ELUTQ designs high-performance kernels to support end-to-end inference. Our 2-bit LLaMA3.1-8B achieves 1.5x speedup over AWQ on RTX 3090. Code is available at https://github.com/Nkniexin/ELUTQ.

[1558] Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity

Prakhar Ganesh, Hsiang Hsu, Golnoosh Farnadi

Main category: cs.LG

TL;DR: Paper introduces a neighboring datasets framework to study how data processing choices affect model multiplicity, finding that datasets with greater inter-class overlap have lower multiplicity, and applies this to active learning and data imputation.

Details

Motivation: Prior work on model multiplicity has focused on modeling choices while overlooking the critical role of data. The authors aim to understand how data processing decisions affect multiplicity and develop practical methods to address it.

Method: Introduces a neighboring datasets framework where data processing is reframed as choosing between neighboring datasets. Theoretically analyzes relationship between inter-class distribution overlap and multiplicity. Applies framework to active learning and data imputation, conducting systematic studies of existing algorithms and proposing novel multiplicity-aware methods.

Result: Found counterintuitive theoretical relationship: neighboring datasets with greater inter-class distribution overlap exhibit lower multiplicity. Developed multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation methods.

Conclusion: Data plays a crucial role in model multiplicity, and the neighboring datasets framework provides a useful perspective for understanding and addressing multiplicity in practical applications like active learning and data imputation.

Abstract: Multiplicity, the existence of equally good yet competing models, has received growing attention in recent years. While prior work has emphasized modelling choices, the critical role of data in shaping multiplicity has been largely overlooked. In this work, we first introduce a neighbouring datasets framework, arguing that much of data processing can be reframed as choosing between neighbouring datasets. Under this framework, we find a counterintuitive theoretical relationship: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. Building on this insight, we apply our framework to two domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation.

[1559] BOND: License to Train with Black-Box Functions

Andrew Clark, Jack Moursounidis, Osmaan Rasouli, William Gan, Cooper Doyle, Anna Leontjeva

Main category: cs.LG

TL;DR: BOND is a perturbative gradient estimation method for black-box functions that adaptively bounds perturbations for accurate sign estimation, enabling end-to-end training of architectures with non-autodifferentiable modules.

Details

Motivation: To enable gradient-based training of neural networks that incorporate non-autodifferentiable modules (like frozen networks or hardware components), overcoming the limitation that many useful transformations cannot be differentiated through standard autodiff.

Method: Bounded Numerical Differentiation (BOND) uses adaptive perturbation bounds to estimate gradients of black-box functions. It operates at black-box interfaces, making it more accurate and scalable than existing perturbative methods for gradient estimation.

Result: BOND enables end-to-end training of architectures with non-autodifferentiable modules. Experiments show these frozen modules can enhance model performance without increasing trainable parameters, and provide insights into adaptive optimizer dynamics.

Conclusion: BOND facilitates hybrid architectures combining differentiable and non-differentiable components, pointing toward hybrid analogue-digital devices as a path to scaling networks while leveraging fixed transformations to expand model capacity.

Abstract: We introduce Bounded Numerical Differentiation (BOND), a perturbative method for estimating the gradients of black-box functions. BOND is distinguished by its formulation, which adaptively bounds perturbations to ensure accurate sign estimation, and by its implementation, which operates at black-box interfaces. This enables BOND to be more accurate and scalable compared to existing methods, facilitating end-to-end training of architectures that incorporate non-autodifferentiable modules. We observe that these modules, implemented in our experiments as frozen networks, can enhance model performance without increasing the number of trainable parameters. Our findings highlight the potential of leveraging fixed transformations to expand model capacity, pointing to hybrid analogue - digital devices as a path to scaling networks, and provides insights into the dynamics of adaptive optimizers.

[1560] On Purely Private Covariance Estimation

Tommaso d’Orsi, Gleb Novikov

Main category: cs.LG

TL;DR: A differentially private covariance matrix estimation method that achieves optimal error bounds across various norms, with different mechanisms for large and small datasets.

Details

Motivation: Existing differentially private covariance estimation methods have suboptimal error bounds, especially for small datasets. The paper aims to develop a mechanism that achieves information-theoretically optimal error guarantees across different matrix norms under pure differential privacy.

Method: Proposes a simple perturbation mechanism for releasing d-dimensional covariance matrices under pure differential privacy. For large datasets (n ≥ d²/ε), uses a direct mechanism. For small datasets (n < d²/ε), projects the output onto a nuclear norm ball of appropriate radius to achieve optimal error bounds.

Result: Achieves provably optimal Frobenius norm error for large datasets, best known error for all p-Schatten norms (p ∈ [1,∞]), and information-theoretically optimal error for all p ≥ 2. For small datasets, achieves optimal Frobenius norm error O(√(d·Tr(Σ)/n)), improving over previous bounds.

Conclusion: The proposed mechanism provides a simple yet optimal differentially private covariance estimator that achieves information-theoretically optimal error guarantees across various matrix norms, with improved performance for small datasets through nuclear norm projection.

Abstract: We present a simple perturbation mechanism for the release of $d$-dimensional covariance matrices $Σ$ under pure differential privacy. For large datasets with at least $n\geq d^2/\varepsilon$ elements, our mechanism recovers the provably optimal Frobenius norm error guarantees of \cite{nikolov2023private}, while simultaneously achieving best known error for all other $p$-Schatten norms, with $p\in [1,\infty]$. Our error is information-theoretically optimal for all $p\ge 2$, in particular, our mechanism is the first purely private covariance estimator that achieves optimal error in spectral norm. For small datasets $n< d^2/\varepsilon$, we further show that by projecting the output onto the nuclear norm ball of appropriate radius, our algorithm achieves the optimal Frobenius norm error $O(\sqrt{d;\text{Tr}(Σ) /n})$, improving over the known bounds of $O(\sqrt{d/n})$ of \cite{nikolov2023private} and ${O}\big(d^{3/4}\sqrt{\text{Tr}(Σ)/n}\big)$ of \cite{dong2022differentially}.

[1561] VecComp: Vector Computing via MIMO Digital Over-the-Air Computation

Saeed Razavikia, José Mairton Barros Da Silva Junior, Carlo Fischione

Main category: cs.LG

TL;DR: VecComp extends ChannelComp framework to enable vector function computation over wireless channels using multiple antennas, maintaining linear complexity scaling with vector dimension and robustness to channel fading.

Details

Motivation: ChannelComp enables digital over-the-air computation of arbitrary functions but is limited to scalar computations and susceptible to channel fading. Many data-centric applications require vector-based computations, necessitating a more robust and scalable framework.

Method: Generalizes ChannelComp by integrating it with multiple-antenna technology to enable vector function computation. The approach ensures computational complexity scales only linearly with vector dimension and provides robustness against channel impairments.

Result: Establishes non-asymptotic upper bound on mean squared error, confirming computation efficiency under fading conditions. Numerical experiments demonstrate effectiveness in improving vector function computation and fading compensation over noisy multiple-access channels.

Conclusion: VecComp provides a scalable, computationally efficient framework for high-dimensional vector function computation over wireless channels, addressing limitations of scalar-only ChannelComp while maintaining robustness to channel fading.

Abstract: Recently, the ChannelComp framework has proposed digital over-the-air computation by designing digital modulations that enable the computation of arbitrary functions. Unlike traditional analog over-the-air computation, which is restricted to nomographic functions, ChannelComp enables a broader range of computational tasks while maintaining compatibility with digital communication systems. This framework is intended for applications that favor local information processing over the mere acquisition of data. However, ChannelComp is currently designed for scalar function computation, while numerous data-centric applications necessitate vector-based computations, and it is susceptible to channel fading. In this work, we introduce a generalization of the ChannelComp framework, called VecComp, by integrating ChannelComp with multiple-antenna technology. This generalization not only enables vector function computation but also ensures scalability in the computational complexity, which increases only linearly with the vector dimension. As such, VecComp remains computationally efficient and robust against channel impairments, making it suitable for high-dimensional, data-centric applications. We establish a non-asymptotic upper bound on the mean squared error of VecComp, affirming its computation efficiency under fading channel conditions. Numerical experiments show the effectiveness of VecComp in improving the computation of vector functions and fading compensation over noisy and fading multiple-access channels.

[1562] Graph Homomorphism Distortion: A Metric to Distinguish Them All and in the Latent Space Bind Them

Martin Carrasco, Olga Zaghen, Kavir Sumaraj, Erik Bekkers, Bastian Rieck

Main category: cs.LG

TL;DR: A new graph homomorphism distortion metric that measures feature distortion when mapping between graphs, addressing the interplay between structure and features in graph learning.

Details

Motivation: Existing graph neural network expressivity measures ignore features and focus only on structure, making it difficult to assess similarity between graphs with close features. There's a need for metrics that capture both structural and feature-based similarities.

Method: Develops a novel (pseudo-)metric based on graph homomorphisms, inspired by metric geometry concepts. The graph homomorphism distortion measures minimal worst-case distortion of node features when mapping one graph to another.

Result: The metric can be efficiently calculated under some assumptions, complements existing expressivity measures like 1-WL, and enables structural encodings that improve graph neural network predictive capabilities.

Conclusion: The proposed graph homomorphism distortion metric effectively addresses the interplay between structure and features in graph learning, providing a useful tool for assessing graph similarity and enhancing graph neural network performance.

Abstract: A large driver of the complexity of graph learning is the interplay between structure and features.When analyzing the expressivity of graph neural networks, however, existing approaches ignore features in favor of structure, making it nigh-impossible to assess to what extent two graphs with close features should be considered similar.We address this by developing a new (pseudo-)metric based on graph homomorphisms.Inspired by concepts from metric geometry, our graph homomorphism distortion measures the minimal worst-case distortion that node features of one graph are subjected to when mapping one graph to another.We demonstrate the utility of our novel measure by showing that (i.) it can be efficiently calculated under some additional assumptions, (ii.) it complements existing expressivity measures like $1$-WL, and (iii.)it permits defining structural encodings, which improve the predictive capabilities of graph neural networks.

Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, Yong Jae Lee

Main category: cs.LG

TL;DR: A method to detect test-set contamination in Vision-Language Models using multimodal semantic perturbations, showing contaminated models fail to generalize under controlled perturbations.

Details

Motivation: Test-set leakage in VLMs using internet-scale pretraining data inflates benchmark performance, but detection methods for contaminated VLMs are underexplored compared to LLM contamination detection.

Method: Deliberately contaminate open-source VLMs on popular benchmarks, then propose a detection method based on multimodal semantic perturbation where contaminated models fail to generalize under controlled perturbations.

Result: Existing detection approaches fail or show inconsistent behavior, while the proposed perturbation method effectively detects contamination across multiple realistic contamination strategies.

Conclusion: The proposed multimodal semantic perturbation method provides a robust and effective way to detect test-set contamination in VLMs, addressing a critical gap in model evaluation.

Abstract: Recent advances in Vision-Language Models (VLMs) have achieved state-of-the-art performance on numerous benchmark tasks. However, the use of internet-scale, often proprietary, pretraining corpora raises a critical concern for both practitioners and users: inflated performance due to test-set leakage. While prior works have proposed mitigation strategies such as decontamination of pretraining data and benchmark redesign for LLMs, the complementary direction of developing detection methods for contaminated VLMs remains underexplored. To address this gap, we deliberately contaminate open-source VLMs on popular benchmarks and show that existing detection approaches either fail outright or exhibit inconsistent behavior. We then propose a novel simple yet effective detection method based on multi-modal semantic perturbation, demonstrating that contaminated models fail to generalize under controlled perturbations. Finally, we validate our approach across multiple realistic contamination strategies, confirming its robustness and effectiveness. The code and perturbed dataset are released at https://github.com/jadenpark0/mm-perturb.

[1564] Forgetting is Everywhere

Ben Sanati, Thomas L. Lee, Trevor McInroe, Aidan Scannell, Nikolay Malkin, David Abel, Amos Storkey

Main category: cs.LG

TL;DR: The paper proposes a unified theory of forgetting in learning algorithms, defining it as lack of self-consistency in predictive distributions and loss of predictive information, with exact Bayesian inference shown to enable adaptation without forgetting.

Details

Motivation: The fundamental challenge of catastrophic forgetting in general learning algorithms, where models forget past knowledge when adapting to new data, lacks a unified theoretical understanding despite decades of study.

Method: Develops an algorithm- and task-agnostic theory characterizing forgetting as lack of self-consistency in predictive distributions and loss of predictive information. Proposes a general measure of forgetting propensity and demonstrates that exact Bayesian inference enables adaptation without forgetting.

Result: Comprehensive experiments across classification, regression, generative modeling, and reinforcement learning empirically demonstrate forgetting is present across all deep learning settings and significantly impacts learning efficiency.

Conclusion: Establishes a principled understanding of forgetting and lays foundation for analyzing and improving information retention capabilities of general learning algorithms.

Abstract: A fundamental challenge in developing general learning algorithms is their tendency to forget past knowledge when adapting to new data. Addressing this problem requires a principled understanding of forgetting; yet, despite decades of study, no unified definition has emerged that provides insights into the underlying dynamics of learning. We propose an algorithm- and task-agnostic theory that characterises forgetting as a lack of self-consistency in a learner’s predictive distribution, manifesting as a loss of predictive information. Our theory naturally yields a general measure of an algorithm’s propensity to forget and demonstrates that exact Bayesian inference allows for adaptation without forgetting. To validate the theory, we design a comprehensive set of experiments that span classification, regression, generative modelling, and reinforcement learning. We empirically demonstrate how forgetting is present across all deep learning settings and plays a significant role in determining learning efficiency. Together, these results establish a principled understanding of forgetting and lay the foundation for analysing and improving the information retention capabilities of general learning algorithms.

[1565] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum

Main category: cs.LG

TL;DR: FarSkip-Collective modifies MoE model architectures with skip connections to overlap computation with communication, enabling efficient distributed training/inference while maintaining accuracy for models up to 109B parameters.

Details

Motivation: Communication overhead is a major bottleneck for efficient distributed execution of Mixture-of-Experts (MoE) models. Current architectures cannot effectively overlap computation with communication, limiting performance in distributed settings.

Method: Modifies model architecture by introducing skip connections that allow communication to be overlapped with computation. Uses self-distillation to convert state-of-the-art models (16B to 109B parameters) while maintaining accuracy. Implements optimized overlapping of communication with computation in training/inference frameworks.

Result: Successfully converted models including Llama 4 Scout (109B) with average accuracy within 1% of original releases across downstream evaluations. Achieved communication-computation overlap benefits through optimized implementations.

Conclusion: FarSkip-Collective enables efficient distributed execution of large MoE models by architecturally enabling communication-computation overlap while preserving model capabilities, accelerating both training and inference.

Abstract: Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.

[1566] Personalized Federated Learning with Bidirectional Communication Compression via One-Bit Random Sketching

Jiacheng Cheng, Xu Zhang, Guanghui Qiu, Yifang Zhang, Yinchuan Li, Kaiyuan Feng

Main category: cs.LG

TL;DR: pFed1BS: A personalized federated learning framework using one-bit random sketching for extreme communication compression while handling data heterogeneity.

Details

Motivation: Federated Learning faces challenges of high bidirectional communication overhead and client-side data heterogeneity. Need to reduce communication costs while effectively handling diverse client data distributions.

Method: Proposes pFed1BS framework using one-bit random sketching for extreme communication compression. Clients transmit compressed one-bit sketches, server aggregates and broadcasts global one-bit consensus. Uses sign-based regularizer to align local models with global consensus while preserving local characteristics. Employs Fast Hadamard Transform for efficient projection to reduce computational burden.

Result: Theoretical analysis shows convergence to stationary neighborhood of global potential function. Numerical simulations demonstrate substantial communication cost reduction while achieving competitive performance compared to advanced communication-efficient FL algorithms.

Conclusion: pFed1BS effectively addresses FL communication challenges through extreme compression via one-bit sketching while handling data heterogeneity through personalized learning with sign-based regularization.

Abstract: Federated Learning (FL) enables collaborative training across decentralized data, but faces key challenges of bidirectional communication overhead and client-side data heterogeneity. To address communication costs while embracing data heterogeneity, we propose pFed1BS, a novel personalized federated learning framework that achieves extreme communication compression through one-bit random sketching. In personalized FL, the goal shifts from training a single global model to creating tailored models for each client. In our framework, clients transmit highly compressed one-bit sketches, and the server aggregates and broadcasts a global one-bit consensus. To enable effective personalization, we introduce a sign-based regularizer that guides local models to align with the global consensus while preserving local data characteristics. To mitigate the computational burden of random sketching, we employ the Fast Hadamard Transform for efficient projection. Theoretical analysis guarantees that our algorithm converges to a stationary neighborhood of the global potential function. Numerical simulations demonstrate that pFed1BS substantially reduces communication costs while achieving competitive performance compared to advanced communication-efficient FL algorithms.

[1567] Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Jiajun Guo, Xin Luo, Jiayin Zheng, Yiqun Wang, Kai-Wei Chang, Wei Wang, Jie Liu

Main category: cs.LG

TL;DR: Quantized-TinyLLaVA: A communication-efficient multimodal foundation model using quantization for split learning to reduce communication overhead by 87.5% while maintaining performance and enhancing privacy.

Details

Motivation: Multimodal foundation models trained on sensitive data across domains raise privacy concerns in distributed setups. Split learning addresses privacy but introduces high communication costs from transmitting high-dimensional intermediate features between partitions.

Method: Proposes Quantized-TinyLLaVA with integrated communication-efficient split learning framework using compression module that quantizes intermediate features into discrete representations before transmission. Derives principled quantization strategy based on entropy coding theory to determine optimal discrete representation levels.

Result: Achieves ~87.5% reduction in communication overhead with 2-bit quantization while maintaining performance of original 16-bit model across five benchmark datasets. Compressed representations show enhanced resilience against feature inversion attacks.

Conclusion: Quantized-TinyLLaVA effectively addresses communication overhead in split learning for multimodal foundation models while maintaining performance and enhancing privacy protection against feature inversion attacks.

Abstract: Multimodal foundation models are increasingly trained on sensitive data across domains such as finance, biomedicine, and personal identifiers. However, this distributed setup raises serious privacy concerns due to the need for cross-partition data sharing. Split learning addresses these concerns by enabling collaborative model training without raw data exchange between partitions, yet it introduces a significant challenge: transmitting high-dimensional intermediate feature representations between partitions leads to substantial communication costs. To address this challenge, we propose Quantized-TinyLLaVA, a multimodal foundation model with an integrated communication-efficient split learning framework. Our approach adopts a compression module that quantizes intermediate feature into discrete representations before transmission, substantially reducing communication overhead. Besides, we derive a principled quantization strategy grounded in entropy coding theory to determine the optimal number of discrete representation levels. We deploy our framework in a two-partition setting, with one partition operating as the client and the other as the server, to realistically simulate distributed training. Under this setup, Quantized-TinyLLaVA achieves an approximate \textbf{87.5%} reduction in communication overhead with 2-bit quantization, while maintaining performance of the original 16-bit model across five benchmark datasets. Furthermore, our compressed representations exhibit enhanced resilience against feature inversion attacks, validating the privacy of transmission. The code is available at https://github.com/anonymous-1742/Quantized-TinyLLaVA.

[1568] UMM-RM: An Upcycle-and-Merge MoE Reward Model for Mitigating Reward Hacking

Lingling Fu, Yongfu Xue

Main category: cs.LG

TL;DR: UMM-RM is a novel reward model architecture that uses mixture-of-experts with shared experts to improve robustness against reward hacking in RLHF, then merges experts into a single dense model for efficient inference.

Details

Motivation: Conventional dense reward models in RLHF are vulnerable to exploitation by policy models through biases and spurious correlations, leading to reward hacking where RM scores increase but alignment with human preferences deteriorates, especially under distribution shift.

Method: UMM-RM upscales feed-forward layers of a dense backbone into a mixture-of-experts reward model with shared experts. Shared experts capture instruction-agnostic preferences, while remaining experts model fine-grained preferences. After training, experts are consolidated into a single dense RM via learnable merging weights.

Result: Experiments across multiple base models and preference datasets show UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment compared to standard dense RMs.

Conclusion: UMM-RM retains robustness and exploitation resistance from expert diversity while avoiding inference overhead of MoE architectures or explicit ensembles, providing a practical solution to reward hacking in RLHF.

Abstract: Reward models (RMs) are a critical component of reinforcement learning from human feedback (RLHF). However, conventional dense RMs are susceptible to exploitation by policy models through biases or spurious correlations, resulting in reward hacking: RM scores increase during training while alignment with human preferences deteriorates, a problem that is further exacerbated under distribution shift.To address this issue, we propose UMM-RM (Upcycle-and-Merge MoE Reward Model). UMM-RM first upscales the feed-forward layers of a dense backbone into a mixture-of-experts (MoE) reward model with shared experts. The shared experts are always activated to capture instruction-agnostic preference signals, while the remaining experts model fine-grained preferences across instructions or task regimes. After training, the experts are consolidated into a single dense RM via learnable merging weights.This design retains the robustness and exploitation resistance provided by expert diversity while avoiding the inference overhead of MoE architectures or explicit ensembles. Experiments across multiple base models and preference datasets show that, compared with standard dense RMs, UMM-RM improves accuracy on preference data, reduces reward hacking during PPO training, and yields more stable preference alignment.

[1569] WUSH: Near-Optimal Adaptive Transforms for LLM Quantization

Jiale Chen, Vage Egiazarian, Roberto L. Castro, Torsten Hoefler, Dan Alistarh

Main category: cs.LG

TL;DR: WUSH introduces data-dependent linear transforms for optimal joint weight-activation quantization in LLMs, combining Hadamard backbone with second-moment components to handle outliers and improve low-bit quantization accuracy.

Details

Motivation: Extreme outliers in LLM weights and activations stretch dynamic range and amplify low-bit quantization errors. Existing transform-based methods (like Hadamard rotations) are fixed and data-agnostic, with unclear optimality for quantization.

Method: Derives closed-form optimal linear blockwise transforms for joint weight-activation quantization under RTN AbsMax-scaled block quantizers. Combines Hadamard backbone with data-dependent second-moment component to form non-orthogonal transform that is provably near-optimal for both FP and INT quantizers.

Result: WUSH improves W4A4 accuracy over strongest Hadamard-based baselines (e.g., +2.8 average points on Llama-3.1-8B-Instruct in MXFP4 with RTN, +0.7 with GPTQ) while delivering up to 6.6× per-layer throughput over BF16 via FP4 MatMul.

Conclusion: WUSH provides theoretically grounded, data-dependent transforms for efficient LLM quantization that outperform fixed transform methods, with practical GPU implementation benefits.

Abstract: Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 6.6$\times$ per-layer throughput over BF16 via FP4 MatMul. Source code is available at https://github.com/IST-DASLab/WUSH.

[1570] MINIF2F-DAFNY: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification

Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Sean B. Holden

Main category: cs.LG

TL;DR: LLMs for mathematical theorem proving in auto-active verifier Dafny, with benchmark translation and evaluation showing effective division of labor between LLM guidance and automation.

Details

Motivation: Bridging the gap between interactive theorem provers (requiring detailed low-level steps) and auto-active verifiers (offering automation but focused on software) for mathematical theorem proving using LLMs.

Method: Created MINIF2F-DAFNY benchmark by translating miniF2F to Dafny, evaluated Dafny’s automation alone, then tested 7 off-the-shelf LLMs on remaining problems with modest resources.

Result: Dafny’s automation solved 39-44% of problems with empty proofs; best LLM (Claude Sonnet 4.5) achieved 55.7% success on remaining problems, demonstrating effective division of labor.

Conclusion: LLMs can effectively provide high-level guidance for mathematical theorem proving in auto-active verifiers while automation handles low-level details, creating a productive synergy.

Abstract: LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed low-level proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction–LLMs for mathematical theorem proving in auto-active verifiers–remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny’s automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving 55.7% success with the best model (Claude Sonnet 4.5) using modest resources. These results demonstrate effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .

[1571] In-Context Multi-Operator Learning with DeepOSets

Shao-Ting Chiu, Aditya Nambiar, Ali Syed, Jonathan W. Siegel, Ulisses Braga-Neto

Main category: cs.LG

TL;DR: DeepOSets architecture for multi-operator learning of differential equation solution operators using in-context examples, with theoretical guarantees and fast training.

Details

Motivation: To develop a mathematically rigorous framework for multi-operator learning inspired by in-context learning from large language models, specifically for learning solution operators of differential equations, with theoretical guarantees and efficient training.

Method: Modified DeepOSets architecture for multi-operator learning, where the network learns multiple operators simultaneously and uses in-context examples (input-output pairs) to disambiguate which operator to apply for a given query. The approach includes mathematical formulation of the problem and universality proofs.

Result: DeepOSets successfully learns multiple operators for different initial-value and boundary-value differential equations, accurately predicting solutions for queries and equations not seen during training. Training times are in minutes compared to transformer-based alternatives requiring hours.

Conclusion: DeepOSets provides an architecturally simple, theoretically grounded approach to multi-operator learning for differential equations with fast training and good generalization to unseen operators using in-context examples.

Abstract: An important application of neural networks to scientific computing has been the learning of non-linear operators. In this framework, a neural network is trained to fit a non-linear map between two infinite dimensional spaces, for example, the solution operator of ordinary and partial differential equations. Recently, inspired by the discovery of in-context learning for large language models, an even more ambitious paradigm has been explored, called multi-operator learning. In this approach, a neural network is trained to learn many different operators at the same time. In order to evaluate one of the learned operators, the network is passed example inputs and outputs to disambiguate the desired operator. In this work, we provide a precise mathematical formulation of the multi-operator learning problem. In addition, we modify a simple efficient architecture, called DeepOSets, for multi-operator learning and prove its universality for multi-operator learning. Finally, we provide a comprehensive set of experiments that demonstrate the ability of DeepOSets to learn multiple operators corresponding to different initial-value and boundary-value differential equations and use in-context examples to predict accurately the solutions corresponding to queries and differential equations not seen during training. The main advantage of DeepOSets is its architectural simplicity, which allows the derivation of theoretical guarantees and training times that are in the order of minutes, in contrast to similar transformer-based alternatives that are empirically justified and require hours of training.

[1572] Persistent Multiscale Density-based Clustering

Daniël Bot, Leland McInnes, Jan Aerts

Main category: cs.LG

TL;DR: PLSCAN is a novel density-based clustering algorithm that automatically identifies optimal minimum cluster sizes for HDBSCAN* without requiring hyperparameter tuning, using scale-space clustering principles and persistent homology concepts.

Details

Motivation: Density-based clustering algorithms like DBSCAN and HDBSCAN* require hyperparameter selection (density threshold, minimum cluster size) which is difficult without prior knowledge of data distribution. There's a need for algorithms that can automatically determine appropriate parameters for exploratory data analysis.

Method: PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. It efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters, eliminating the need for manual parameter selection.

Result: PLSCAN achieves higher average Adjusted Rand Index (ARI) than HDBSCAN* on real-world datasets and is less sensitive to changes in the number of mutual reachability neighbors. It has competitive run-times with k-Means on low-dimensional datasets, scaling similarly to HDBSCAN* at higher dimensions.

Conclusion: PLSCAN provides an effective solution for automatic parameter selection in density-based clustering, making it particularly suitable for exploratory data analysis where prior knowledge about data distribution is limited.

Abstract: Clustering is a cornerstone of modern data analysis. Detecting clusters in exploratory data analyses (EDA) requires algorithms that make few assumptions about the data. Density-based clustering algorithms are particularly well-suited for EDA because they describe high-density regions, assuming only that a density exists. Applying density-based clustering algorithms in practice, however, requires selecting appropriate hyperparameters, which is difficult without prior knowledge of the data distribution. For example, DBSCAN requires selecting a density threshold, and HDBSCAN* relies on a minimum cluster size parameter. In this work, we propose Persistent Leaves Spatial Clustering for Applications with Noise (PLSCAN). This novel density-based clustering algorithm efficiently identifies all minimum cluster sizes for which HDBSCAN* produces stable (leaf) clusters. PLSCAN applies scale-space clustering principles and is equivalent to persistent homology on a novel metric space. We compare its performance to HDBSCAN* on several real-world datasets, demonstrating that it achieves a higher average ARI and is less sensitive to changes in the number of mutual reachability neighbours. Additionally, we compare PLSCAN’s computational costs to k-Means, demonstrating competitive run-times on low-dimensional datasets. At higher dimensions, run times scale more similarly to HDBSCAN*.

[1573] On the Convergence Rate of LoRA Gradient Descent

Siqiao Mu, Diego Klabjan

Main category: cs.LG

TL;DR: First non-asymptotic convergence analysis of original LoRA gradient descent without Lipschitz smoothness assumptions, proving O(1/log T) convergence rate to stationary point.

Details

Motivation: LoRA is widely used for efficient fine-tuning of large models but lacks theoretical convergence analysis due to absence of Lipschitz smoothness, with existing work only providing asymptotic results or making unrealistic boundedness assumptions.

Method: Three-step approach: 1) Reformulate problem using outer product of stacked adapter matrices, 2) Develop modified descent lemma for “Lipschitz-like” reparametrized function, 3) Control step size to enable convergence analysis without Lipschitz assumptions.

Result: Proves LoRA gradient descent converges to stationary point at rate O(1/log T), where T is number of iterations, validated with numerical experiments.

Conclusion: Provides first rigorous non-asymptotic convergence guarantee for original LoRA algorithm, addressing theoretical gap and supporting its practical effectiveness for fine-tuning large models.

Abstract: The low-rank adaptation (LoRA) algorithm for fine-tuning large models has grown popular in recent years due to its remarkable performance and low computational requirements. LoRA trains two adapter" matrices that form a low-rank representation of the model parameters, thereby massively reducing the number of parameters that need to be updated at every step. Although LoRA is simple, its convergence is poorly understood due to the lack of Lipschitz smoothness, a key condition for classic convergence analyses. As a result, current theoretical results only consider asymptotic behavior or assume strong boundedness conditions which artificially enforce Lipschitz smoothness. In this work, we provide for the first time a non-asymptotic convergence analysis of the \textit{original LoRA gradient descent} algorithm, which reflects widespread practice, without such assumptions. Our work relies on three key steps: i) reformulating the problem in terms of the outer product of the stacked adapter matrices, ii) a modified descent lemma for the Lipschitz-like" reparametrized function, and iii) controlling the step size. With this approach, we prove that LoRA gradient descent converges to a stationary point at rate $O(\frac{1}{\log T})$, where $T$ is the number of iterations. We conduct numerical experiments to validate our theoretical findings.

[1574] Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen

Main category: cs.LG

TL;DR: REALM is a benchmarking framework for neural surrogates on realistic multiphysics flows, revealing current models’ limitations in handling complex, application-driven scenarios despite good nominal metrics.

Details

Motivation: Current neural surrogate evaluations rely on simplified, low-dimensional proxies that fail to expose models' fragility in realistic multiphysics regimes, creating an "illusion of mastery."

Method: Developed REALM framework with 11 high-fidelity datasets spanning canonical multiphysics to complex propulsion/fire safety scenarios, standardized training/evaluation protocol with multiphysics-aware preprocessing and robust rollout strategy.

Result: Benchmarked over a dozen model families, identifying: 1) scaling barrier governed by dimensionality, stiffness, mesh irregularity; 2) performance controlled by architectural inductive biases rather than parameter count; 3) persistent gap between nominal accuracy and physically trustworthy behavior.

Conclusion: REALM exposes limits of current neural surrogates on realistic multiphysics flows and provides rigorous testbed for developing next-generation physics-aware architectures.

Abstract: Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an “illusion of mastery”, as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models’ inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.

[1575] The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Hao Wang, Zhiqiang Ge, Daoyi Dong, Yong Liu, Qingsong Wen

Main category: cs.LG

TL;DR: The paper presents a theoretical analysis of Expectation of Optimization Bias (EOB) in time series modeling, showing that point-wise loss functions cause bias that worsens with deterministic structure, and proposes a debiasing method using sequence length reduction and structural orthogonalization.

Details

Motivation: Traditional time series models use point-wise loss functions (like MSE) that assume i.i.d. data, ignoring the causal temporal structure. This creates optimization bias that becomes more severe with deterministic and structured time series, motivating a first-principles analysis of this bias.

Method: The paper provides theoretical analysis of Expectation of Optimization Bias (EOB), deriving closed-form quantification for both linear and non-linear systems. It proposes a debiasing program that eliminates bias through sequence length reduction and structural orthogonalization using DFT or DWT, and introduces a harmonized ℓ_p norm framework to address gradient optimization issues in high-variance sequences.

Result: Extensive experiments validate EOB Theory’s generality and show superior performance of the debiasing program, achieving 5.2% and 5.1% average improvement of MSE and MAE respectively when conducted on iTransformer across 11 datasets.

Conclusion: The paper establishes that EOB is an intrinsic data property governed by sequence length and Structural Signal-to-Noise Ratio, and provides a principled debiasing approach that significantly improves time series modeling performance by addressing fundamental optimization biases in point-wise loss functions.

Abstract: Optimizing time series models via point-wise loss functions (e.g., MSE) relying on a heuristic point-wise i.i.d. assumption disregards the causal temporal structure. Focusing on the core independence issue under covariance stationarity, this paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradigm paradox: The more deterministic and structured the time series, the more severe the bias incurred by point-wise loss function. We derive the first closed-form quantification for the non-deterministic EOB across linear and non-linear systems, and prove EOB is an intrinsic data property, governed exclusively by sequence length and the defined Structural Signal-to-Noise Ratio. This theoretical discovery motivates our principled debiasing program that eliminates the bias through sequence length reduction and structural orthogonalization. We present a concrete solution via DFT or DWT, and propose a novel harmonized $\ell_p$ norm framework to rectify gradient optimization pathologies of high-variance sequences. Extensive experiments validate EOB Theory’s generality and the superior performance of debiasing program, achieving 5.2% and 5.1% average improvement of MSE and MAE conducted on the iTransformer across 11 datasets, respectively.

[1576] Analytic and Variational Stability in Deep Learning Systems

Ronald Katende

Main category: cs.LG

TL;DR: A unified analytic and variational framework for stability analysis in deep learning systems, introducing Learning Stability Profile and Fundamental Analytic Stability Theorem to characterize perturbation propagation through representation-parameter dynamics.

Details

Motivation: To develop a comprehensive theoretical framework for analyzing stability in deep learning systems, addressing how perturbations propagate through coupled representation-parameter dynamics across different architectures and optimization methods.

Method: Proposes Learning Stability Profile to measure perturbation propagation, establishes Fundamental Analytic Stability Theorem linking bounded sensitivities to Lyapunov-type energy dissipation, extends framework to non-smooth systems using Clarke generalized derivatives and variational Lyapunov functionals.

Result: Provides unified stability theory covering feedforward networks, residual architectures, stochastic gradient methods, ReLU networks, proximal/projected updates, and stochastic subgradient flows, with explicit stability exponents linking design parameters to contractive behavior.

Conclusion: The framework offers a unified dynamical description of stability across architectures and optimization methods, clarifying how design and training choices jointly control robustness and sensitivity to perturbations in deep learning systems.

Abstract: We propose a unified analytic and variational framework for stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which measures how infinitesimal perturbations propagate through representations, parameters, and update mechanisms along the learning trajectory. Our main result, the Fundamental Analytic Stability Theorem, shows that uniform boundedness of these sensitivities is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy dissipating along the learning flow. In smooth regimes, this yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractive behavior. Classical spectral stability of feedforward networks, CFL-type conditions for residual architectures, and temporal stability laws for stochastic gradient methods follow as direct consequences. The framework extends to non-smooth systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting theory provides a unified dynamical description of stability across architectures and optimization methods, clarifying how design and training choices jointly control robustness and sensitivity to perturbations.

[1577] Multi-Task Learning for Metal Alloy Property Prediction: An Empirical Study of Negative Transfer and Mitigation Strategies

Sungwoo Kang

Main category: cs.LG

TL;DR: MTL in materials science shows mixed results: degrades regression for resistivity/hardness but improves classification for amorphous-forming ability due to mismatched functional forms and gradient conflicts; PCGrad recovers performance, suggesting need for specialized optimization strategies based on physical mechanisms.

Details

Motivation: To challenge the assumption that physically related properties in materials science share learnable representations for multi-task learning, particularly examining how extreme task-level imbalance affects performance across different material properties.

Method: Used a 54,028-sample metal alloy dataset with extreme task imbalance, evaluated MTL performance on regression (resistivity, hardness) vs classification (amorphous-forming ability), analyzed gradient misalignment from mismatched functional forms, and tested Deep Imbalanced Regression techniques including PCGrad (projecting conflicting gradients) and label distribution smoothing with gradient normalization.

Result: MTL significantly degraded regression performance for resistivity and hardness but improved classification recall for amorphous-forming ability; PCGrad recovered single-task performance for regression tasks; combination of label distribution smoothing with gradient normalization achieved best overall balance.

Conclusion: Propose strategic framework: use independent models for high-precision characterization but employ MTL for high-throughput screening where recall is paramount; findings support “materials property clustering” hypothesis that distinct physical mechanisms require specialized optimization strategies to overcome negative transfer.

Abstract: Multi-task learning (MTL) in materials science relies on the assumption that physically related properties share learnable representations. We challenge this assumption using a 54,028-sample metal alloy dataset exhibiting extreme task-level imbalance. Our results reveal a striking dichotomy: MTL significantly degrades regression performance for resistivity and hardness but improves classification recall for amorphous-forming ability. We trace this divergence to mismatched functional forms–such as resistivity’s polynomial dependence versus hardness’s complex interactions–which cause severe gradient misalignment during optimization. Evaluating Deep Imbalanced Regression techniques, we find that projecting conflicting gradients (PCGrad) recovers single-task performance, while combining label distribution smoothing with gradient normalization achieves the best overall balance. Consequently, we propose a strategic framework: utilize independent models for high-precision characterization, but employ MTL for high-throughput screening where recall is paramount. These findings support a “materials property clustering” hypothesis, suggesting that distinct physical mechanisms require specialized optimization strategies to overcome negative transfer.

[1578] When Does Pairing Seeds Reduce Variance? Evidence from a Multi-Agent Economic Simulation

Udit Sharma

Main category: cs.LG

TL;DR: Paper analyzes statistical benefits of using shared random seeds in comparative evaluation of ML systems, showing variance reduction and improved statistical power when systems are evaluated with identical stochastic realizations.

Details

Motivation: Standard ML evaluation treats runs across different systems as independent, failing to exploit shared sources of randomness. This leads to inefficient statistical comparisons and potentially inconclusive results at fixed computational budgets.

Method: The paper proposes using identical random seeds for evaluating competing systems, creating matched stochastic realizations. This induces positive correlation between outcomes at the seed level, leading to variance reduction in comparative evaluation.

Result: Using an extended learning-based multi-agent economic simulator, the authors demonstrate that paired evaluation with shared seeds exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.

Conclusion: Shared random seed evaluation provides strict variance reduction and improved statistical power for comparative ML system evaluation, enabling more efficient and conclusive comparisons at the same computational cost.

Abstract: Machine learning systems appear stochastic but are deterministically random, as seeded pseudorandom number generators produce identical realisations across repeated executions. Standard evaluation practice typically treats runs across alternatives as independent and does not exploit shared sources of randomness. This paper analyses the statistical structure of comparative evaluation under shared random seeds. Under this design, competing systems are evaluated using identical seeds, inducing matched stochastic realisations and yielding strict variance reduction whenever outcomes are positively correlated at the seed level. We demonstrate these effects using an extended learning-based multi-agent economic simulator, where paired evaluation exposes systematic differences in aggregate and distributional outcomes that remain statistically inconclusive under independent evaluation at fixed budgets.

[1579] Imitation from Observations with Trajectory-Level Generative Embeddings

Yongtao Qu, Shangzhe Li, Weitong Zhang

Main category: cs.LG

TL;DR: TGE: Trajectory-level Generative Embedding for offline Learning from Observations using diffusion models to estimate expert state density and create smooth surrogate rewards when expert data is scarce and offline data is suboptimal.

Details

Motivation: Existing offline imitation learning from observations (LfO) methods struggle when expert demonstrations are scarce and offline suboptimal data is far from expert behavior. Distribution-matching approaches impose strict support constraints and rely on brittle one-step models, making it hard to extract useful signal from imperfect data.

Method: Proposes TGE (Trajectory-level Generative Embedding) that constructs dense, smooth surrogate rewards by estimating expert state density in the latent space of a temporal diffusion model trained on offline trajectory data. Leverages smooth geometry of learned diffusion embedding to capture long-horizon temporal dynamics and bridge gaps between disjoint supports.

Result: Empirically, TGE consistently matches or outperforms prior offline LfO methods across a range of D4RL locomotion and manipulation benchmarks.

Conclusion: TGE provides a robust learning signal even when offline data is distributionally distinct from expert demonstrations by using trajectory-level generative embeddings from diffusion models to overcome limitations of traditional distribution-matching approaches.

Abstract: We consider the offline imitation learning from observations (LfO) where the expert demonstrations are scarce and the available offline suboptimal data are far from the expert behavior. Many existing distribution-matching approaches struggle in this regime because they impose strict support constraints and rely on brittle one-step models, making it hard to extract useful signal from imperfect data. To tackle this challenge, we propose TGE, a trajectory-level generative embedding for offline LfO that constructs a dense, smooth surrogate reward by estimating expert state density in the latent space of a temporal diffusion model trained on offline trajectory data. By leveraging the smooth geometry of the learned diffusion embedding, TGE captures long-horizon temporal dynamics and effectively bridges the gap between disjoint supports, ensuring a robust learning signal even when offline data is distributionally distinct from the expert. Empirically, the proposed approach consistently matches or outperforms prior offline LfO methods across a range of D4RL locomotion and manipulation benchmarks.

[1580] Dichotomous Diffusion Policy Optimization

Ruiming Liang, Yinan Zheng, Kexin Zheng, Tianyi Tan, Jianxiong Li, Liyuan Mao, Zhihao Wang, Guang Chen, Hangjun Ye, Jingjing Liu, Jinqiao Wang, Xianyuan Zhan

Main category: cs.LG

TL;DR: DIPOLE is a novel RL algorithm for stable diffusion policy optimization that decomposes optimal policy into dichotomous reward-maximizing and minimizing policies, enabling controllable action generation through linear combination of their scores.

Details

Motivation: Diffusion policies show promise for decision-making tasks but face training challenges in RL: existing methods either have unstable training from direct value maximization or computational issues from Gaussian likelihood approximations requiring many small denoising steps.

Method: Revisits KL-regularized RL objective, formulates greedified policy regularization scheme to decompose optimal policy into dichotomous policies (reward maximization and minimization), enabling stable learning and flexible control through linear combination of scores during inference.

Result: Effective in offline and offline-to-online RL on ExORL and OGBench benchmarks; successfully used to train large vision-language-action model for end-to-end autonomous driving, evaluated on NAVSIM benchmark.

Conclusion: DIPOLE enables stable and controllable diffusion policy optimization, demonstrating potential for complex real-world applications like autonomous driving through vision-language-action models.

Abstract: Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on crude Gaussian likelihood approximation, which requires a large amount of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness.Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.

[1581] Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces

Bryon Tjanaka, Henry Chen, Matthew C. Fontaine, Stefanos Nikolaidis

Main category: cs.LG

TL;DR: DMS (Discount Model Search) is a new QD optimization algorithm that uses a continuous model of discount values to handle high-dimensional measure spaces, enabling applications with image-based measures where users can specify desired measures via image datasets.

Details

Motivation: Current QD algorithms struggle with high-dimensional measure spaces due to distortion problems where many solutions map to similar measures. Existing methods like CMA-MAE use histograms that fail to distinguish between similar solutions in high dimensions, causing stagnation.

Method: DMS replaces histogram-based discounting with a continuous model that provides smooth discount value representations. This allows distinguishing between solutions with similar measures in high-dimensional spaces, enabling continued exploration.

Result: DMS outperforms CMA-MAE and other black-box QD algorithms on high-dimensional benchmarks and enables new applications where measure spaces are high-dimensional image spaces, allowing users to specify measures via image datasets.

Conclusion: DMS addresses key limitations of existing QD algorithms in high-dimensional measure spaces by using continuous discount models, enabling new capabilities for QD optimization in domains like image-based measure specification.

Abstract: Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms are typically limited to low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the state-of-the-art CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and hence receive the same discount value. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new capabilities for QD algorithms by introducing two new domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other existing black-box QD algorithms.

[1582] Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion

Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv

Main category: cs.LG

TL;DR: RL framework for diffusion unlearning using timestep-aware critic with noisy-step rewards for better credit assignment and stability

Details

Motivation: Existing diffusion unlearning methods have limitations: supervised weight edits or global penalties lack flexibility, while RL approaches suffer from high-variance updates and weak credit assignment due to sparse end-of-trajectory rewards

Method: Treats denoising as sequential decision process, trains CLIP-based reward predictor on noisy latents, uses per-step signal to compute advantage estimates for policy-gradient updates of reverse diffusion kernel

Result: Achieves better or comparable forgetting to strong baselines across multiple concepts while maintaining image quality and benign prompt fidelity; per-step critics and noisy-conditioned rewards are key to stability and effectiveness

Conclusion: Proposed RL framework is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones, providing effective solution for diffusion unlearning with better credit assignment

Abstract: Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.

[1583] FibreCastML: An Open Web Platform for Predicting Electrospun Nanofibre Diameter Distributions

Elisa Roldan, Kirstie Andrews, Stephen M. Richardson, Reyhaneh Fatahian, Glen Cooper, Rasool Erfani, Tasneem Sabir, Neil D. Reeves

Main category: cs.LG

TL;DR: FibreCastML is an ML framework that predicts complete fibre diameter distributions (not just means) from electrospinning parameters, enabling better scaffold optimization for biomedical applications.

Details

Motivation: Existing ML approaches for electrospinning only predict mean fibre diameters, missing the full distribution that governs scaffold performance. There's a need for distribution-aware prediction to enable more reproducible and data-driven optimization of electrospun scaffolds.

Method: Curated meta-dataset of 68,538 fibre measurements from 1,778 studies across 16 biomedical polymers. Used 6 standard processing parameters to train 7 ML models with nested cross-validation (leave-one-study-out). Achieved interpretability via variable importance analysis, SHAP, correlation matrices, and 3D parameter maps.

Result: Non-linear models outperformed linear baselines with R² > 0.91 for several polymers. Solution concentration was the dominant driver of fibre diameter distributions. Experimental validation showed close agreement between predicted and measured distributions.

Conclusion: FibreCastML enables more reproducible, data-driven optimization of electrospun scaffold architectures by predicting complete fibre diameter distributions rather than just means.

Abstract: Electrospinning is a scalable technique for producing fibrous scaffolds with tunable micro- and nanoscale architectures for applications in tissue engineering, drug delivery, and wound care. While machine learning (ML) has been used to support electrospinning process optimisation, most existing approaches predict only mean fibre diameters, neglecting the full diameter distribution that governs scaffold performance. This work presents FibreCastML, an open, distribution-aware ML framework that predicts complete fibre diameter spectra from routinely reported electrospinning parameters and provides interpretable insights into process structure relationships. A meta-dataset comprising 68538 individual fibre diameter measurements extracted from 1778 studies across 16 biomedical polymers was curated. Six standard processing parameters, namely solution concentration, applied voltage, flow rate, tip to collector distance, needle diameter, and collector rotation speed, were used to train seven ML models using nested cross validation with leave one study out external folds. Model interpretability was achieved using variable importance analysis, SHapley Additive exPlanations, correlation matrices, and three dimensional parameter maps. Non linear models consistently outperformed linear baselines, achieving coefficients of determination above 0.91 for several widely used polymers. Solution concentration emerged as the dominant global driver of fibre diameter distributions. Experimental validation across different electrospinning systems demonstrated close agreement between predicted and measured distributions. FibreCastML enables more reproducible and data driven optimisation of electrospun scaffold architectures.

[1584] AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

Michael J. Clark

Main category: cs.LG

TL;DR: AntiPaSTO enables scalable steering of large language models using minimal human input by separating representations along antiparallel axes with coherence constraints.

Details

Motivation: As models grow more capable, humans cannot reliably verify their outputs, requiring scalable steering methods that are internal, self-supervised, and transfer out-of-distribution - existing methods don't satisfy all three requirements.

Method: Introduces AntiPaSTO which separates representations along an antiparallel axis (+1/-1 produce opposite shifts) with coherence constraints to prevent collapse. Uses minimal human input: two contrasting words inserted into template sentences without preference labels.

Result: Using 800 word pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x on DailyDilemmas and maintains bidirectional control where prompting triggers refusal.

Conclusion: AntiPaSTO provides a scalable approach for steering language models with minimal human supervision, addressing the challenge of verifying increasingly capable AI systems.

Abstract: As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Human input is minimal: two contrasting words inserted into template sentences, no preference labels. Using 800 such pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x on DailyDilemmas and maintains bidirectional control where prompting triggers refusal.

[1585] Shape-morphing programming of soft materials on complex geometries via neural operator

Lu Chen, Gengxiang Chen, Xu Liu, Jingyan Su, Xuhao Lyu, Lihui Wang, Yingguang Li

Main category: cs.LG

TL;DR: S2NO is a neural operator combining spectral and spatial approaches for high-fidelity shape-morphing prediction on complex geometries, enabling voxel-level material distribution optimization through evolutionary algorithms.

Details

Motivation: While shape-morphing soft materials have potential for applications like conformal implants and aerodynamic morphing, current methods struggle with accurate and diverse morphing designs on complex geometries, requiring new approaches for high-fidelity prediction on irregular domains.

Method: Spectral and Spatial Neural Operator (S2NO) integrates Laplacian eigenfunction encoding to capture global morphing behaviors and spatial convolutions for local behaviors on irregular computational domains. Combined with evolutionary algorithms for voxel-level optimization of material distributions.

Result: Enables shape morphing programming on various complex geometries including irregular-boundary shapes, porous structures, and thin-walled structures. Discretization-invariant property allows super-resolution material distribution design, expanding morphing design diversity and complexity.

Conclusion: S2NO significantly improves efficiency and capability of programming complex shape morphing, advancing the field of shape-morphing material design for practical applications requiring precise control over complex geometries.

Abstract: Shape-morphing soft materials can enable diverse target morphologies through voxel-level material distribution design, offering significant potential for various applications. Despite progress in basic shape-morphing design with simple geometries, achieving advanced applications such as conformal implant deployment or aerodynamic morphing requires accurate and diverse morphing designs on complex geometries, which remains challenging. Here, we present a Spectral and Spatial Neural Operator (S2NO), which enables high-fidelity morphing prediction on complex geometries. S2NO effectively captures global and local morphing behaviours on irregular computational domains by integrating Laplacian eigenfunction encoding and spatial convolutions. Combining S2NO with evolutionary algorithms enables voxel-level optimisation of material distributions for shape morphing programming on various complex geometries, including irregular-boundary shapes, porous structures, and thin-walled structures. Furthermore, the neural operator’s discretisation-invariant property enables super-resolution material distribution design, further expanding the diversity and complexity of morphing design. These advancements significantly improve the efficiency and capability of programming complex shape morphing.

[1586] Weighted Graph Clustering via Scale Contraction and Graph Structure Learning

Haobing Liu, Yinuo Zhang, Tingting Wang, Ruobing Jiang, Yanwei Yu

Main category: cs.LG

TL;DR: A contractile edge-weight-aware graph clustering network that addresses challenges of edge weight utilization through graph contraction and noise-aware attention mechanisms.

Details

Motivation: Most existing graph clustering methods don't fully utilize edge weights, which face two key challenges: increased storage/training time from edge weights, and noise in edge weights that negatively impacts clustering. Few studies jointly optimize clustering and edge weights to mitigate noisy edge impacts.

Method: Proposes a contractile edge-weight-aware graph clustering network with two main components: 1) Cluster-oriented graph contraction module to reduce graph scale while preserving important nodes, and 2) Edge-weight-aware attention network to identify and weaken noisy connections.

Result: Extensive experiments on three real-world weighted graph datasets show the model outperforms the best baseline, demonstrating superior performance. The graph contraction module significantly reduces training time and storage space.

Conclusion: The proposed approach effectively addresses edge weight utilization challenges in graph clustering by jointly optimizing clustering and edge weights, improving performance while reducing computational costs.

Abstract: Graph clustering aims to partition nodes into distinct clusters based on their similarity, thereby revealing relationships among nodes. Nevertheless, most existing methods do not fully utilize these edge weights. Leveraging edge weights in graph clustering tasks faces two critical challenges. (1) The introduction of edge weights may significantly increase storage space and training time, making it essential to reduce the graph scale while preserving nodes that are beneficial for the clustering task. (2) Edge weight information may inherently contain noise that negatively impacts clustering results. However, few studies can jointly optimize clustering and edge weights, which is crucial for mitigating the negative impact of noisy edges on clustering task. To address these challenges, we propose a contractile edge-weight-aware graph clustering network. Specifically, a cluster-oriented graph contraction module is designed to reduce the graph scale while preserving important nodes. An edge-weight-aware attention network is designed to identify and weaken noisy connections. In this way, we can more easily identify and mitigate the impact of noisy edges during the clustering process, thus enhancing clustering effectiveness. We conducted extensive experiments on three real-world weighted graph datasets. In particular, our model outperforms the best baseline, demonstrating its superior performance. Furthermore, experiments also show that the proposed graph contraction module can significantly reduce training time and storage space.

[1587] GPCR-Filter: a deep learning framework for efficient and precise GPCR modulator discovery

Jingjie Ning, Xiangzhen Shen, Li Hou, Shiyi Shen, Jiahao Yang, Junrui Li, Hong Shan, Sanan Wu, Sihan Gao, H. Eric Xu, Xinheng He

Main category: cs.LG

TL;DR: GPCR-Filter is a deep learning framework that combines protein language models and graph neural networks to predict GPCR-ligand interactions, outperforming existing methods and identifying novel agonists.

Details

Motivation: GPCR modulator discovery is challenging due to complex allosteric effects and limitations of conventional assays, requiring more efficient computational approaches for drug development.

Method: Integrates ESM-3 protein language model for GPCR sequence representations with graph neural networks for ligand structures, using attention-based fusion to learn receptor-ligand functional relationships, trained on 90,000+ validated GPCR-ligand pairs.

Result: Outperforms state-of-the-art compound-protein interaction models, generalizes well to unseen receptors/ligands, and successfully identified micromolar-level 5-HT1A receptor agonists with distinct chemical frameworks.

Conclusion: GPCR-Filter establishes a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.

Abstract: G protein-coupled receptors (GPCRs) govern diverse physiological processes and are central to modern pharmacology. Yet discovering GPCR modulators remains challenging because receptor activation often arises from complex allosteric effects rather than direct binding affinity, and conventional assays are slow, costly, and not optimized for capturing these dynamics. Here we present GPCR-Filter, a deep learning framework specifically developed for GPCR modulator discovery. We assembled a high-quality dataset of over 90,000 experimentally validated GPCR-ligand pairs, providing a robust foundation for training and evaluation. GPCR-Filter integrates the ESM-3 protein language model for high-fidelity GPCR sequence representations with graph neural networks that encode ligand structures, coupled through an attention-based fusion mechanism that learns receptor-ligand functional relationships. Across multiple evaluation settings, GPCR-Filter consistently outperforms state-of-the-art compound-protein interaction models and exhibits strong generalization to unseen receptors and ligands. Notably, the model successfully identified micromolar-level agonists of the 5-HT\textsubscript{1A} receptor with distinct chemical frameworks. These results establish GPCR-Filter as a scalable and effective computational approach for GPCR modulator discovery, advancing AI-assisted drug development for complex signaling systems.

[1588] Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching

Fengrui Zuo, Zhiwei Ke, Yiming Liu, Wenqi Lou, Chao Wang, Xuehai Zhou

Main category: cs.LG

TL;DR: Window-Diffusion: A window-based token pruning and caching method for diffusion language models that achieves up to 99× inference speedup by exploiting structural locality in DLM inference.

Details

Motivation: Diffusion language models require full-sequence attention at every iteration during inference, causing substantial redundant computation on masked tokens. Existing block-wise diffusion methods need retraining and constrained update orders, limiting applicability to pretrained DLMs.

Method: Token-level analysis reveals structural locality in DLM inference: decoding is driven by prefix-localized active tokens, distant context influence diminishes rapidly, and decoded tokens show temporal stability. Based on this, Window-Diffusion uses a sliding local computation window that partitions undecoded tokens into active tokens (computed online), buffer tokens (KV states cached and periodically refreshed), and far-field tokens (pruned outside window).

Result: Experiments on LLaDA and Dream show that under matched compute budgets, Window-Diffusion achieves up to 99× inference speedup while largely preserving generation performance.

Conclusion: Window-Diffusion enables efficient inference for diffusion language models by exploiting structural locality, achieving significant speedups without requiring model retraining.

Abstract: Diffusion language models (DLMs) generate text through iterative denoising, but inference requires full-sequence attention at every iteration, resulting in substantial redundant computation on masked tokens. Block-wise diffusion can reduce this cost, yet it typically relies on retraining and constrained update orders, limiting its direct applicability to pretrained DLMs. Our token-level analysis reveals pronounced structural locality in DLM inference. Decoding is driven by a small set of prefix-localized active tokens; the influence of distant undecoded context diminishes rapidly, and decoded tokens exhibit stage-wise temporal stability, enabling reuse of intermediate representations except for a brief post-decode transient. Motivated by these observations, we propose \textbf{\placeholder}\footnote{The source code is available at https://github.com/vhicrgit/Window-Diffusion.}, a window-based token pruning and caching method for inference. We maintain a local computation window that slides rightward as denoising progresses, and partition undecoded tokens into: (i) \textit{active tokens} that are computed online, (ii) \textit{buffer tokens} whose KV states are cached and periodically refreshed, and (iii) \textit{far-field tokens} that are pruned outside the window. Computation is restricted to active and buffer tokens within the window, while far-field tokens are omitted at each stage. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to $99\times$ inference speedup while largely preserving generation performance.

[1589] GraphAllocBench: A Flexible Benchmark for Preference-Conditioned Multi-Objective Policy Learning

Zhiheng Jiang, Yunzhe Wang, Ryan Marr, Ellen Novoseller, Benjamin T. Files, Volkan Ustun

Main category: cs.LG

TL;DR: A new benchmark called GraphAllocBench for Preference-Conditioned Policy Learning in Multi-Objective Reinforcement Learning, featuring graph-based resource allocation tasks inspired by city management.

Details

Motivation: Existing benchmarks for Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) are limited to toy tasks and fixed environments, lacking realism and scalability needed for evaluating complex allocation problems.

Method: Introduces GraphAllocBench built on CityPlannerEnv, a graph-based resource allocation sandbox environment. Includes diverse objective functions, varying preference conditions, and high-dimensional scalability. Proposes two new evaluation metrics: Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS).

Result: GraphAllocBench exposes limitations of existing MORL approaches and demonstrates the potential of graph-based methods like Graph Neural Networks (GNNs) for complex combinatorial allocation tasks. The benchmark is flexible, allowing users to vary objectives, preferences, and allocation rules.

Conclusion: GraphAllocBench establishes a versatile and extensible benchmark for advancing PCPL research, particularly for complex, high-dimensional combinatorial allocation problems using graph-based methods.

Abstract: Preference-Conditioned Policy Learning (PCPL) in Multi-Objective Reinforcement Learning (MORL) aims to approximate diverse Pareto-optimal solutions by conditioning policies on user-specified preferences over objectives. This enables a single model to flexibly adapt to arbitrary trade-offs at run-time by producing a policy on or near the Pareto front. However, existing benchmarks for PCPL are largely restricted to toy tasks and fixed environments, limiting their realism and scalability. To address this gap, we introduce GraphAllocBench, a flexible benchmark built on a novel graph-based resource allocation sandbox environment inspired by city management, which we call CityPlannerEnv. GraphAllocBench provides a rich suite of problems with diverse objective functions, varying preference conditions, and high-dimensional scalability. We also propose two new evaluation metrics – Proportion of Non-Dominated Solutions (PNDS) and Ordering Score (OS) – that directly capture preference consistency while complementing the widely used hypervolume metric. Through experiments with Multi-Layer Perceptrons (MLPs) and graph-aware models, we show that GraphAllocBench exposes the limitations of existing MORL approaches and paves the way for using graph-based methods such as Graph Neural Networks (GNNs) in complex, high-dimensional combinatorial allocation tasks. Beyond its predefined problem set, GraphAllocBench enables users to flexibly vary objectives, preferences, and allocation rules, establishing it as a versatile and extensible benchmark for advancing PCPL. Code: https://github.com/jzh001/GraphAllocBench

[1590] Less is More: Clustered Cross-Covariance Control for Offline RL

Nan Qiao, Sheng Yue, Shuning Wang, Yongheng Deng, Ju Ren

Main category: cs.LG

TL;DR: C^4 method addresses distributional shift in offline RL by mitigating harmful TD cross-covariance effects through buffer partitioning and gradient-based correction.

Details

Motivation: Distributional shift is a fundamental challenge in offline RL, exacerbated by scarce data or datasets dominated by out-of-distribution areas. Standard squared error objectives induce harmful TD cross-covariance that amplifies in OOD areas, biasing optimization and degrading policy learning.

Method: Two complementary strategies: 1) Partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions (C^4); 2) Explicit gradient-based corrective penalty that cancels covariance-induced bias within each update.

Result: Empirical results show higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.

Conclusion: Buffer partitioning preserves the lower bound property of the maximization objective, and these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy constrained offline reinforcement learning.

Abstract: A fundamental challenge in offline reinforcement learning is distributional shift. Scarce data or datasets dominated by out-of-distribution (OOD) areas exacerbate this issue. Our theoretical analysis and experiments show that the standard squared error objective induces a harmful TD cross covariance. This effect amplifies in OOD areas, biasing optimization and degrading policy learning. To counteract this mechanism, we develop two complementary strategies: partitioned buffer sampling that restricts updates to localized replay partitions, attenuates irregular covariance effects, and aligns update directions, yielding a scheme that is easy to integrate with existing implementations, namely Clustered Cross-Covariance Control for TD (C^4). We also introduce an explicit gradient-based corrective penalty that cancels the covariance induced bias within each update. We prove that buffer partitioning preserves the lower bound property of the maximization objective, and that these constraints mitigate excessive conservatism in extreme OOD areas without altering the core behavior of policy constrained offline reinforcement learning. Empirically, our method showcases higher stability and up to 30% improvement in returns over prior methods, especially with small datasets and splits that emphasize OOD areas.

[1591] Accurate Network Traffic Matrix Prediction via LEAD: a Large Language Model-Enhanced Adapter-Based Conditional Diffusion Model

Yu Sun, Yaqiong Liu, Nan Cheng, Jiayuan Li, Zihan Jia, Xialin Du, Mugen Peng

Main category: cs.LG

TL;DR: LEAD: LLM-enhanced diffusion model for network traffic matrix forecasting using traffic-to-image transformation and dual-conditioning strategy

Details

Motivation: Network operations require predictive adaptation under computation/latency constraints, but accurate traffic matrix forecasting is challenging due to stochastic, non-linear, bursty dynamics. Existing models suffer from over-smoothing and lack uncertainty awareness.

Method: 1) Traffic-to-Image paradigm transforms traffic matrices into RGB images for global dependency modeling via vision backbones; 2) Frozen LLM with trainable adapter captures temporal semantics efficiently; 3) Dual-conditioning strategy guides diffusion model to generate complex traffic matrices.

Result: Outperforms all baselines on Abilene and GEANT datasets. On Abilene: 45.2% RMSE reduction, error margin only increases marginally from 0.1098 (1-step) to 0.1134 (20-step). On GEANT: 0.0258 RMSE at 20-step prediction (27.3% lower than best baseline).

Conclusion: LEAD effectively addresses limitations of existing discriminative models for traffic matrix forecasting by combining vision backbones, LLM adapters, and diffusion models with dual-conditioning, achieving superior performance with minimal error accumulation over time.

Abstract: Driven by the evolution toward 6G and AI-native edge intelligence, network operations increasingly require predictive and risk-aware adaptation under stringent computation and latency constraints. Network Traffic Matrix (TM), which characterizes flow volumes between nodes, is a fundamental signal for proactive traffic engineering. However, accurate TM forecasting remains challenging due to the stochastic, non-linear, and bursty nature of network dynamics. Existing discriminative models often suffer from over-smoothing and provide limited uncertainty awareness, leading to poor fidelity under extreme bursts. To address these limitations, we propose LEAD, a Large Language Model (LLM)-Enhanced Adapter-based conditional Diffusion model. First, LEAD adopts a “Traffic-to-Image” paradigm to transform traffic matrices into RGB images, enabling global dependency modeling via vision backbones. Then, we design a “Frozen LLM with Trainable Adapter” model, which efficiently captures temporal semantics with limited computational cost. Moreover, we propose a Dual-Conditioning Strategy to precisely guide a diffusion model to generate complex, dynamic network traffic matrices. Experiments on the Abilene and GEANT datasets demonstrate that LEAD outperforms all baselines. On the Abilene dataset, LEAD attains a remarkable 45.2% reduction in RMSE against the best baseline, with the error margin rising only marginally from 0.1098 at one-step to 0.1134 at 20-step predictions. Meanwhile, on the GEANT dataset, LEAD achieves a 0.0258 RMSE at 20-step prediction horizon which is 27.3% lower than the best baseline.

[1592] Sampling-Free Privacy Accounting for Matrix Mechanisms under Random Allocation

Jan Schuchardt, Nikita Kalinin

Main category: cs.LG

TL;DR: Sampling-free privacy amplification bounds for differentially private matrix factorization using Rényi divergence and conditional composition, improving upon Monte Carlo approaches.

Details

Motivation: Existing Monte Carlo approaches for privacy amplification in differentially private matrix factorization have limitations: guarantees only hold with high probability or require random abstention, and sample requirements are inversely proportional to δ. Need for more reliable, sampling-free bounds.

Method: Develop sampling-free bounds based on Rényi divergence and conditional composition. Use dynamic programming to efficiently compute Rényi divergence bounds. Conditional composition complements by offering stronger guarantees for small ε where Rényi divergence over-approximates. Framework applies to both banded and non-banded matrices.

Result: Demonstrate efficacy through numerical comparisons across broad range of matrix mechanisms used in research and practice. Approach provides more reliable privacy guarantees without sampling limitations.

Conclusion: Proposed sampling-free framework offers improved privacy amplification bounds for differentially private matrix factorization, addressing limitations of Monte Carlo approaches with more efficient computation and stronger guarantees.

Abstract: We study privacy amplification for differentially private model training with matrix factorization under random allocation (also known as the balls-in-bins model). Recent work by Choquette-Choo et al. (2025) proposes a sampling-based Monte Carlo approach to compute amplification parameters in this setting. However, their guarantees either only hold with some high probability or require random abstention by the mechanism. Furthermore, the required number of samples for ensuring $(ε,δ)$-DP is inversely proportional to $δ$. In contrast, we develop sampling-free bounds based on Rényi divergence and conditional composition. The former is facilitated by a dynamic programming formulation to efficiently compute the bounds. The latter complements it by offering stronger privacy guarantees for small $ε$, where Rényi divergence bounds inherently lead to an over-approximation. Our framework applies to arbitrary banded and non-banded matrices. Through numerical comparisons, we demonstrate the efficacy of our approach across a broad range of matrix mechanisms used in research and practice.

[1593] Hardware-Triggered Backdoors

Jonas Möller, Erik Imgrund, Thorsten Eisenhofer, Konrad Rieck

Main category: cs.LG

TL;DR: Hardware-triggered backdoors exploit numerical variations across computing hardware to create attacks where models produce different predictions for the same input on different hardware.

Details

Motivation: Machine learning models are deployed on diverse hardware that can produce small numerical variations during inference. The authors investigate whether these hardware differences can be exploited as a novel attack vector for creating backdoors in ML models.

Method: The approach shapes the model’s decision function to yield different predictions for the same input on different hardware. This is achieved by locally moving the decision boundary close to a target input and refining numerical deviations to flip predictions on selected hardware.

Result: The authors empirically demonstrate that hardware-triggered backdoors can be created reliably across common GPU accelerators, revealing a novel attack vector affecting third-party model usage.

Conclusion: Hardware differences in ML inference create a new security vulnerability. The paper investigates defenses to counter this threat and highlights the risks of using third-party models on different hardware platforms.

Abstract: Machine learning models are routinely deployed on a wide range of computing hardware. Although such hardware is typically expected to produce identical results, differences in its design can lead to small numerical variations during inference. In this work, we show that these variations can be exploited to create backdoors in machine learning models. The core idea is to shape the model’s decision function such that it yields different predictions for the same input when executed on different hardware. This effect is achieved by locally moving the decision boundary close to a target input and then refining numerical deviations to flip the prediction on selected hardware. We empirically demonstrate that these hardware-triggered backdoors can be created reliably across common GPU accelerators. Our findings reveal a novel attack vector affecting the use of third-party models, and we investigate different defenses to counter this threat.

[1594] Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success

Luca Zhou, Bo Zhao, Rose Yu, Emanuele Rodolà

Main category: cs.LG

TL;DR: Model mergeability depends on merging method and partner tasks, not just intrinsic properties; gradient alignment and subspace overlap are key prerequisites

Details

Motivation: Current understanding of model merging treats mergeability as an intrinsic property, but the authors argue it depends on both merging method and partner tasks, requiring better diagnostic tools

Method: Architecture-agnostic framework using linear optimization over interpretable pairwise metrics (gradient L2 distance, etc.) to analyze four merging methods across different tasks

Result: Found substantial variation in success drivers (46.7% metric overlap; 55.3% sign agreement) revealing method-specific “fingerprints”, but subspace overlap and gradient alignment consistently emerge as foundational prerequisites

Conclusion: Mergeability depends on both method and tasks; provides diagnostic foundation for understanding mergeability and motivates fine-tuning strategies that encourage gradient alignment and subspace overlap

Abstract: Model merging combines knowledge from separately fine-tuned models, yet success factors remain poorly understood. While recent work treats mergeability as an intrinsic property, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using linear optimization over a set of interpretable pairwise metrics (e.g., gradient L2 distance), we uncover properties correlating with post-merge performance across four merging methods. We find substantial variation in success drivers (46.7% metric overlap; 55.3% sign agreement), revealing method-specific “fingerprints”. Crucially, however, subspace overlap and gradient alignment metrics consistently emerge as foundational, method-agnostic prerequisites for compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future fine-tuning strategies that explicitly encourage these properties.

[1595] Environment-Conditioned Tail Reweighting for Total Variation Invariant Risk Minimization

Yuanchao Wang, Zhao-Rong Lai, Tianqi Zhong, Fengnan Li

Main category: cs.LG

TL;DR: ECTR combines environment-level invariant learning with sample-level tail reweighting to address both correlation shifts (between environments) and diversity shifts (within environments) for better OOD generalization.

Details

Motivation: Existing invariant risk minimization methods focus on spurious correlations at environment level but ignore sample-level heterogeneity within environments, which can critically impact OOD performance when both correlation and diversity shifts occur simultaneously.

Method: Proposes Environment-Conditioned Tail Reweighting for Total Variation Invariant Risk Minimization (ECTR), which augments TV-based invariant learning with environment-conditioned tail reweighting to jointly address both types of distribution shift. Also extends to scenarios without explicit environment annotations by inferring latent environments through a minimax formulation.

Result: Experiments across regression, tabular, time-series, and image classification benchmarks under mixed distribution shifts demonstrate consistent improvements in both worst-environment and average OOD performance.

Conclusion: ECTR provides a unified framework that makes environment-level invariance and within-environment robustness complementary under mixed distribution shifts, addressing limitations of existing IRM methods.

Abstract: Out-of-distribution (OOD) generalization remains challenging when models simultaneously encounter correlation shifts across environments and diversity shifts driven by rare or hard samples. Existing invariant risk minimization (IRM) methods primarily address spurious correlations at the environment level, but often overlook sample-level heterogeneity within environments, which can critically impact OOD performance. In this work, we propose Environment-Conditioned Tail Reweighting for Total Variation Invariant Risk Minimization (ECTR), a unified framework that augments TV-based invariant learning with environment-conditioned tail reweighting to jointly address both types of distribution shift. By integrating environment-level invariance with within-environment robustness, the proposed approach makes these two mechanisms complementary under mixed distribution shifts. We further extend the framework to scenarios without explicit environment annotations by inferring latent environments through a minimax formulation. Experiments across regression, tabular, time-series, and image classification benchmarks under mixed distribution shifts demonstrate consistent improvements in both worst-environment and average OOD performance.

[1596] CATTO: Balancing Preferences and Confidence in Language Models

Nisarg Parikh, Ananya Sai, Pannaga Shivaswamy, Kunjal Panchal, Andrew Lan

Main category: cs.LG

TL;DR: CATTO is a calibration-aware training objective that improves LLM confidence calibration without sacrificing accuracy, combined with Confidence@k for better token selection.

Details

Motivation: LLMs often have poorly calibrated confidence - high-confidence predictions can be wrong and low-confidence ones correct. Preference-based alignment methods break the link between predictive probability and correctness, exacerbating miscalibration.

Method: Introduces CATTO (Calibration Aware Token-level Training Objective) that aligns predicted confidence with empirical prediction correctness. Can be combined with preference optimization objectives. Also introduces Confidence@k, a test-time scaling mechanism using calibrated token probabilities for Bayes-optimal token selection.

Result: CATTO reduces Expected Calibration Error (ECE) by 2.22%-7.61% in-distribution and 1.46%-10.44% out-of-distribution vs DPO, and by 0.22%-1.24% in-distribution and 1.23%-5.07% out-of-distribution vs strongest DPO baseline. Maintains or slightly improves multiple-choice QA accuracy on five datasets.

Conclusion: CATTO effectively improves LLM confidence calibration without compromising task accuracy, and Confidence@k provides practical benefits for token selection using calibrated probabilities.

Abstract: Large language models (LLMs) often make accurate next token predictions but their confidence in these predictions can be poorly calibrated: high-confidence predictions are frequently wrong, and low-confidence predictions may be correct. This miscalibration is exacerbated by preference-based alignment methods breaking the link between predictive probability and correctness. We introduce a Calibration Aware Token-level Training Objective (CATTO), a calibration-aware objective that aligns predicted confidence with empirical prediction correctness, which can be combined with the original preference optimization objectives. Empirically, CATTO reduces Expected Calibration Error (ECE) by 2.22%-7.61% in-distribution and 1.46%-10.44% out-of-distribution compared to direct preference optimization (DPO), and by 0.22%-1.24% in-distribution and 1.23%-5.07% out-of-distribution compared to the strongest DPO baseline. This improvement in confidence does not come at a cost of losing task accuracy, where CATTO maintains or slightly improves multiple-choice question-answering accuracy on five datasets. We also introduce Confidence@k, a test-time scaling mechanism leveraging calibrated token probabilities for Bayes-optimal selection of output tokens.

[1597] MeshGraphNet-Transformer: Scalable Mesh-based Learned Simulation for Solid Mechanics

Mikel M. Iparraguirre, Iciar Alfaro, David Gonzalez, Elias Cueto

Main category: cs.LG

TL;DR: MGN-T combines Transformers with MeshGraphNets for efficient physics simulation on high-resolution meshes by enabling global modeling while preserving geometric inductive biases.

Details

Motivation: Standard MeshGraphNets suffer from inefficient long-range information propagation on large, high-resolution meshes due to iterative message passing, limiting their scalability for industrial applications.

Method: Proposes MeshGraphNet-Transformer (MGN-T) that integrates a physics-attention Transformer as a global processor with MeshGraphNets’ mesh-based graph representation, updating all nodal states simultaneously while retaining node and edge attributes.

Result: MGN-T successfully handles industrial-scale meshes for impact dynamics where standard MGN fails, accurately models complex phenomena (self-contact, plasticity, multivariate outputs), and outperforms state-of-the-art approaches on classical benchmarks with higher accuracy and fewer parameters.

Conclusion: MGN-T provides an efficient solution for physics simulation on high-resolution meshes by combining global Transformer modeling with geometric inductive biases, enabling industrial-scale applications without hierarchical meshes or deep message-passing stacks.

Abstract: We present MeshGraphNet-Transformer (MGN-T), a novel architecture that combines the global modeling capabilities of Transformers with the geometric inductive bias of MeshGraphNets, while preserving a mesh-based graph representation. MGN-T overcomes a key limitation of standard MGN, the inefficient long-range information propagation caused by iterative message passing on large, high-resolution meshes. A physics-attention Transformer serves as a global processor, updating all nodal states simultaneously while explicitly retaining node and edge attributes. By directly capturing long-range physical interactions, MGN-T eliminates the need for deep message-passing stacks or hierarchical, coarsened meshes, enabling efficient learning on high-resolution meshes with varying geometries, topologies, and boundary conditions at an industrial scale. We demonstrate that MGN-T successfully handles industrial-scale meshes for impact dynamics, a setting in which standard MGN fails due message-passing under-reaching. The method accurately models self-contact, plasticity, and multivariate outputs, including internal, phenomenological plastic variables. Moreover, MGN-T outperforms state-of-the-art approaches on classical benchmarks, achieving higher accuracy while maintaining practical efficiency, using only a fraction of the parameters required by competing baselines.

cs.MA

[1598] Evolving Interpretable Constitutions for Multi-Agent Simulation

Ujwal Kumar, Alice Saito, Hershraj Niranjani, Rayan Yessou, Phan Xuan Tan

Main category: cs.MA

TL;DR: Constitutional Evolution framework uses genetic programming to automatically discover behavioral norms in multi-agent LLM systems, achieving 123% higher societal stability than human-designed constitutions by minimizing communication rather than promoting verbose coordination.

Details

Motivation: Current Constitutional AI focuses on single-model alignment with fixed principles, but multi-agent systems create novel alignment challenges through emergent social dynamics. There's a need to automatically discover effective behavioral norms rather than prescribing them.

Method: Uses grid-world simulation with survival pressure to study individual vs collective welfare. Employs LLM-driven genetic programming with multi-island evolution to evolve constitutions maximizing social welfare without explicit cooperation guidance. Quantifies performance via Societal Stability Score S [0,1] combining productivity, survival, and conflict metrics.

Result: Evolved constitution C* achieves S = 0.556 ± 0.008 (123% higher than human-designed baselines), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. Human-designed constitutions perform poorly: adversarial ones cause collapse (S=0), vague prosocial principles yield S=0.249, and Claude 4.5-designed ones achieve only S=0.332.

Conclusion: Cooperative norms can be discovered rather than prescribed in multi-agent systems. The evolved constitution demonstrates that effective social coordination emerges from simple, interpretable rules that minimize communication rather than promoting extensive social interaction.

Abstract: Constitutional AI has focused on single-model alignment using fixed principles. However, multi-agent systems create novel alignment challenges through emergent social dynamics. We present Constitutional Evolution, a framework for automatically discovering behavioral norms in multi-agent LLM systems. Using a grid-world simulation with survival pressure, we study the tension between individual and collective welfare, quantified via a Societal Stability Score S in [0,1] that combines productivity, survival, and conflict metrics. Adversarial constitutions lead to societal collapse (S= 0), while vague prosocial principles (“be helpful, harmless, honest”) produce inconsistent coordination (S = 0.249). Even constitutions designed by Claude 4.5 Opus with explicit knowledge of the objective achieve only moderate performance (S= 0.332). Using LLM-driven genetic programming with multi-island evolution, we evolve constitutions maximizing social welfare without explicit guidance toward cooperation. The evolved constitution C* achieves S = 0.556 +/- 0.008 (123% higher than human-designed baselines, N = 10), eliminates conflict, and discovers that minimizing communication (0.9% vs 62.2% social actions) outperforms verbose coordination. Our interpretable rules demonstrate that cooperative norms can be discovered rather than prescribed.

[1599] Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning

Xiaoxue Yu, Rongpeng Li, Zhifeng Zhao, Honggang Zhang

Main category: cs.MA

TL;DR: NetGPT: A unified agentic framework for AI-native next-generation wireless networks that enables autonomous reasoning and task delegation to specialized agents through agentic communication and reinforcement learning.

Details

Motivation: Current AI deployments in communication systems are siloed and lack adaptability, dynamic task delegation, and multi-agent collaboration. There's a need for a unified framework that integrates AI natively into next-generation wireless networks for autonomous sensing, reasoning, and action.

Method: Proposes NetGPT framework with a core that can perform autonomous reasoning or delegate tasks to domain-specialized agents via agentic communication. Uses agentic reinforcement learning under partial observability with masked loss against external agent uncertainty, entropy-guided exploration, and multi-objective rewards for task quality, coordination efficiency, and resource constraints.

Result: Provides a foundational architecture and training methodology for self-evolving, AI-native next-generation networks capable of autonomous sensing, reasoning, and action in complex communication environments.

Conclusion: NetGPT enables scalable, distributed intelligence across wireless networks by establishing modular responsibilities and interoperable workflows, learning when and how to collaborate effectively.

Abstract: The evolution of next-Generation (xG) wireless networks marks a paradigm shift from connectivity-centric architectures to Artificial Intelligence (AI)-native designs that tightly integrate data, computing, and communication. Yet existing AI deployments in communication systems remain largely siloed, offering isolated optimizations without intrinsic adaptability, dynamic task delegation, or multi-agent collaboration. In this work, we propose a unified agentic NetGPT framework for AI-native xG networks, wherein a NetGPT core can either perform autonomous reasoning or delegate sub-tasks to domain-specialized agents via agentic communication. The framework establishes clear modular responsibilities and interoperable workflows, enabling scalable, distributed intelligence across the network. To support continual refinement of collaborative reasoning strategies, the framework is further enhanced through Agentic reinforcement learning under partially observable conditions and stochastic external states. The training pipeline incorporates masked loss against external agent uncertainty, entropy-guided exploration, and multi-objective rewards that jointly capture task quality, coordination efficiency, and resource constraints. Through this process, NetGPT learns when and how to collaborate, effectively balancing internal reasoning with agent invocation. Overall, this work provides a foundational architecture and training methodology for self-evolving, AI-native xG networks capable of autonomous sensing, reasoning, and action in complex communication environments.

[1600] Symphony-Coord: Emergent Coordination in Decentralized Agent Systems

Zhaoyang Guan, Huixi Cao, Ming Zhong, Eric Yang, Lynn Ai, Yongxin Ni, Bill Shi

Main category: cs.MA

TL;DR: Symphony-Coord is a decentralized multi-agent framework that transforms agent selection into an online multi-armed bandit problem, enabling emergent roles through dynamic routing based on task context and agent states.

Details

Motivation: Current multi-agent LLM systems rely on static role assignments and centralized controllers, leading to inefficient routing, poor adaptability, and fragile fault recovery as agent pools and task distributions evolve.

Method: Two-stage dynamic beacon protocol: 1) lightweight candidate screening to limit overhead, 2) adaptive LinUCB selector that routes subtasks based on context features from task requirements and agent states, optimized through delayed end-to-end feedback.

Result: Provides sublinear regret bounds under linear realizability assumptions, demonstrating convergence toward near-optimal allocation. Validation shows enhanced task routing efficiency and robust self-healing capabilities in scenarios with distribution shifts and agent failures.

Conclusion: Symphony-Coord achieves scalable coordination without predefined roles, enabling emergent specialization and robust adaptation in dynamic multi-agent LLM systems.

Abstract: Multi-agent large language model systems can tackle complex multi-step tasks by decomposing work and coordinating specialized behaviors. However, current coordination mechanisms typically rely on statically assigned roles and centralized controllers. As agent pools and task distributions evolve, these design choices lead to inefficient routing, poor adaptability, and fragile fault recovery capabilities. We introduce Symphony-Coord, a decentralized multi-agent framework that transforms agent selection into an online multi-armed bandit problem, enabling roles to emerge organically through interaction. The framework employs a two-stage dynamic beacon protocol: (i) a lightweight candidate screening mechanism to limit communication and computational overhead; (ii) an adaptive LinUCB selector that routes subtasks based on context features derived from task requirements and agent states, continuously optimized through delayed end-to-end feedback. Under standard linear realizability assumptions, we provide sublinear regret bounds, indicating the system converges toward near-optimal allocation schemes. Validation through simulation experiments and real-world large language model benchmarks demonstrates that Symphony-Coord not only enhances task routing efficiency but also exhibits robust self-healing capabilities in scenarios involving distribution shifts and agent failures, achieving a scalable coordination mechanism without predefined roles.

[1601] Multi-Agent Teams Hold Experts Back

Aneesh Pappu, Batu El, Hancheng Cao, Carmelo di Nolfo, Yanchao Sun, Meng Cao, James Zou

Main category: cs.MA

TL;DR: Self-organizing multi-agent LLM teams consistently fail to match expert agent performance due to integrative compromise behavior that averages expert and non-expert views rather than appropriately weighting expertise.

Details

Motivation: To study whether self-organizing LLM teams can achieve strong synergy where team performance matches or exceeds the best individual member, given that most prior work enforces coordination through fixed roles, workflows, or aggregation rules rather than allowing coordination to emerge through interaction.

Method: Drawing on organizational psychology, the study examines self-organizing LLM teams across human-inspired and frontier ML benchmarks, analyzing their performance compared to individual expert agents, decomposing failures into expert identification vs. leveraging, and conducting conversational analysis to understand coordination behaviors.

Result: LLM teams consistently fail to match their expert agent’s performance (up to 37.6% performance loss), with expert leveraging rather than identification being the primary bottleneck. Teams show tendency toward integrative compromise - averaging expert and non-expert views rather than appropriately weighting expertise - which increases with team size and correlates negatively with performance, though it improves robustness to adversarial agents.

Conclusion: There’s a significant gap in self-organizing multi-agent LLM teams’ ability to harness collective expertise, revealing a trade-off between alignment/consensus-seeking behavior and effective expertise utilization, unlike human teams which can achieve synergy.

Abstract: Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that – unlike human teams – LLM teams consistently fail to match their expert agent’s performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise – averaging expert and non-expert views rather than appropriately weighting expertise – which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

[1602] A-MapReduce: Executing Wide Search via Agentic MapReduce

Mingju Chen, Guibin Zhang, Heng Chang, Yuchen Guo, Shiji Zhou

Main category: cs.MA

TL;DR: A-MapReduce: A MapReduce-inspired multi-agent framework for efficient wide search tasks using parallel processing and experiential memory

Details

Motivation: Existing LLM-based multi-agent systems are good at deep research tasks but inefficient for wide search tasks that require large-scale, breadth-oriented retrieval. Current frameworks use sequential reasoning that struggles with expansive search objectives and long-horizon execution.

Method: Proposes A-MapReduce framework inspired by MapReduce paradigm, treating wide search as horizontally structured retrieval. Uses task-adaptive decomposition for parallel processing of retrieval targets, structured result aggregation, and experiential memory for query-conditioned task allocation and recomposition evolution.

Result: Achieves state-of-the-art performance on WideSearch and DeepWideSearch benchmarks, with 5.11%-17.50% average Item F1 improvements over strong baselines. Reduces running time by 45.8% compared to representative multi-agent baselines while being cost-effective.

Conclusion: A-MapReduce effectively bridges the gap in wide search tasks by introducing horizontal parallel processing architecture, demonstrating superior performance and efficiency compared to existing sequential multi-agent frameworks.

Abstract: Contemporary large language model (LLM)-based multi-agent systems exhibit systematic advantages in deep research tasks, which emphasize iterative, vertically structured information seeking. However, when confronted with wide search tasks characterized by large-scale, breadth-oriented retrieval, existing agentic frameworks, primarily designed around sequential, vertically structured reasoning, remain stuck in expansive search objectives and inefficient long-horizon execution. To bridge this gap, we propose A-MapReduce, a MapReduce paradigm-inspired multi-agent execution framework that recasts wide search as a horizontally structured retrieval problem. Concretely, A-MapReduce implements parallel processing of massive retrieval targets through task-adaptive decomposition and structured result aggregation. Meanwhile, it leverages experiential memory to drive the continual evolution of query-conditioned task allocation and recomposition, enabling progressive improvement in large-scale wide-search regimes. Extensive experiments on five agentic benchmarks demonstrate that A-MapReduce is (i) high-performing, achieving state-of-the-art performance on WideSearch and DeepWideSearch, and delivering 5.11% - 17.50% average Item F1 improvements compared with strong baselines with OpenAI o3 or Gemini 2.5 Pro backbones; (ii) cost-effective and efficient, delivering superior cost-performance trade-offs and reducing running time by 45.8% compared to representative multi-agent baselines. The code is available at https://github.com/mingju-c/AMapReduce.

[1603] Evidence-Decision-Feedback: Theory-Driven Adaptive Scaffolding for LLM Agents

Clayton Cohn, Siyuan Guo, Surya Rayala, Hanchen David Wang, Naveeduddin Mohammed, Umesh Timalsina, Shruti Jain, Angela Eeds, Menton Deweese, Pamela J. Osborn Popp, Rebekah Stanton, Shakeera Walker, Meiyi Ma, Gautam Biswas

Main category: cs.MA

TL;DR: EDF framework for adaptive scaffolding using LLMs in educational multi-agent systems, instantiated as Copa agent for STEM+C problem-solving

Details

Motivation: Current multi-agent LLM architectures for pedagogical agents often use "one-size-fits-all" approaches that lack personalization, limiting their ability to provide adaptive support for students' knowledge construction and critical thinking development.

Method: Introduces Evidence-Decision-Feedback (EDF) framework integrating intelligent tutoring systems and agentic behavior, organizing interactions around evidentiary inference, pedagogical decision-making, and adaptive feedback. Instantiated through Copa, an agentic collaborative peer agent for STEM+C problem-solving.

Result: In authentic high school classroom study, EDF-aligned interactions: 1) align feedback with students’ demonstrated understanding and task mastery; 2) promote gradual scaffold fading; 3) support interpretable, evidence-grounded explanations without fostering overreliance.

Conclusion: EDF framework provides effective adaptive scaffolding for educational multi-agent LLM systems, enabling personalized support that respects students’ individual learning trajectories while maintaining interpretability and avoiding dependency.

Abstract: Multi-agent LLM architectures offer opportunities for pedagogical agents to help students construct domain knowledge and develop critical-thinking skills, yet many operate on a “one-size-fits-all” basis, limiting their ability to provide personalized support. To address this, we introduce Evidence-Decision-Feedback (EDF), a theoretical framework for adaptive scaffolding using LLMs. EDF integrates elements of intelligent tutoring systems and agentic behavior by organizing interactions around evidentiary inference, pedagogical decision-making, and adaptive feedback. We instantiate EDF through Copa, an agentic collaborative peer agent for STEM+C problem-solving. In an authentic high school classroom study, we show that EDF-aligned interactions align feedback with students’ demonstrated understanding and task mastery; promote gradual scaffold fading; and support interpretable, evidence-grounded explanations without fostering overreliance.

[1604] TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

Hayeong Lee, JunHyeok Oh, Byung-Jun Lee

Main category: cs.MA

TL;DR: TABX is a high-throughput JAX-based sandbox for reconfigurable multi-agent reinforcement learning environments with granular parameter control and GPU acceleration.

Details

Motivation: Existing MARL benchmarks lack modularity for custom evaluation scenarios, limiting systematic investigation of emergent agent behaviors and algorithmic trade-offs across diverse task complexities.

Method: Developed TABX (Totally Accelerated Battle Simulator in JAX) as a high-throughput sandbox with granular environmental parameter control, leveraging JAX for hardware-accelerated GPU execution to enable massive parallelization.

Result: TABX provides a fast, extensible, and easily customized framework that significantly reduces computational overhead while facilitating study of MARL agents in complex structured domains.

Conclusion: TABX serves as a scalable foundation for future MARL research by enabling systematic investigation of agent behaviors and algorithmic trade-offs through reconfigurable multi-agent tasks.

Abstract: The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://anonymous.4open.science/r/TABX-00CA.

[1605] Self-Evolving Coordination Protocol in Multi-Agent AI Systems: An Exploratory Systems Feasibility Study

Jose Manuel de la Chica Rodriguez, Juan Manuel Vera Díaz

Main category: cs.MA

TL;DR: Self-Evolving Coordination Protocols (SECP) enable limited, externally validated self-modification of coordination protocols while preserving formal invariants in multi-agent systems, demonstrated through Byzantine consensus protocol evaluation.

Details

Motivation: In safety-critical domains like finance, multi-agent systems need coordination mechanisms that satisfy strict formal requirements, remain auditable, and operate within bounded limits. Current systems lack the ability to evolve while maintaining these constraints.

Method: Study compares four coordination regimes: unanimous hard veto, weighted scalar aggregation, SECP v1.0 (agent-designed non-scalar protocol), and SECP v2.0 (result of governed modification). Six Byzantine consensus protocol proposals evaluated by six specialized decision modules under identical hard constraints.

Result: A single recursive modification increased proposal coverage from two to three accepted proposals while preserving all declared invariants. Demonstrates bounded self-modification is technically implementable, auditable, and analyzable under explicit formal constraints.

Conclusion: Bounded self-modification of coordination protocols is feasible and establishes a foundation for governed multi-agent systems in safety-critical domains, though study makes no claims about statistical significance, optimality, or learning.

Abstract: Contemporary multi-agent systems increasingly rely on internal coordination mechanisms to combine, arbitrate, or constrain the outputs of heterogeneous components. In safety-critical and regulated domains such as finance, these mechanisms must satisfy strict formal requirements, remain auditable, and operate within explicitly bounded limits. Coordination logic therefore functions as a governance layer rather than an optimization heuristic. This paper presents an exploratory systems feasibility study of Self-Evolving Coordination Protocols (SECP): coordination protocols that permit limited, externally validated self-modification while preserving fixed formal invariants. We study a controlled proof-of-concept setting in which six fixed Byzantine consensus protocol proposals are evaluated by six specialized decision modules. All coordination regimes operate under identical hard constraints, including Byzantine fault tolerance (f < n/3), O(n2) message complexity, complete non-statistical safety and liveness arguments, and bounded explainability. Four coordination regimes are compared in a single-shot design: unanimous hard veto, weighted scalar aggregation, SECP v1.0 (an agent-designed non-scalar protocol), and SECP v2.0 (the result of one governed modification). Outcomes are evaluated using a single metric, proposal coverage, defined as the number of proposals accepted. A single recursive modification increased coverage from two to three accepted proposals while preserving all declared invariants. The study makes no claims regarding statistical significance, optimality, convergence, or learning. Its contribution is architectural: it demonstrates that bounded self-modification of coordination protocols is technically implementable, auditable, and analyzable under explicit formal constraints, establishing a foundation for governed multi-agent systems.

[1606] Normative Feeling: Socially Patterned Affective Mechanisms

Stavros Anagnou, Daniel Polani, Christoph Salge

Main category: cs.MA

TL;DR: Evolutionary model shows how normative processes (punishment) versus competition lead to different mood mechanisms in resource dilemmas, with normative conditions evolving mood as implicit social signals for resource conservation.

Details

Motivation: To understand how the coupling between norm violations and emotional consequences evolved, and how normative processes might have shaped even ancient capacities like mood evolution.

Method: Agent-based model with evolvable affect in a shared resource dilemma, comparing competition (non-normative) versus punishment (normative) conditions to see what mood mechanisms emerge.

Result: Different mood mechanisms emerged: competition evolved “bad mood -> consume more” leading to tragedy of the commons, while punishment evolved “bad mood -> consume less” where negative affect functions as implicit social sanction signal.

Conclusion: Normative processes enable social preferences to emerge in distributed psychological mechanisms, reprogramming cognitive/physiological systems by embedding cultural patterns into psychological dispositions.

Abstract: Breaking a norm elicits both material and emotional consequences, yet how this coupling arose evolutionarily remains unclear. We investigate this question in light of emerging work suggesting that normativity’s building blocks emerged earlier in evolution than previously considered, arguing that normative processes should inform accounts of how even ancient capacities such as mood evolved. Using a definition of normative processes we developed, we created an agent-based model with evolvable affect in a shared resource dilemma, comparing competition (non-normative) versus punishment (normative) conditions. Critically, different mood mechanisms emerge under each condition. Under competition, agents evolve a “bad mood -> consume more” response, creating a tragedy of the commons leading to resource depletion and population collapse. Under punishment, agents evolve a “bad mood -> consume less” mechanism, where negative affect functions as an implicit signal of social sanction, promoting resource conservation. Importantly, once normative logic is imprinted through punishment, it creates an evolutionary pathway for mood-based signalling that operates without costly physical enforcement. Our findings demonstrate how normative processes enable social preferences to emerge in a distributed manner within psychological mechanisms, showing how normative processes reprogram cognitive and physiological systems by embedding cultural patterns into psychological dispositions.

[1607] BMG-Q: Localized Bipartite Match Graph Attention Q-Learning for Ride-Pooling Order Dispatch

Yulong Hu, Siyuan Feng, Sen Li

Main category: cs.MA

TL;DR: BMG-Q is a novel MARL algorithm for ride-pooling order dispatch using localized bipartite match graphs and Graph Attention DQN with ILP optimization.

Details

Motivation: To improve ride-pooling order dispatch by developing a scalable MARL framework that captures dynamic vehicle interactions and reduces overestimation bias.

Method: Uses localized bipartite match graphs to model MDP, develops GATDDQN (Graph Attention Double Deep Q Network) as MARL backbone, combines with ILP for global coordination, and incorporates gradient clipping and posterior score functions.

Result: Outperforms benchmark RL frameworks by ~10% in cumulative rewards, reduces overestimation bias by over 50%, and maintains robustness across task variations and fleet sizes.

Conclusion: BMG-Q is an effective, scalable, and robust framework for ride-pooling order dispatch that advances MARL applications in transportation systems.

Abstract: This paper introduces Localized Bipartite Match Graph Attention Q-Learning (BMG-Q), a novel Multi-Agent Reinforcement Learning (MARL) algorithm framework tailored for ride-pooling order dispatch. BMG-Q advances ride-pooling decision-making process with the localized bipartite match graph underlying the Markov Decision Process, enabling the development of novel Graph Attention Double Deep Q Network (GATDDQN) as the MARL backbone to capture the dynamic interactions among ride-pooling vehicles in fleet. Our approach enriches the state information for each agent with GATDDQN by leveraging a localized bipartite interdependence graph and enables a centralized global coordinator to optimize order matching and agent behavior using Integer Linear Programming (ILP). Enhanced by gradient clipping and localized graph sampling, our GATDDQN improves scalability and robustness. Furthermore, the inclusion of a posterior score function in the ILP captures the online exploration-exploitation trade-off and reduces the potential overestimation bias of agents, thereby elevating the quality of the derived solutions. Through extensive experiments and validation, BMG-Q has demonstrated superior performance in both training and operations for thousands of vehicle agents, outperforming benchmark reinforcement learning frameworks by around 10% in accumulative rewards and showing a significant reduction in overestimation bias by over 50%. Additionally, it maintains robustness amidst task variations and fleet size changes, establishing BMG-Q as an effective, scalable, and robust framework for advancing ride-pooling order dispatch operations.

[1608] Replicating the behaviour of electric vehicle drivers using an agent-based reinforcement learning model

Zixin Feng, Qunshan Zhao, Alison Heppenstall

Main category: cs.MA

TL;DR: A multi-stage reinforcement learning framework for simulating private EV driver charging behavior at national scale, validated against real data to identify charging deserts and inform policy.

Details

Motivation: Existing EV charging network simulations use static behavioral rules that fail to capture adaptive human behaviors, while reinforcement learning approaches have focused on fleet optimization rather than private drivers making independent decisions.

Method: Proposes a multi-stage reinforcement learning framework to simulate private EV driver charging demand across national-scale road networks, validated against real-world data to identify the most realistic training stage.

Result: Model successfully captures adaptive behaviors and bounded rationality of private drivers, identifies critical ‘charging deserts’ where drivers have consistently low state of charge, and highlights policy needs for rapid charging hubs along motorways and city boundaries.

Conclusion: The framework provides realistic simulation of private EV driver behavior, reveals infrastructure gaps, and supports policy decisions for expanding charging networks to meet long-distance trip demands.

Abstract: Despite the rapid expansion of electric vehicle (EV) charging networks, questions remain about their efficiency in meeting the growing needs of EV drivers. Previous simulation-based approaches, which rely on static behavioural rules, have struggled to capture the adaptive behaviours of human drivers. Although reinforcement learning has been introduced in EV simulation studies, its application has primarily focused on optimising fleet operations rather than modelling private drivers who make independent charging decisions. To address the gap, we propose a multi-stage reinforcement learning framework that simulates charging demand of private EV drivers across a national-scale road network. We validate the model against real-world data and identify the training stage that most closely reflects actual driver behaviour, which captures both the adaptive behaviours and bounded rationality of private drivers. Based on the simulation results, we also identify critical ‘charging deserts’ where EV drivers consistently have low state of charge. Our findings also highlight recent policy shifts toward expanding rapid charging hubs along motorway corridors and city boundaries to meet the demand from long-distance trips.

[1609] LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science

Alireza Salemi, Mihir Parmar, Palash Goyal, Yiwen Song, Jinsung Yoon, Hamed Zamani, Tomas Pfister, Hamid Palangi

Main category: cs.MA

TL;DR: A novel multi-agent blackboard architecture for scalable data discovery in large data lakes, outperforming existing methods by 13-57%.

Details

Motivation: Large language models face deployment challenges due to difficulty finding relevant data in large, heterogeneous data lakes. Existing multi-agent systems struggle with scalability and require central controllers with precise knowledge of sub-agent capabilities, which is impractical in large-scale settings.

Method: Proposes a blackboard-inspired multi-agent paradigm where a central agent posts requests to a shared blackboard, and autonomous subordinate agents (responsible for data partitions or web retrieval) volunteer to respond based on their capabilities, eliminating need for central coordination of agent expertise.

Result: The blackboard architecture substantially outperforms strong baselines on three benchmarks (KramaBench, modified DSBench, DA-Code), achieving 13%-57% relative improvements in end-to-end success and up to 9% relative gain in data discovery F1 over best baseline.

Conclusion: The blackboard architecture provides a scalable and flexible solution for data discovery in large data lakes, addressing limitations of existing multi-agent systems by removing the need for central coordination of agent capabilities.

Abstract: Advances in large language models (LLMs) have created new opportunities in data science, but their deployment is often limited by the challenge of finding relevant data in large data lakes. Existing methods struggle with this: both single- and multi-agent systems are quickly overwhelmed by large, heterogeneous files, and master-slave multi-agent systems rely on a rigid central controller that requires precise knowledge of each sub-agent’s capabilities, which is not possible in large-scale settings where the main agent lacks full observability over sub-agents’ knowledge and competencies. We propose a novel multi-agent paradigm inspired by the blackboard architecture for traditional AI models. In our framework, a central agent posts requests to a shared blackboard, and autonomous subordinate agents - either responsible for a partition of the data lake or retrieval from the web - volunteer to respond based on their capabilities. This design improves scalability and flexibility by removing the need for a central coordinator to know each agent’s expertise or internal knowledge. We evaluate the approach on three benchmarks that require data discovery: KramaBench and modified versions of DSBench and DA-Code. Results show that the blackboard architecture substantially outperforms strong baselines, achieving 13%-57% relative improvements in end-to-end success and up to a 9% relative gain in data discovery F1 over the best baseline.

[1610] Introduction to Automated Negotiation

Dave de Jonge

Main category: cs.MA

TL;DR: Introductory textbook on automated negotiation for CS students with no prior knowledge, includes Python framework for implementing negotiation algorithms

Details

Motivation: To provide an accessible introduction to automated negotiation for computer science students who are new to the field, requiring only basic math and programming skills

Method: Textbook approach with educational content and a simple Python-based negotiation framework that allows students to implement their own algorithms and conduct experiments

Result: A complete educational package including theoretical foundations and practical tools for learning automated negotiation through hands-on implementation

Conclusion: Provides a comprehensive starting point for students to learn automated negotiation with practical implementation experience through the included Python framework

Abstract: This book is an introductory textbook targeted towards computer science students who are completely new to the topic of automated negotiation. It does not require any prerequisite knowledge, except for elementary mathematics and basic programming skills. This book comes with an simple toy-world negotiation framework implemented in Python that can be used by the readers to implement their own negotiation algorithms and perform experiments with them. This framework is small and simple enough that any reader who does not like to work in Python should be able to re-implement it very quickly in any other programming language of their choice.

[1611] MAS-Shield: A Defense Framework for Secure and Efficient LLM MAS

Kaixiang Wang, Zhaojiacheng Zhou, Bunyod Suvonov, Jiong Lou, Jie LI

Main category: cs.MA

TL;DR: MAS-Shield: A hierarchical defense framework for LLM-based multi-agent systems that uses coarse-to-fine filtering to protect against linguistic attacks while maintaining efficiency.

Details

Motivation: LLM-based Multi-Agent Systems are vulnerable to linguistic attacks that can cause cascading failures. Existing defenses face a dilemma: single-auditor methods are prone to single points of failure, while committee-based approaches are computationally expensive for multi-turn interactions.

Method: Three-stage coarse-to-fine filtering pipeline: 1) Critical Agent Selection targets high-influence nodes, 2) Light Auditing uses lightweight sentry models to filter benign cases, 3) Global Consensus Auditing escalates suspicious signals to heavyweight committee for arbitration.

Result: Achieves 92.5% recovery rate against diverse adversarial scenarios and reduces defense latency by over 70% compared to existing methods.

Conclusion: MAS-Shield effectively optimizes the security-efficiency trade-off for defending LLM-based multi-agent systems against linguistic attacks.

Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) are susceptible to linguistic attacks that can trigger cascading failures across the network. Existing defenses face a fundamental dilemma: lightweight single-auditor methods are prone to single points of failure, while robust committee-based approaches incur prohibitive computational costs in multi-turn interactions. To address this challenge, we propose \textbf{MAS-Shield}, a secure and efficient defense framework designed with a coarse-to-fine filtering pipeline. Rather than applying uniform scrutiny, MAS-Shield dynamically allocates defense resources through a three-stage protocol: (1) \textbf{Critical Agent Selection } strategically targets high-influence nodes to narrow the defense surface; (2) \textbf{Light Auditing} employs lightweight sentry models to rapidly filter the majority of benign cases; and (3) \textbf{Global Consensus Auditing} escalates only suspicious or ambiguous signals to a heavyweight committee for definitive arbitration. This hierarchical design effectively optimizes the security-efficiency trade-off. Experiments demonstrate that MAS-Shield achieves a 92.5% recovery rate against diverse adversarial scenarios and reduces defense latency by over 70% compared to existing methods.

[1612] When Personas Override Payoffs: Role Identity Bias in Multi-Agent LLM Decision-Making

Viswonathan Manoranjan, Snehalkumar `Neil’ S. Gaikwad

Main category: cs.MA

TL;DR: LLM multi-agent systems show that role personas bias strategic reasoning toward socially preferred outcomes rather than payoff optimization, with Qwen models being highly sensitive to these design choices while Llama/Mistral show rigid behavior.

Details

Motivation: To understand how design choices like role-based personas and payoff visibility affect LLM reasoning in multi-agent systems, particularly whether they function as strategic reasoners optimizing payoffs or as identity-driven actors prioritizing role alignment.

Method: Systematic experiments across four LLM architectures (Qwen-7B, Qwen-32B, Llama-8B, Mistral-7B) in complex environmental decision-making games with four agents, using Nash equilibrium achievement as a diagnostic for strategic reasoning.

Result: Role identity bias fundamentally alters strategic reasoning even when payoff-optimal equilibria exist. With personas present, all achieved equilibria correspond to Green Transition, while models fail to reach equilibrium when Tragedy of the Commons is payoff-optimal. Qwen models are highly sensitive to personas and payoff visibility, while Llama and Mistral show rigid reasoning behavior.

Conclusion: Representational choices (personas and payoff visibility) are substantive governance decisions that determine whether multi-agent systems act as strategic reasoners or identity-driven actors, with important implications for real-world deployment.

Abstract: Large language models are increasingly deployed in multi-agent systems for strategic tasks, yet how design choices such as role-based personas and payoff visibility affect reasoning remains poorly understood. We investigate whether multi-agent systems function as strategic reasoners capable of payoff optimization or as identity-driven actors that prioritize role alignment over explicit incentives. Using Nash equilibrium achievement as a diagnostic for strategic reasoning, we conduct systematic experiments across four LLM architectures (Qwen-7B, Qwen-32B, Llama-8B, Mistral-7B) in complex environmental decision-making games involving four agents. We show that role identity bias fundamentally alters strategic reasoning even when payoff-optimal equilibria exist and complete payoff information is available. Removing personas and providing explicit payoffs enables Qwen models to achieve high Nash equilibrium rates, indicating that both conditions are necessary for strategic reasoning. In contrast, personas systematically bias equilibrium selection toward socially preferred outcomes: with personas present, all of the achieved equilibria correspond to Green Transition, while models entirely fail to reach equilibrium when Tragedy of the Commons is payoff-optimal. The effect of explicit payoffs depends entirely on persona presence, revealing strong interactions between representational design choices. We also observe clear model-dependent patterns. Qwen architectures are highly sensitive to both personas and payoff visibility, whereas Llama and Mistral exhibit rigid reasoning behavior across conditions. These findings demonstrate that representational choices are substantive governance decisions that determine whether multi-agent systems act as strategic reasoners or identity-driven actors, with important implications for real-world deployment.

cs.MM

Qingcao Li, Miao He, Liang Yi, Qing Wen, Yitao Zhang, Hongshuo Jin, Peng Cheng, Zhongjie Ba, Li Lu, Kui Ren

Main category: cs.MM

TL;DR: Two-stage multimodal system for video deepfake detection combining audio and visual analysis with score fusion strategies

Details

Motivation: To develop an effective system for detecting fake audio-visual content (video deepfakes) by leveraging multimodal information from both audio and visual streams

Method: Two-stage framework with unimodal detection and multimodal score fusion: audio deepfake detection + localization modules, image-based deepfake detection + localization modules, and multimodal score fusion strategies

Result: Achieved AUC of 0.87, AP of 0.55, AR of 0.23 on challenge test set with final score of 0.5528

Conclusion: Multimodal fusion approach combining audio and visual analysis effectively detects and localizes manipulated segments in deepfake videos

Abstract: This paper presents a system for detecting fake audio-visual content (i.e., video deepfake), developed for Track 2 of the DDL Challenge. The proposed system employs a two-stage framework, comprising unimodal detection and multimodal score fusion. Specifically, it incorporates an audio deepfake detection module and an audio localization module to analyze and pinpoint manipulated segments in the audio stream. In parallel, an image-based deepfake detection and localization module is employed to process the visual modality. To effectively leverage complementary information across different modalities, we further propose a multimodal score fusion strategy that integrates the outputs from both audio and visual modules. Guided by a detailed analysis of the training and evaluation dataset, we explore and evaluate several score calculation and fusion strategies to improve system robustness. Overall, the final fusion-based system achieves an AUC of 0.87, an AP of 0.55, and an AR of 0.23 on the challenge test set, resulting in a final score of 0.5528.

[1614] MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation

Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, Jingyun Liao, Yi-Ming Cheng, Xuefeng Chen, Xian-Ling Mao, Yousheng Feng

Main category: cs.MM

TL;DR: MTAVG-Bench: A benchmark for evaluating audio-visual multi-speaker dialogue generation in T2AV models, addressing gaps in existing evaluation for multi-talker settings.

Details

Conclusion: MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted video generation refinement in multi-speaker dialogue generation.

Mohamed Saleh, Zahra Ahmadi

Main category: cs.MM

Details

[1616] Seeing, Hearing, and Knowing Together: Multimodal Strategies in Deepfake Videos Detection

Chen Chen, Dion Hoe-Lian Goh

Main category: cs.MM

TL;DR: Study examines human strategies for detecting deepfake videos, finding multimodal cue combinations (visual, audio, intuition) are most effective for successful identification.

Details

Motivation: As deepfake videos become increasingly sophisticated and difficult to recognize, understanding human detection strategies is crucial for designing effective media literacy interventions and helping people become more resilient to deceptive digital media.

Method: Conducted study with 195 participants (ages 21-40) who judged real and deepfake videos, rated confidence, and reported cues used across visual, audio, and knowledge strategies. Used association rule mining to identify cue combinations that shaped performance.

Result: Participants were more accurate with real videos than deepfakes and showed lower expected calibration error for real content. Visual appearance, vocal cues, and intuition often co-occurred for successful identifications, highlighting importance of multimodal approaches in human detection.

Conclusion: Findings reveal which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use, helping people improve identification skills and become more resilient to deceptive digital media.

Abstract: As deepfake videos become increasingly difficult for people to recognise, understanding the strategies humans use is key to designing effective media literacy interventions. We conducted a study with 195 participants between the ages of 21 and 40, who judged real and deepfake videos, rated their confidence, and reported the cues they relied on across visual, audio, and knowledge strategies. Participants were more accurate with real videos than with deepfakes and showed lower expected calibration error for real content. Through association rule mining, we identified cue combinations that shaped performance. Visual appearance, vocal, and intuition often co-occurred for successful identifications, which highlights the importance of multimodal approaches in human detection. Our findings show which cues help or hinder detection and suggest directions for designing media literacy tools that guide effective cue use. Building on these insights can help people improve their identification skills and become more resilient to deceptive digital media.

[1617] Mixture of Disentangled Experts with Missing Modalities for Robust Multimodal Sentiment Analysis

Xiang Li, Xiaoming Zhang, Dezhuang Miao, Xianfu Cheng, Dawei Li, Honggui Han, Zhoujun Li

Main category: cs.MM

TL;DR: DERL: A disentangled expert representation learning framework for robust multimodal sentiment analysis that handles missing/corrupted data through hybrid experts, orthogonal representation spaces, and multi-level reconstruction.

Details

Motivation: Real-world noise often leads to missing or corrupted data in multimodal sentiment analysis, and existing feature-disentangled methods struggle with internal variations of heterogeneous information under uncertain missingness, making it difficult to learn effective multimodal representations from degraded modalities.

Method: DERL employs hybrid experts to adaptively disentangle multimodal inputs into orthogonal private and shared representation spaces, uses multi-level reconstruction for collaborative supervision to enhance representation expressiveness and robustness, and uses disentangled features as modality experts to generate importance-aware fusion results.

Result: Extensive experiments on two MSA benchmarks show DERL outperforms state-of-the-art methods under various missing-modality conditions, achieving improvements of 2.47% in Acc-2 and 2.25% in MAE on MOSI under intra-modal missingness.

Conclusion: DERL provides an effective framework for robust multimodal sentiment analysis that handles missing/corrupted data through disentangled expert representation learning and multi-level reconstruction supervision.

Abstract: Multimodal Sentiment Analysis (MSA) integrates multiple modalities to infer human sentiment, but real-world noise often leads to missing or corrupted data. However, existing feature-disentangled methods struggle to handle the internal variations of heterogeneous information under uncertain missingness, making it difficult to learn effective multimodal representations from degraded modalities. To address this issue, we propose DERL, a Disentangled Expert Representation Learning framework for robust MSA. Specifically, DERL employs hybrid experts to adaptively disentangle multimodal inputs into orthogonal private and shared representation spaces. A multi-level reconstruction strategy is further developed to provide collaborative supervision, enhancing both the expressiveness and robustness of the learned representations. Finally, the disentangled features act as modality experts with distinct roles to generate importance-aware fusion results. Extensive experiments on two MSA benchmarks demonstrate that DERL outperforms state-of-the-art methods under various missing-modality conditions. For instance, our method achieves improvements of 2.47% in Acc-2 and 2.25% in MAE on MOSI under intra-modal missingness.

eess.AS

[1618] High-Fidelity Generative Audio Compression at 0.275kbps

Hao Ma, Ruihao Jing, Shansong Liu, Cheng Gong, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

Main category: eess.AS

TL;DR: GAC introduces a generative audio compression paradigm that shifts from waveform reconstruction to semantic understanding and generative synthesis, achieving ultra-low bitrate compression (0.275kbps) with high fidelity using computational power at the receiver.

Details

Motivation: Traditional audio compression methods fail at ultra-low bitrates due to focus on waveform reconstruction, leading to acoustic artifacts and semantic distortion. There's a need for task-oriented effectiveness rather than signal fidelity.

Method: GAC integrates semantic understanding at the transmitter with scalable generative synthesis at the receiver, leveraging the “More Computation, Less Bandwidth” philosophy. It uses a 1.8B-parameter model within the AI Flow framework, grounded in the Law of Information Capacity.

Result: Achieves high-fidelity reconstruction of 32kHz general audio at 0.275kbps, with strong intelligible audio transmission even at 0.175kbps (~3000x compression ratio), significantly outperforming state-of-the-art neural codecs in perceptual quality and semantic consistency.

Conclusion: GAC represents a paradigm shift from signal fidelity to task-oriented effectiveness, demonstrating that computational power at the receiver can overcome extreme communication bottlenecks for high-quality audio compression.

Abstract: High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform reconstruction. As a result, when operating at ultra-low bitrates, these methods degrade rapidly and often fail to preserve essential information, leading to severe acoustic artifacts and pronounced semantic distortion. To overcome these limitations, we introduce Generative Audio Compression (GAC), a novel paradigm shift from signal fidelity to task-oriented effectiveness. Implemented within the AI Flow framework, GAC is theoretically grounded in the Law of Information Capacity. These foundations posit that abundant computational power can be leveraged at the receiver to offset extreme communication bottlenecks–exemplifying the More Computation, Less Bandwidth philosophy. By integrating semantic understanding at the transmitter with scalable generative synthesis at the receiver, GAC offloads the information burden to powerful model priors. Our 1.8B-parameter model achieves high-fidelity reconstruction of 32kHz general audio at an unprecedented bitrate of 0.275kbps. Even at 0.175kbps, it still preserves a strong intelligible audio transmission capability, which represents an about 3000x compression ratio, significantly outperforming current state-of-the-art neural codecs in maintaining both perceptual quality and semantic consistency.

[1619] Solving Room Impulse Response Inverse Problems Using Flow Matching with Analytic Wiener Denoiser

Kyung Yun Lee, Nils Meyer-Kahlen, Vesa Välimäki, Sebastian J. Schlecht

Main category: eess.AS

TL;DR: RIRFlow: A training-free Bayesian framework for room impulse response inverse problems using flow matching with an analytic Gaussian process prior derived from RIR statistical structure.

Details

Motivation: Current RIR estimation methods rely on supervised learning or learned generative priors, requiring large training data and suffering from poor generalization outside training distribution. There's a need for training-free approaches that can handle various inverse problems robustly.

Method: Proposes RIRFlow, a training-free Bayesian framework using flow matching. Derives an analytic Gaussian process prior with exponentially decaying variance from RIR statistical structure, yielding closed-form MMSE Wiener denoiser. Integrates this analytic denoiser as prior in flow-based inverse solver for guided posterior sampling. Extends to nonlinear/non-Gaussian problems via local Gaussian approximation.

Result: Experiments on real RIRs across different inverse problems demonstrate robust performance, showing effectiveness of combining classic RIR model with flow-based generative inference.

Conclusion: RIRFlow provides a training-free, robust solution for RIR inverse problems by combining analytic RIR modeling with modern flow-based inference, eliminating need for data-driven priors while maintaining effectiveness across various problem types.

Abstract: Room impulse response (RIR) estimation naturally arises as a class of inverse problems, including denoising and deconvolution. While recent approaches often rely on supervised learning or learned generative priors, such methods require large amounts of training data and may generalize poorly outside the training distribution. In this work, we present RIRFlow, a training-free Bayesian framework for RIR inverse problems using flow matching. We derive a flow-consistent analytic prior from the statistical structure of RIRs, eliminating the need for data-driven priors. Specifically, we model RIR as a Gaussian process with exponentially decaying variance, which yields a closed-form minimum mean squared error (MMSE) Wiener denoiser. This analytic denoiser is integrated as a prior in an existing flow-based inverse solver, where inverse problems are solved via guided posterior sampling. Furthermore, we extend the solver to nonlinear and non-Gaussian inverse problems via a local Gaussian approximation of the guided posterior, and empirically demonstrate that this approximation remains effective in practice. Experiments on real RIRs across different inverse problems demonstrate robust performance, highlighting the effectiveness of combining a classic RIR model with the recent flow-based generative inference.

[1620] Adapting Where It Matters: Depth-Aware Adaptation for Efficient Multilingual Speech Recognition in Low-Resource Languages

Yang Xiao, Eun-Jung Holden, Ting Dang

Main category: eess.AS

TL;DR: DAMA is a depth-aware model adaptation framework for multilingual ASR that allocates adaptation capacity based on layer-specific language specificity, achieving state-of-the-art performance on low-resource languages with 80% fewer parameters.

Details

Motivation: Adapting multilingual speech foundation models to low-resource languages is challenging due to data scarcity and efficiency constraints. Full fine-tuning is expensive and overfits, while parameter-efficient methods like LoRA apply uniform adaptation across layers, overlooking internal representation differences and compromising effectiveness.

Method: Analyzes multilingual ASR models to reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Proposes DAMA framework that allocates adaptation capacity according to each layer’s role, with SVD-based initialization to preserve the U-shaped pattern and frozen middle-layer basis for efficiency.

Result: Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines.

Conclusion: Structure-aware adaptation (DAMA) provides efficient, scalable multilingual ASR adaptation, highlighting the benefits of depth-aware parameter allocation for low-resource language adaptation in speech foundation models.

Abstract: Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking internal representations thus compromising effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer’s role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.

[1621] SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Yochai Yemini, Yoav Ellinson, Rami Ben-Ari, Sharon Gannot, Ethan Fetaya

Main category: eess.AS

TL;DR: Unsupervised audio-visual speech separation using diffusion priors for clean speech and noise, outperforming supervised methods on noisy multi-speaker mixtures.

Details

Motivation: Address the challenge of separating speech from real-world environmental noise using single-microphone audio-visual data, where supervised methods may struggle with diverse noise conditions.

Method: Uses generative inverse sampling with dedicated diffusion priors for clean speech and ambient noise, reformulating a recent inverse sampler to jointly recover all sources. Extends to handle off-screen speaker separation.

Result: Outperforms leading supervised baselines in WER across mixtures of 1, 2, and 3 speakers with noise, despite being entirely unsupervised. Separated noise component has high fidelity suitable for acoustic scene detection.

Conclusion: The unsupervised diffusion-based approach effectively handles real-world noisy speech separation and can be extended to off-screen scenarios, with potential applications in downstream tasks like acoustic scene analysis.

Abstract: This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in \ac{WER} across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/

[1622] HuPER: A Human-Inspired Framework for Phonetic Perception

Chenxu Guo, Jiachen Lian, Yisi Liu, Baihe Huang, Shriyaa Narayanan, Cheol Jun Cho, Gopala Anumanchipalli

Main category: eess.AS

TL;DR: HuPER is a human-inspired phonetic perception framework that models adaptive inference over acoustic-phonetic evidence and linguistic knowledge, achieving SOTA performance with minimal training data and strong cross-lingual transfer.

Details

Motivation: To develop a more human-like phonetic perception system that can adapt to diverse acoustic conditions and leverage linguistic knowledge, addressing limitations of traditional ASR systems that require massive labeled data and struggle with cross-lingual transfer.

Method: Models phonetic perception as adaptive inference combining acoustic-phonetic evidence with linguistic knowledge, enabling multi-path perception under varying acoustic conditions with only 100 hours of training data.

Result: Achieves state-of-the-art phonetic error rates on five English benchmarks and demonstrates strong zero-shot transfer to 95 unseen languages, while enabling adaptive multi-path perception.

Conclusion: HuPER provides an effective human-inspired approach to phonetic perception that requires minimal training data, achieves strong performance, and enables cross-lingual transfer and adaptive perception.

Abstract: We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetics evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo avaliable at https://github.com/HuPER29/HuPER.

[1623] Joint Optimization of ASV and CM tasks: BTUEF Team’s Submission for WildSpoof Challenge

Oguzhan Kurnaz, Jagabandhu Mishra, Tomi Kinnunen, Cemal Hanilci

Main category: eess.AS

TL;DR: A modular spoofing-aware speaker verification framework that combines ASV and CM systems through non-linear fusion with trainable a-DCF loss optimization, achieving best results with ReDimNet ASV embeddings and fine-tuned SSL-AASIST CM.

Details

Motivation: To improve robustness of speaker verification systems against spoofing attacks by jointly addressing automatic speaker verification and spoofing countermeasures through an effective modular framework that enables reuse of existing ASV and CM systems.

Method: Proposes a modular SASV framework with non-linear fusion of ASV and CM systems, explicitly modeling their interaction, and optimizing with operating-condition-dependent trainable a-DCF loss. Evaluated using ECAPA-TDNN and ReDimNet as ASV embedding extractors and SSL-AASIST as CM model, with experiments conducted with/without fine-tuning on WildSpoof SASV training data.

Result: Best performance achieved by combining ReDimNet-based ASV embeddings with fine-tuned SSL-AASIST representations, yielding a-DCF of 0.0515 on progress evaluation set and 0.2163 on final evaluation set.

Conclusion: The modular SASV framework effectively combines existing ASV and CM systems through non-linear fusion and trainable loss optimization, demonstrating strong performance in spoofing-aware speaker verification, with ReDimNet ASV and fine-tuned SSL-AASIST being the optimal combination.

Abstract: Spoofing-aware speaker verification (SASV) jointly addresses automatic speaker verification and spoofing countermeasures to improve robustness against adversarial attacks. In this paper, we investigate our recently proposed modular SASV framework that enables effective reuse of publicly available ASV and CM systems through non-linear fusion, explicitly modeling their interaction, and optimization with an operating-condition-dependent trainable a-DCF loss. The framework is evaluated using ECAPA-TDNN and ReDimNet as ASV embedding extractors and SSL-AASIST as the CM model, with experiments conducted both with and without fine-tuning on the WildSpoof SASV training data. Results show that the best performance is achieved by combining ReDimNet-based ASV embeddings with fine-tuned SSL-AASIST representations, yielding an a-DCF of 0.0515 on the progress evaluation set and 0.2163 on the final evaluation set.

[1624] Short-wave admittance correction for a time-domain cochlear transmission line model

François Deloche, Morgan Thienpont, Sarah Verhulst

Main category: eess.AS

TL;DR: A time-domain transmission line model for cochlear mechanics with numerical correction for 2D effects using autoregressive filtering and regression techniques, applied to gerbil cochlear physiology with level-dependent nonlinearities.

Details

Motivation: Standard 1D transmission line models of cochlear mechanics fail to capture important 2D effects like pressure focusing around the basilar membrane and transverse viscous damping, especially in the short-wave region. These limitations become apparent when modeling small mammal cochleae where frequency selectivity shows only moderate dependence on sound level.

Method: Developed a numerical correction for basilar membrane admittance to account for 2D effects in time domain using autoregressive filtering and regression techniques. Implemented in a gerbil-specific transmission line model with instantaneous nonlinearities (variable damping) and made the correction level-dependent using a feedback loop.

Result: The updated model achieved decoupling between frequency selectivity and gain, providing 5 dB additional gain and extending the compressive regime range by 10 dB. Successfully addressed the insufficient compression issue in the original 1D nonlinear model.

Conclusion: The work demonstrates successful integration of analytical and regression methods for characterizing BM admittance, and combination of instantaneous and non-instantaneous nonlinearities. The approach provides a more accurate cochlear mechanics model for small mammals.

Abstract: Transmission line (TL) models implemented in the time domain can efficiently simulate basilar-membrane (BM) displacement in response to transient or non-stationary sounds. By design, a TL model is well-suited for an one-dimensional (1-D) characterization of the traveling wave, but the real configuration of the cochlea also introduces higher-dimensional effects. Such effects include the focusing of the pressure around the BM and transverse viscous damping, both of which are magnified in the short-wave region. The two effects depend on the wavelength and are more readily expressed in the frequency domain. In this paper, we introduce a numerical correction for the BM admittance to account for 2-D effects in the time domain using autoregressive filtering and regression techniques. The correction was required for the implementation of a TL model tailored to the gerbil cochlear physiology. The model, which includes instantaneous nonlinearities in the form of variable damping, initially presented insufficient compression with increasing sound levels. This limitation was explained by the strong coupling between gain and frequency selectivity assumed in the 1-D nonlinear TL model, whereas cochlear frequency selectivity shows only a moderate dependence on sound level in small mammals. The correction factor was implemented in the gerbil model and made level-dependent using a feedback loop. The updated model achieved some decoupling between frequency selectivity and gain, providing 5 dB of additional gain and extending the range of sound levels of the compressive regime by 10 dB. We discuss the relevance of this work through two key features: the integration of both analytical and regression methods for characterizing BM admittance, and the combination of instantaneous and non-instantaneous nonlinearities.

[1625] RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses

Shaoheng Xu, Chunyi Sun, Jihui, Zhang, Prasanga N. Samarasinghe, Thushara D. Abhayapala

Main category: eess.AS

TL;DR: RIR-Former: A transformer-based model for reconstructing room impulse responses from sparse microphone measurements using sinusoidal positional encoding and segmented multi-branch decoder architecture.

Details

Motivation: Measuring room impulse responses (RIRs) densely across space is impractical, but RIRs are essential for acoustic signal processing tasks. There's a need for efficient methods to reconstruct complete RIRs from sparse measurements.

Method: Proposes RIR-Former, a grid-free, one-step feed-forward transformer model with sinusoidal encoding for microphone position information and a segmented multi-branch decoder that separately handles early reflections and late reverberation.

Result: RIR-Former consistently outperforms state-of-the-art baselines in terms of normalized mean square error (NMSE) and cosine distance (CD) across diverse simulated acoustic environments, varying missing rates, and array configurations.

Conclusion: The approach shows strong potential for practical deployment and motivates future work on scaling to complex array geometries, dynamic acoustic scenes, and real-world environments.

Abstract: Room impulse responses (RIRs) are essential for many acoustic signal processing tasks, yet measuring them densely across space is often impractical. In this work, we propose RIR-Former, a grid-free, one-step feed-forward model for RIR reconstruction. By introducing a sinusoidal encoding module into a transformer backbone, our method effectively incorporates microphone position information, enabling interpolation at arbitrary array locations. Furthermore, a segmented multi-branch decoder is designed to separately handle early reflections and late reverberation, improving reconstruction across the entire RIR. Experiments on diverse simulated acoustic environments demonstrate that RIR-Former consistently outperforms state-of-the-art baselines in terms of normalized mean square error (NMSE) and cosine distance (CD), under varying missing rates and array configurations. These results highlight the potential of our approach for practical deployment and motivate future work on scaling from randomly spaced linear arrays to complex array geometries, dynamic acoustic scenes, and real-world environments.

[1626] UL-UNAS: Ultra-Lightweight U-Nets for Real-Time Speech Enhancement via Network Architecture Search

Xiaobin Rong, Leyan Yang, Dahan Wang, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu

Main category: eess.AS

TL;DR: Proposes UL-UNAS, an ultra-lightweight U-net optimized by neural architecture search for real-time speech enhancement on low-footprint devices.

Details

Motivation: Need for lightweight models for real-time speech enhancement applications, especially for deployment on low-footprint devices where computational resources are limited.

Method: 1) Explores efficient convolutional blocks within U-Net framework; 2) Introduces affine PReLU activation function and causal time-frequency attention module; 3) Uses neural architecture search to find optimal architecture in designed search space.

Result: UL-UNAS outperforms latest ultra-lightweight models with same/lower computational complexity and delivers competitive performance compared to more computationally intensive baseline models.

Conclusion: The proposed UL-UNAS provides an effective solution for speech enhancement on resource-constrained devices through careful architecture design and neural architecture search optimization.

Abstract: Lightweight models are essential for real-time speech enhancement applications. In recent years, there has been a growing trend toward developing increasingly compact models for speech enhancement. In this paper, we propose an Ultra-Lightweight U-net optimized by Network Architecture Search (UL-UNAS), which is suitable for implementation in low-footprint devices. Firstly, we explore the application of various efficient convolutional blocks within the U-Net framework to identify the most promising candidates. Secondly, we introduce two boosting components to enhance the capacity of these convolutional blocks: a novel activation function named affine PReLU and a causal time-frequency attention module. Furthermore, we leverage neural architecture search to discover an optimal architecture within our carefully designed search space. By integrating the above strategies, UL-UNAS not only significantly outperforms the latest ultra-lightweight models with the same or lower computational complexity, but also delivers competitive performance compared to recent baseline models that require substantially higher computational resources. Source code and audio demos are available at https://github.com/Xiaobin-Rong/ul-unas.

[1627] Investigation of Speech and Noise Latent Representations in Single-channel VAE-based Speech Enhancement

Jiatong Li, Simon Doclo

Main category: eess.AS

TL;DR: Investigates how different latent representations from pretrained VAEs affect speech enhancement performance, showing that clearly separated speech/noise representations improve results over overlapping ones.

Details

Motivation: Previous VAE-based speech enhancement systems use pretrained VAEs for speech and noise, but modifying pretrained VAE loss terms affects the latent representations. The paper investigates how these different representations impact speech enhancement performance.

Method: Uses a VAE-based single-channel speech enhancement system with Bayesian permutation training. Two pretrained VAEs obtain latent representations for speech and noise, then a noisy VAE learns to generate these representations from noisy speech. Experiments analyze how different latent space characteristics affect performance.

Result: Experiments on DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that latent spaces where speech and noise representations are clearly separated significantly improve performance over standard VAEs that produce overlapping representations.

Conclusion: The quality of latent representations in VAE-based speech enhancement matters - clearly separated speech and noise representations lead to better enhancement performance than overlapping ones.

Abstract: Recently, a variational autoencoder (VAE)-based single-channel speech enhancement system using Bayesian permutation training has been proposed, which uses two pretrained VAEs to obtain latent representations for speech and noise. Based on these pretrained VAEs, a noisy VAE learns to generate speech and noise latent representations from noisy speech for speech enhancement. Modifying the pretrained VAE loss terms affects the pretrained speech and noise latent representations. In this paper, we investigate how these different representations affect speech enhancement performance. Experiments on the DNS3, WSJ0-QUT, and VoiceBank-DEMAND datasets show that a latent space where speech and noise representations are clearly separated significantly improves performance over standard VAEs, which produce overlapping speech and noise representations.

[1628] Neural acoustic multipole splatting for room impulse response synthesis

Geonwoo Baek, Jung-Woo Choi

Main category: eess.AS

TL;DR: NAMS predicts Room Impulse Responses at arbitrary positions using neural acoustic multipoles with pruning for efficiency.

Details

Motivation: Accurate RIR prediction at arbitrary receiver positions is crucial for spatial audio rendering applications, requiring methods that balance physical accuracy with computational efficiency.

Method: Neural Acoustic Multipole Splatting (NAMS) learns positions of neural acoustic multipoles and predicts their emitted signals and directivities using neural networks, representing sound fields through multipole combinations that adhere to Helmholtz equation constraints, with progressive pruning of redundant multipoles during training.

Result: NAMS outperforms previous approaches on most metrics across real and synthetic datasets while maintaining fast inference, with multipole splatting achieving better performance than monopole models using only 20% of the poles.

Conclusion: NAMS provides an effective approach for RIR prediction that combines physical constraints with neural network flexibility, offering improved performance and efficiency through multipole representation and pruning.

Abstract: Room Impulse Response (RIR) prediction at arbitrary receiver positions is essential for practical applications such as spatial audio rendering. We propose Neural Acoustic Multipole Splatting (NAMS), which synthesizes RIRs at unseen receiver positions by learning the positions of neural acoustic multipoles and predicting their emitted signals and directivities using a neural network. Representing sound fields through a combination of multipoles offers sufficient flexibility to express complex acoustic scenes while adhering to physical constraints such as the Helmholtz equation. We also introduce a pruning strategy that starts from a dense splatting of neural acoustic multipoles and progressively eliminates redundant ones during training. Experiments conducted on both real and synthetic datasets indicate that the proposed method surpasses previous approaches on most metrics while maintaining rapid inference. Ablation studies reveal that multipole splatting with pruning achieves better performance than the monopole model with just 20% of the poles.

[1629] Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Kai-Wei Chang, En-Pei Hu, Chun-Yi Kuan, Wenze Ren, Wei-Chih Chen, Guan-Ting Lin, Yu Tsao, Shao-Hua Sun, Hung-yi Lee, James Glass

Main category: eess.AS

TL;DR: Game-Time Benchmark evaluates conversational spoken language models’ temporal dynamics capabilities like timing, tempo, and simultaneous speaking through basic instruction-following and advanced temporally-constrained tasks.

Details

Motivation: Current conversational spoken language models lack systematic evaluation of their temporal dynamics capabilities (timing, tempo, simultaneous speaking), which are critical for conversational fluency but remain unevaluated.

Method: Introduced Game-Time Benchmark framework with basic instruction-following tasks and advanced tasks with temporal constraints (tempo adherence, synchronized responses) to systematically assess temporal capabilities of diverse SLM architectures.

Result: State-of-the-art models handle basic tasks well but many contemporary systems struggle with fundamental instruction-following; nearly all models degrade substantially under temporal constraints, exposing weaknesses in time awareness and full-duplex interaction.

Conclusion: Game-Time Benchmark provides foundation for guiding future research toward more temporally-aware conversational AI, highlighting persistent weaknesses in current SLMs’ temporal dynamics capabilities.

Abstract: Conversational Spoken Language Models (SLMs) are emerging as a promising paradigm for real-time speech interaction. However, their capacity of temporal dynamics, including the ability to manage timing, tempo and simultaneous speaking, remains a critical and unevaluated challenge for conversational fluency. To address this gap, we introduce the Game-Time Benchmark, a framework to systematically assess these temporal capabilities. Inspired by how humans learn a language through language activities, Game-Time consists of basic instruction-following tasks and advanced tasks with temporal constraints, such as tempo adherence and synchronized responses. Our evaluation of diverse SLM architectures reveals a clear performance disparity: while state-of-the-art models handle basic tasks well, many contemporary systems still struggle with fundamental instruction-following. More critically, nearly all models degrade substantially under temporal constraints, exposing persistent weaknesses in time awareness and full-duplex interaction. The Game-Time Benchmark provides a foundation for guiding future research toward more temporally-aware conversational AI. Demos and datasets are available on our project website https://ga642381.github.io/Game-Time.

[1630] I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-based Single-channel Speech Enhancement

Jiatong Li, Simon Doclo

Main category: eess.AS

TL;DR: Improved DCCRN-VAE speech enhancement system with three modifications: removing skip connections, using β-VAE for better latent space regularization, and generating both speech and noise latent representations, achieving better generalization on mismatched datasets.

Details

Motivation: To improve the generalization ability of existing VAE-based speech enhancement systems (DCCRN-VAE) by modifying the architecture and training approach to learn more informative latent representations that work better on mismatched/noisy conditions.

Method: Three key modifications: 1) Remove skip connections in pretrained VAEs to encourage more informative latent representations; 2) Use β-VAE during pretraining to better balance reconstruction and latent space regularization; 3) Have the noise suppression VAE generate both speech and noise latent representations instead of just speech.

Result: Achieves comparable performance to DCCRN and DCCRN-VAE baselines on matched DNS3 dataset, but outperforms them on mismatched datasets (WSJ0-QUT, Voicebank-DEMEND), demonstrating improved generalization. Also shows similar performance can be achieved with classical fine-tuning instead of adversarial training.

Conclusion: The proposed modifications to DCCRN-VAE improve generalization ability for speech enhancement tasks, particularly in mismatched conditions, while also simplifying the training pipeline by enabling classical fine-tuning instead of adversarial training.

Abstract: Recently, a complex variational autoencoder (VAE)-based single-channel speech enhancement system based on the DCCRN architecture has been proposed. In this system, a noise suppression VAE (NSVAE) learns to extract clean speech representations from noisy speech using pretrained clean speech and noise VAEs with skip connections. In this paper, we improve DCCRN-VAE by incorporating three key modifications: 1) removing the skip connections in the pretrained VAEs to encourage more informative speech and noise latent representations; 2) using $β$-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) a NSVAE generating both speech and noise latent representations. Experiments show that the proposed system achieves comparable performance as the DCCRN and DCCRN-VAE baselines on the matched DNS3 dataset but outperforms the baselines on mismatched datasets (WSJ0-QUT, Voicebank-DEMEND), demonstrating improved generalization ability. In addition, an ablation study shows that a similar performance can be achieved with classical fine-tuning instead of adversarial training, resulting in a simpler training pipeline.

[1631] FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation

Junseok Lee, Sangyong Lee, Chang-Jae Chun

Main category: eess.AS

TL;DR: FastSLM introduces a token-efficient architecture for scaling multimodal LLMs to long-form speech through extreme temporal compression, reducing tokens by 93% while maintaining competitive performance.

Details

Motivation: Current MLLMs struggle with long-form speech due to explosive token growth from high-frame-rate acoustic features, making long-context processing computationally prohibitive.

Method: Uses Hierarchical Frame Querying Transformer (HFQ-Former) to progressively distill local acoustic details into compact, semantically rich representations across multiple temporal scales, achieving 1.67 tokens per second.

Result: Achieves competitive performance with SOTA models on long-form benchmarks while operating with significantly lower FLOPs and parameter counts.

Conclusion: Extreme token compression is a viable pathway for making real-time, long-context speech understanding feasible for LLMs under computational constraints.

Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision, language, and video understanding tasks, scaling them to long-form speech remains a critical bottleneck due to the explosive growth of input tokens. Existing speech-language models typically project high-frame-rate acoustic features directly into the LLM input space, rendering long-context processing computationally prohibitive as audio duration increases. In this paper, we present FastSLM, a token-efficient architecture designed to overcome this scalability limit through extreme temporal compression. At its core is the Hierarchical Frame Querying Transformer (HFQ-Former), which progressively distills local acoustic details into compact, semantically rich representations across multiple temporal scales. This hierarchical abstraction reduces the speech representation rate to just 1.67 tokens per second, achieving a 93 percent reduction in tokens compared to standard frame-level adapters, while preserving the critical context required for complex reasoning. Experimental results demonstrate that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks, despite operating with significantly lower FLOPs and parameter counts. Our findings establish that extreme token compression is a viable pathway to making real-time, long-context speech understanding feasible for LLMs, even under strict computational constraints. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3

eess.IV

[1632] Frequent Pattern Mining approach to Image Compression

Avinash Kadimisetty, C. Oswald, B. Sivalselvan

Main category: eess.IV

TL;DR: A novel image compression method using k-means clustering and frequent pattern mining to replace DCT in JPEG, achieving 45% better compression with minimal quality loss.

Details

Motivation: To improve image compression efficiency by addressing redundancy through pattern mining rather than traditional DCT-based approaches, reducing code table size while maintaining visual quality.

Method: Combines k-means clustering with closed frequent sequence mining to replace DCT in JPEG compression; uses refined GSP algorithm with pruning techniques to optimize pattern cardinality and reduce code table size.

Result: Achieves 45% improvement in compression ratios on benchmark datasets with negligible loss in visual quality (PSNR and SSIM metrics show minimal degradation).

Conclusion: The proposed FPM-based compression method significantly outperforms existing alternatives in compression efficiency while maintaining acceptable image quality, offering a promising alternative to traditional DCT-based approaches.

Abstract: The paper focuses on Image Compression, explaining efficient approaches based on Frequent Pattern Mining(FPM). The proposed compression mechanism is based on clustering similar pixels in the image and thus using cluster identifiers in image compression. Redundant data in the image is effectively handled by replacing the DCT phase of conventional JPEG through a mixture of k-means Clustering and Closed Frequent Sequence Mining. To optimize the cardinality of pattern(s) in encoding, efficient pruning techniques have been used through the refinement of Conventional Generalized Sequential Pattern Mining(GSP) algorithm. We have proposed a mechanism for finding the frequency of a sequence which will yield significant reduction in the code table size. The algorithm is tested by compressing benchmark datasets yielding an improvement of 45% in compression ratios, often outperforming the existing alternatives. PSNR and SSIM, which are the image quality metrics, have been tested which show a negligible loss in visual quality.

[1633] Radiomics in Medical Imaging: Methods, Applications, and Challenges

Fnu Neha, Deepak kumar Shukla

Main category: eess.IV

TL;DR: Comprehensive survey of radiomics pipelines analyzing how methodological choices across acquisition, preprocessing, feature engineering, modeling, and evaluation affect robustness and clinical translation.

Details

Motivation: Radiomics faces challenges with feature instability, reproducibility, validation bias, and limited clinical translation. Existing reviews focus on applications or isolated components, lacking analysis of how interdependent design choices across the entire pipeline affect robustness.

Method: End-to-end analysis of radiomics pipelines examining methodological decisions at each stage: feature extraction, selection, dimensionality reduction; classical ML and deep learning approaches; ensemble/hybrid frameworks; validation protocols; data leakage prevention; statistical reliability.

Result: Identifies challenges in standardization, domain shift, and clinical deployment. Outlines future directions including hybrid radiomics-AI models, multimodal fusion, federated learning, and standardized benchmarking.

Conclusion: Comprehensive survey provides framework for understanding radiomics pipeline interdependencies and guides future research toward more robust, clinically translatable radiomics systems.

Abstract: Radiomics enables quantitative medical image analysis by converting imaging data into structured, high-dimensional feature representations for predictive modeling. Despite methodological developments and encouraging retrospective results, radiomics continue to face persistent challenges related to feature instability, limited reproducibility, validation bias, and restricted clinical translation. Existing reviews largely focus on application-specific outcomes or isolated pipeline components, with limited analysis of how interdependent design choices across acquisition, preprocessing, feature engineering, modeling, and evaluation collectively affect robustness and generalizability. This survey provides an end-to-end analysis of radiomics pipelines, examining how methodological decisions at each stage influence feature stability, model reliability, and translational validity. This paper reviews radiomic feature extraction, selection, and dimensionality reduction strategies; classical machine and deep learning-based modeling approaches; and ensemble and hybrid frameworks, with emphasis on validation protocols, data leakage prevention, and statistical reliability. Clinical applications are discussed with a focus on evaluation rigor rather than reported performance metrics. The survey identifies open challenges in standardization, domain shift, and clinical deployment, and outlines future directions such as hybrid radiomics-artificial intelligence models, multimodal fusion, federated learning, and standardized benchmarking.

[1634] Toward a Unified Semantic Loss Model for Deep JSCC-based Transmission of EO Imagery

Ti Ti Nguyen, Thanh-Dung Le, Vu Nguyen Ha, Duc-Dung Tran, Hung Nguyen-Kha, Dinh-Hieu Tran, Carlos L. Marcos-Rojas, Juan C. Merlano-Duncan, Symeon Chatzinotas

Main category: eess.IV

TL;DR: Deep Joint Source-Channel Coding for Earth Observation imagery transmission with semantic loss analysis across reconstruction and task-oriented frameworks.

Details

Motivation: High-resolution Earth Observation imagery creates massive data volumes that strain satellite communication systems with limited bandwidth, power, and dynamic link conditions, requiring efficient transmission solutions.

Method: Investigates Deep Joint Source-Channel Coding (DJSCC) with two approaches: 1) reconstruction-centric framework analyzing semantic degradation under varying compression ratios and channel SNR, and 2) task-oriented framework integrating DJSCC with lightweight application-specific models (e.g., EfficientViT) measured by downstream task accuracy.

Result: Proposes a unified semantic loss framework that captures both reconstruction-centric and task-oriented performance within a single model, characterizing the relationship between JSCC compression, channel SNR, and semantic quality.

Conclusion: DJSCC offers actionable insights for designing robust and efficient EO imagery transmission under resource-constrained satellite links through semantic-aware compression and transmission strategies.

Abstract: Modern Earth Observation (EO) systems increasingly rely on high-resolution imagery to support critical applications such as environmental monitoring, disaster response, and land-use analysis. Although these applications benefit from detailed visual data, the resulting data volumes impose significant challenges on satellite communication systems constrained by limited bandwidth, power, and dynamic link conditions. To address these limitations, this paper investigates Deep Joint Source-Channel Coding (DJSCC) as an effective source-channel paradigm for the transmission of EO imagery. We focus on two complementary aspects of semantic loss in DJSCC-based systems. First, a reconstruction-centric framework is evaluated by analyzing the semantic degradation of reconstructed images under varying compression ratios and channel signal-to-noise ratios (SNR). Second, a task-oriented framework is developed by integrating DJSCC with lightweight, application-specific models (e.g., EfficientViT), with performance measured using downstream task accuracy rather than pixel-level fidelity. Based on extensive empirical analysis, we propose a unified semantic loss framework that captures both reconstruction-centric and task-oriented performance within a single model. This framework characterizes the implicit relationship between JSCC compression, channel SNR, and semantic quality, offering actionable insights for the design of robust and efficient EO imagery transmission under resource-constrained satellite links.

[1635] Visible Singularities Guided Correlation Network for Limited-Angle CT Reconstruction

Yiyang Wen, Liu Shi, Zekun Zhou, WenZhe Shan, Qiegen Liu

Main category: eess.IV

TL;DR: A deep learning method called VSGC for limited-angle CT reconstruction that uses visible singularities guidance and correlation networks to address directional artifacts and information loss.

Details

Motivation: Limited-angle CT reduces radiation dose and scanning time but suffers from directional artifacts and information loss due to missing projection angles. Existing deep learning methods don't fully address these core imaging characteristics.

Method: VSGC network with two core steps: 1) Extract visible singularities edge features from LACT images and focus attention on them, 2) Establish correlations between VS edge features and other image regions. Uses multi-scale loss function with anisotropic constraint.

Result: VSGC outperforms alternative methods, especially in small angular ranges, with PSNR improvement of 2.45 dB and SSIM enhancement of 1.5%. Validated on both simulated and real datasets.

Conclusion: The proposed VSGC method effectively addresses directional artifacts and information loss in LACT reconstruction by leveraging visible singularities guidance and correlation networks, demonstrating superior performance particularly in challenging small angular ranges.

Abstract: Limited-angle computed tomography (LACT) offers the advantages of reduced radiation dose and shortened scanning time. Traditional reconstruction algorithms exhibit various inherent limitations in LACT. Currently, most deep learning-based LACT reconstruction methods focus on multi-domain fusion or the introduction of generic priors, failing to fully align with the core imaging characteristics of LACT-such as the directionality of artifacts and directional loss of structural information, which are caused by the absence of projection angles in certain directions. Inspired by the theory of visible and invisible singularities, taking into account the aforementioned core imaging characteristics of LACT, we propose a Visible Singularities Guided Correlation network for LACT reconstruction (VSGC). The design philosophy of VSGC consists of two core steps: First, extract VS edge features from LACT images and focus the model’s attention on these VS. Second, establish correlations between the VS edge features and other regions of the image. Additionally, a multi-scale loss function with anisotropic constraint is employed to constrain the model to converge in multiple aspects. Finally, qualitative and quantitative validations are conducted on both simulated and real datasets to verify the effectiveness and feasibility of the proposed design. Particularly, in comparison with alternative methods, VSGC delivers more prominent performance in small angular ranges, with the PSNR improvement of 2.45 dB and the SSIM enhancement of 1.5%. The code is publicly available at https://github.com/yqx7150/VSGC.

[1636] SCALED : Surrogate-gradient for Codec-Aware Learning of Downsampling in ABR Streaming

Esteban Pesnel, Julien Le Tanou, Michael Ropert, Thomas Maugey, Aline Roumy

Main category: eess.IV

TL;DR: A novel framework for end-to-end training of video streaming pipelines using real, non-differentiable codecs with data-driven surrogate gradients, achieving 5.19% BD-BR improvement over codec-agnostic approaches.

Details

Motivation: Traditional ABR streaming pipelines optimize processing stages in isolation, leading to suboptimal end-to-end rate-distortion performance. While learned resampling methods exist, training end-to-end is challenging due to non-differentiable standard video codecs. Differentiable proxy codecs are approximations that may not fully capture real codec behavior.

Method: Introduces a framework enabling end-to-end training with real, non-differentiable codecs using data-driven surrogate gradients derived from actual compression errors. This aligns training objectives with deployment performance by bypassing the need for differentiable proxy models.

Result: Achieves 5.19% improvement in BD-BR (PSNR) compared to codec-agnostic training approaches, with consistent performance across the entire rate-distortion convex hull spanning multiple downsampling ratios.

Conclusion: The framework successfully enables end-to-end optimization of video streaming pipelines using real codecs, overcoming the limitations of differentiable proxy models and improving rate-distortion performance significantly.

Abstract: The rapid growth in video consumption has introduced significant challenges to modern streaming architectures. Over-the-Top (OTT) video delivery now predominantly relies on Adaptive Bitrate (ABR) streaming, which dynamically adjusts bitrate and resolution based on client-side constraints such as display capabilities and network bandwidth. This pipeline typically involves downsampling the original high-resolution content, encoding and transmitting it, followed by decoding and upsampling on the client side. Traditionally, these processing stages have been optimized in isolation, leading to suboptimal end-to-end rate-distortion (R-D) performance. The advent of deep learning has spurred interest in jointly optimizing the ABR pipeline using learned resampling methods. However, training such systems end-to-end remains challenging due to the non-differentiable nature of standard video codecs, which obstructs gradient-based optimization. Recent works have addressed this issue using differentiable proxy models, based either on deep neural networks or hybrid coding schemes with differentiable components such as soft quantization, to approximate the codec behavior. While differentiable proxy codecs have enabled progress in compression-aware learning, they remain approximations that may not fully capture the behavior of standard, non-differentiable codecs. To our knowledge, there is no prior evidence demonstrating the inefficiencies of using standard codecs during training. In this work, we introduce a novel framework that enables end-to-end training with real, non-differentiable codecs by leveraging data-driven surrogate gradients derived from actual compression errors. It facilitates the alignment between training objectives and deployment performance. Experimental results show a 5.19% improvement in BD-BR (PSNR) compared to codec-agnostic training approaches, consistently across the entire rate-distortion convex hull spanning multiple downsampling ratios.

[1637] SurfelSoup: Learned Point Cloud Geometry Compression With a Probablistic SurfelTree Representation

Tingyu Fan, Ran Gong, Yueyu Hu, Yao Wang

Main category: eess.IV

TL;DR: SurfelSoup: An end-to-end learned surface-based framework for point cloud geometry compression using probabilistic surfel representation organized in an octree-like hierarchy for rate-distortion optimization.

Details

Motivation: To improve point cloud geometry compression by moving beyond traditional voxel-based approaches and point-wise compression, which can be redundant in smooth regions, and to achieve compact yet smooth surface reconstructions with better rate-distortion performance.

Method: Proposes pSurfel (probabilistic surfel) representation using bounded generalized Gaussian distribution to model local point occupancies, organized into pSurfelTree (octree-like hierarchy) with Tree Decision module that adaptively terminates tree subdivision for optimal surfel granularity selection.

Result: Consistent gains on geometry compression over voxel-based baselines and MPEG standard G-PCC-GesTM-TriSoup, with visually superior reconstructions featuring smooth and coherent surface structures under MPEG common test conditions.

Conclusion: SurfelSoup provides an effective surface-based framework for point cloud compression that avoids redundant point-wise compression in smooth regions and achieves better rate-distortion performance with high-quality surface reconstructions.

Abstract: This paper presents SurfelSoup, an end-to-end learned surface-based framework for point cloud geometry compression, with surface-structured primitives for representation. It proposes a probabilistic surface representation, pSurfel, which models local point occupancies using a bounded generalized Gaussian distribution. In addition, the pSurfels are organized into an octree-like hierarchy, pSurfelTree, with a Tree Decision module that adaptively terminates the tree subdivision for rate-distortion optimal Surfel granularity selection. This formulation avoids redundant point-wise compression in smooth regions and produces compact yet smooth surface reconstructions. Experimental results under the MPEG common test condition show consistent gain on geometry compression over voxel-based baselines and MPEG standard G-PCC-GesTM-TriSoup, while providing visually superior reconstructions with smooth and coherent surface structures.

[1638] Recent Advances of End-to-End Video Coding Technologies for AVS Standard Development

Xihua Sheng, Xiongzhuang Liang, Chuanbo Tang, Zhirui Zuo, Yifan Bian, Yutao Xie, Zhuoyuan Li, Yuqi Li, Hui Xiang, Li Li, Dong Liu

Main category: eess.IV

TL;DR: AVS-EEM project develops end-to-end intelligent video coding with practical deployment focus, achieving superior compression efficiency over conventional AVS3 under strict complexity constraints.

Details

Motivation: To pursue greater video compression efficiency through intelligent video coding while maintaining practical deployment feasibility with low computational complexity and adherence to conventional video coding test conditions.

Method: Established AVS-EEM project with systematic technical framework covering model architectures, training strategies, and inference optimizations, with iterative refinement over two years under strict complexity constraints.

Result: Latest model achieves superior compression efficiency compared to conventional AVS3 reference software, showing substantial performance improvement through continuous refinement.

Conclusion: AVS-EEM represents significant progress toward deployable intelligent video coding standard, demonstrating that end-to-end AI approaches can outperform conventional methods while meeting practical deployment requirements.

Abstract: Video coding standards are essential to enable the interoperability and widespread adoption of efficient video compression technologies. In pursuit of greater video compression efficiency, the AVS video coding working group launched the standardization exploration of end-to-end intelligent video coding, establishing the AVS End-to-End Intelligent Video Coding Exploration Model (AVS-EEM) project. A core design principle of AVS-EEM is its focus on practical deployment, featuring inherently low computational complexity and requiring strict adherence to the common test conditions of conventional video coding. This paper details the development history of AVS-EEM and provides a systematic introduction to its key technical framework, covering model architectures, training strategies, and inference optimizations. These innovations have collectively driven the project’s rapid performance evolution, enabling continuous and significant gains under strict complexity constraints. Through over two years of iterative refinement and collaborative effort, the coding performance of AVS-EEM has seen substantial improvement. Experimental results demonstrate that its latest model achieves superior compression efficiency compared to the conventional AVS3 reference software, marking a significant step toward a deployable intelligent video coding standard.

[1639] A Renderer-Enabled Framework for Computing Parameter Estimation Lower Bounds in Plenoptic Imaging Systems

Abhinav V. Sambasivan, Liam J. Coulter, Richard G. Paxman, Jarvis D. Haupt

Main category: eess.IV

TL;DR: A framework for computing information-theoretic lower bounds on scene parameter estimation error in plenoptic imaging systems, particularly for passive indirect imaging where observations lack direct line-of-sight information.

Details

Motivation: To establish fundamental limits on how accurately scene parameters can be estimated from noisy plenoptic observations, especially in challenging passive indirect imaging scenarios where traditional direct measurement approaches fail.

Method: Uses computer graphics rendering to synthesize forward models, then evaluates the Hammersley-Chapman-Robbins bound to compute lower bounds on variance of unbiased estimators. Analyzes effects of inexact rendering on bounds.

Result: The framework produces lower bounds that match Maximum Likelihood Estimator performance in canonical object localization problems, indicating they capture true fundamental limits in representative scenarios.

Conclusion: The proposed framework successfully establishes information-theoretic limits for plenoptic imaging parameter estimation, providing theoretical foundations for evaluating estimator performance in complex imaging scenarios.

Abstract: This work focuses on assessing the information-theoretic limits of scene parameter estimation in plenoptic imaging systems. A general framework to compute lower bounds on the parameter estimation error from noisy plenoptic observations is presented, with a particular focus on passive indirect imaging problems, where the observations do not contain line-of-sight information about the parameter(s) of interest. Using computer graphics rendering software to synthesize the often-complicated dependence among parameter(s) of interest and observations, i.e. the forward model, the proposed framework evaluates the Hammersley-Chapman-Robbins bound to establish lower bounds on the variance of any unbiased estimator of the unknown parameters. The effects of inexact rendering of the true forward model on the computed lower bounds are also analyzed, both theoretically and via simulations. Experimental evaluations compare the computed lower bounds with the performance of the Maximum Likelihood Estimator on a canonical object localization problem, showing that the lower bounds computed via the framework proposed here are indicative of the true underlying fundamental limits in several nominally representative scenarios.

[1640] Advanced Geometric Correction Algorithms for 3D Medical Reconstruction: Comparison of Computed Tomography and Macroscopic Imaging

Tomasz Les, Tomasz Markiewicz, Malgorzata Lorent, Miroslaw Dziekiewicz, Krzysztof Siwek

Main category: eess.IV

TL;DR: Hybrid two-stage registration framework for 3D kidney reconstruction from macroscopic slices using CT models as reference, combining geometric optimization with deep learning refinement.

Details

Motivation: Addresses data-scarcity and high-distortion challenges in macroscopic imaging where fully learning-based registration fails due to limited training diversity and large nonrigid deformations exceeding convolutional filter capture range.

Method: Two-stage approach: 1) Optimal Cross-section Matching (OCM) algorithm for constrained global alignment (translation, rotation, scaling), 2) lightweight deep-learning refinement network (VoxelMorph-inspired) predicting residual local deformations between consecutive slices with hierarchical decomposition of registration manifold.

Result: Experiments on 40 kidneys dataset showed better results compared to single-stage baselines, maintaining physical calibration via Hough-based grid detection and Bezier-based contour smoothing for robust meshing and volume estimation.

Conclusion: The hybrid OCM+DL framework integrates explicit geometric priors with neural network flexibility, ensuring stable optimization and plausible deformation fields with few training examples, advancing precision and anatomical realism for multimodal 3D reconstructions in medical applications.

Abstract: This paper introduces a hybrid two-stage registration framework for reconstructing three-dimensional (3D) kidney anatomy from macroscopic slices, using CT-derived models as the geometric reference standard. The approach addresses the data-scarcity and high-distortion challenges typical of macroscopic imaging, where fully learning-based registration (e.g., VoxelMorph) often fails to generalize due to limited training diversity and large nonrigid deformations that exceed the capture range of unconstrained convolutional filters. In the proposed pipeline, the Optimal Cross-section Matching (OCM) algorithm first performs constrained global alignment: translation, rotation, and uniform scaling to establish anatomically consistent slice initialization. Next, a lightweight deep-learning refinement network, inspired by VoxelMorph, predicts residual local deformations between consecutive slices. The core novelty of this architecture lies in its hierarchical decomposition of the registration manifold. This hybrid OCM+DL design integrates explicit geometric priors with the flexible learning capacity of neural networks, ensuring stable optimization and plausible deformation fields even with few training examples. Experiments on an original dataset of 40 kidneys demonstrated better results compared to single-stage baselines. The pipeline maintains physical calibration via Hough-based grid detection and employs Bezier-based contour smoothing for robust meshing and volume estimation. Although validated on kidney data, the proposed framework generalizes to other soft-tissue organs reconstructed from optical or photographic cross-sections. By decoupling interpretable global optimization from data-efficient deep refinement, the method advances the precision, reproducibility, and anatomical realism of multimodal 3D reconstructions for surgical planning, morphological assessment, and medical education.

[1641] Benchmarking Vanilla GAN, DCGAN, and WGAN Architectures for MRI Reconstruction: A Quantitative Analysis

Humaira Mehwish, Hina Shakir, Muneeba Rashid, Asarim Aamir, Reema Qaiser Khan

Main category: eess.IV

TL;DR: Comparison of three GAN architectures (Vanilla GAN, DCGAN, WGAN) for MRI reconstruction across knee, brain, and cardiac datasets, showing DCGAN and WGAN achieve superior image quality metrics.

Details

Motivation: MRI is crucial for medical diagnosis but often suffers from quality issues; GANs can enhance MRI reconstruction quality and diagnostic accuracy, but there's a need to compare different GAN architectures across various body regions.

Method: Evaluated three GAN models: Vanilla GAN, DCGAN, and WGAN on knee (1000 images), cardiac (805 images), and brain (90 images) MRI datasets. Used SSIM and PSNR metrics to assess reconstruction quality, with statistical validation of results.

Result: DCGAN achieved best SSIM (0.97) and PSNR (49.3), while WGAN had SSIM of 0.99 but lower PSNR (43.5). Vanilla GAN performed worst with SSIM 0.84 and PSNR 26. Results show DCGAN and WGAN frameworks are promising for MR image reconstruction.

Conclusion: DCGAN and WGAN-based frameworks show superior performance for MRI reconstruction across different body regions, providing a reproducible benchmark for future hybrid GANs and clinical MRI applications.

Abstract: Magnetic Resonance Imaging (MRI) is a crucial imaging modality for viewing internal body structures. This research work analyses the performance of popular GAN models for accurate and precise MRI reconstruction by enhancing image quality and improving diagnostic accuracy. Three GAN architectures considered in this study are Vanilla GAN, Deep Convolutional GAN (DCGAN), and Wasserstein GAN (WGAN). They were trained and evaluated using knee, brain, and cardiac MRI datasets to assess their generalizability across body regions. While the Vanilla GAN operates on the fundamentals of the adversarial network setup, DCGAN advances image synthesis by securing the convolutional layers, giving a superior appearance to the prevalent spatial features. Training instability is resolved in WGAN through the Wasserstein distance to minimize an unstable regime, therefore, ensuring stable convergence and high-quality images. The GAN models were trained and tested using 1000 MR images of an anonymized knee, 805 images of Heart, 90 images of Brain MRI dataset. The Structural Similarity Index (SSIM) for Vanilla GAN is 0.84, DCGAN is 0.97, and WGAN is 0.99. The Peak Signal to Noise Ratio (PSNR) for Vanilla GAN is 26, DCGAN is 49.3, and WGAN is 43.5. The results were further statistically validated. This study shows that DCGAN and WGAN-based frameworks are promising in MR image reconstruction because of good image quality and superior accuracy. With the first cross-organ benchmark of baseline GANs under a common preprocessing pipeline, this work provides a reproducible benchmark for future hybrid GANs and clinical MRI applications.

[1642] Unified ROI-based Image Compression Paradigm with Generalized Gaussian Model

Kai Hu, Junfu Tan, Fang Xu, Ramy Samy, Yu Liu

Main category: eess.IV

TL;DR: Proposes Generalized Gaussian Model (GGM) for ROI-based image compression to better fit sharp-peaked, heavy-tailed distributions of latent variables, improving coding performance over traditional Gaussian models.

Details

Motivation: ROI-based image compression creates uneven bit allocation leading to sharp-peaked, heavy-tailed distributions that Gaussian models fail to accurately describe, resulting in coding performance loss.

Method: Develops unified rate-distortion optimization theory, proposes GGM for flexible distribution modeling, introduces differentiable functions and dynamic lower bound for stable optimization, and uses finite differences for gradient computation.

Result: Achieves SOTA on COCO2017 for ROI reconstruction and downstream tasks (segmentation, object detection), provides more precise distribution fitting than classical probability models, and superior coding performance.

Conclusion: GGM effectively addresses distribution modeling challenges in ROI compression, offering improved performance for both reconstruction and downstream vision tasks.

Abstract: Region-of-Interest (ROI)-based image compression allocates bits unevenly according to the semantic importance of different regions. Such differentiated coding typically induces a sharp-peaked and heavy-tailed distribution. This distribution characteristic mathematically necessitates a probability model with adaptable shape parameters for accurate description. However, existing methods commonly use a Gaussian model to fit this distribution, resulting in a loss of coding performance. To systematically analyze the impact of this distribution on ROI coding, we develop a unified rate-distortion optimization theoretical paradigm. Building on this paradigm, we propose a novel Generalized Gaussian Model (GGM) to achieve flexible modeling of the latent variables distribution. To support stable optimization of GGM, we introduce effective differentiable functions and further propose a dynamic lower bound to alleviate train-test mismatch. Moreover, finite differences are introduced to solve the gradient computation after GGM fits the distribution. Experiments on COCO2017 demonstrate that our method achieves state-of-the-art in both ROI reconstruction and downstream tasks (e.g., Segmentation, Object Detection). Furthermore, compared to classical probability models, our GGM provides a more precise fit to feature distributions and achieves superior coding performance. The project page is at https://github.com/hukai-tju/ROIGGM.

[1643] Lightweight Super Resolution-enabled Coding Model for the JPEG Pleno Learning-based Point Cloud Coding Standard

André F. R. Guarda, Nuno M. M. Rodrigues, Fernando Pereira

Main category: eess.IV

TL;DR: A lightweight point cloud geometry coding model that reduces JPEG Pleno standard complexity by 70% while maintaining compression efficiency.

Details

Motivation: Point cloud applications need efficient coding due to large data volumes (millions of points per object). The JPEG Pleno standard has competitive performance but high complexity, limiting adoption in resource-constrained environments.

Method: Proposes a novel lightweight coding model using compressed domain approach for super-resolution and major reduction of latent channels. Achieves 70% parameter reduction while maintaining compression efficiency.

Result: 70% reduction in total model parameters with slight average compression performance gains on JPEG Pleno Point Cloud coding dataset.

Conclusion: The lightweight model enables broader adoption of JPEG Pleno standard in resource-constrained environments while maintaining competitive compression performance.

Abstract: While point cloud-based applications are gaining traction due to their ability to provide rich and immersive experiences, they critically need efficient coding solutions due to the large volume of data involved, often many millions of points per object. The JPEG Pleno Learning-based Point Cloud Coding standard, as the first learning-based coding standard for static point clouds, has set a foundational framework with very competitive compression performance regarding the relevant conventional and learning-based alternative point cloud coding solutions. This paper proposes a novel lightweight point cloud geometry coding model that significantly reduces the complexity of the standard, which is essential for the broad adoption of this coding standard, particularly in resource-constrained environments, while simultaneously achieving small average compression efficiency benefits. The novel coding model is based on the pioneering adoption of a compressed domain approach for the super-resolution model, in addition to a major reduction of the number of latent channels. A reduction of approximately 70% in the total number of model parameters is achieved while simultaneously offering slight average compression performance gains for the JPEG Pleno Point Cloud coding dataset.

[1644] Hyperspectral Image Fusion with Spectral-Band and Fusion-Scale Agnosticism

Yu-Jie Liang, Zihan Cao, Liang-Jian Deng, Yang Yang, Malu Zhang

Main category: eess.IV

TL;DR: SSA is a universal framework for multispectral/hyperspectral image fusion that achieves both spectral-band and fusion-scale agnosticism through novel Matryoshka Kernel operators and implicit neural representations.

Details

Motivation: Current MS/HS fusion models are limited to fixed spectral bands and spatial scales, making them non-transferable across different sensors. There's a need for a universal framework that can generalize to unseen sensors and scales.

Method: Proposes SSA framework with two key innovations: 1) Matryoshka Kernel (MK) operator that enables adaptation to arbitrary numbers of spectral channels, and 2) Implicit Neural Representation (INR) backbone that models HS signal as continuous function for arbitrary spatial resolution reconstruction.

Result: Extensive experiments show the single model achieves state-of-the-art performance while generalizing well to unseen sensors and scales, demonstrating effective transferability.

Conclusion: SSA enables a single MS/HS fusion model that generalizes effectively across diverse sensors and spatial scales, paving the way toward future hyperspectral foundation models.

Abstract: Current deep learning models for Multispectral and Hyperspectral Image Fusion (MS/HS fusion) are typically designed for fixed spectral bands and spatial scales, which limits their transferability across diverse sensors. To address this, we propose SSA, a universal framework for MS/HS fusion with spectral-band and fusion-scale agnosticism. Specifically, we introduce Matryoshka Kernel (MK), a novel operator that enables a single model to adapt to arbitrary numbers of spectral channels. Meanwhile, we build SSA upon an Implicit Neural Representation (INR) backbone that models the HS signal as a continuous function, enabling reconstruction at arbitrary spatial resolutions. Together, these two forms of agnosticism enable a single MS/HS fusion model that generalizes effectively to unseen sensors and spatial scales. Extensive experiments demonstrate that our single model achieves state-of-the-art performance while generalizing well to unseen sensors and scales, paving the way toward future HS foundation models.

[1645] Diagnostic Impact of Cine Clips for Thyroid Nodule Assessment on Ultrasound

Jichen Yang, Brian C. Allen, Kirti Magudia, Lisa M. Ho, Chad M. Miller, Maciej A. Mazurowski, Benjamin Wildman-Tobriner

Main category: eess.IV

TL;DR: Cine imaging in thyroid ultrasound doesn’t significantly improve diagnostic accuracy for nodule assessment compared to static images alone.

Details

Motivation: To evaluate whether cine clips (video recordings) in thyroid ultrasound provide additional diagnostic value beyond static images for thyroid nodule assessment using ACR TI-RADS.

Method: Reader study with 4 radiologists assessing 100 thyroid nodules (50 benign, 50 malignant) over 3 rounds: first two rounds with static images only, third round with both static and cine images. Evaluated TI-RADS scores and management recommendations compared to cytopathology results.

Result: No significant improvement in sensitivity (0.65 vs 0.67) or specificity (0.20 vs 0.22) with cine imaging. Management recommendations were similar, though TI-RADS point totals were slightly higher with cine images.

Conclusion: Cine imaging doesn’t significantly change diagnostic performance for thyroid nodule assessment. Current guidelines without mandatory cine imaging are sufficient for accurate diagnosis.

Abstract: Background: Thyroid ultrasound is commonly performed using a combination of static images and cine clips (video recordings). However, the exact utility and impact of cine images remains unknown. This study aimed to evaluate the impact of cine imaging on accuracy and consistency of thyroid nodule assessment, using the American College of Radiology Thyroid Reporting and Data System (ACR TI-RADS). Methods: 50 benign and 50 malignant thyroid nodules with cytopathology results were included. A reader study with 4 specialty-trained radiologists was then conducted over 3 rounds, assessing only static images in the first two rounds and both static and cine images in the third round. TI-RADS scores and the consequent management recommendations were then evaluated by comparing them to the malignancy status of the nodules. Results: Mean sensitivity for malignancy detection was 0.65 for static images and 0.67 with both static and cine images (p>0.5). Specificity was 0.20 for static images and 0.22 with both static and cine images (p>0.5). Management recommendations were similar with and without cine images. Intrareader agreement on feature assignments remained consistent across all rounds, though TI-RADS point totals were slightly higher with cine images. Conclusion: The inclusion of cine imaging for thyroid nodule assessment on ultrasound did not significantly change diagnostic performance. Current practice guidelines, which do not mandate cine imaging, are sufficient for accurate diagnosis.

[1646] Coordinate-conditioned Deconvolution for Scalable Spatially Varying High-Throughput Imaging

Qianwan Yang, Zhixiong Chen, Jiaqi Zhang, Ruipeng Guo, Guorong Hu, Lei Tian

Main category: eess.IV

TL;DR: SV-CoDe is a scalable deep learning framework using coordinate-conditioned convolutions for spatially varying deconvolution in wide-field fluorescence microscopy, enabling patch-based training and uniform high-resolution reconstruction across large fields of view.

Details

Motivation: Compact wide-field fluorescence microscopy suffers from spatially varying blur due to field-dependent aberrations, vignetting, and sensor truncation. Existing learning-based spatially varying reconstruction methods have memory and training costs that scale poorly with image dimensions.

Method: Proposes SV-CoDe (Spatially Varying Coordinate-conditioned Deconvolution) that uses coordinate-conditioned convolutions to locally adapt reconstruction kernels, enabling patch-based training that decouples parameter count from field of view size.

Result: Achieves best image quality in simulated and experimental measurements while requiring 10x less model size and 10x less training data than prior baselines. Generalizes robustly to bead phantoms, weakly scattering brain slices, and freely moving C. elegans.

Conclusion: SV-CoDe offers a scalable, physics-aware solution for correcting spatially varying blur in compact optical systems and is readily extendable to a broad range of biomedical imaging applications.

Abstract: Wide-field fluorescence microscopy with compact optics often suffers from spatially varying blur due to field-dependent aberrations, vignetting, and sensor truncation, while finite sensor sampling imposes an inherent trade-off between field of view (FOV) and resolution. Computational Miniaturized Mesoscope (CM2) alleviate the sampling limit by multiplexing multiple sub-views onto a single sensor, but introduce view crosstalk and a highly ill-conditioned inverse problem compounded by spatially variant point spread functions (PSFs). Prior learning-based spatially varying (SV) reconstruction methods typically rely on global SV operators with fixed input sizes, resulting in memory and training costs that scale poorly with image dimensions. We propose SV-CoDe (Spatially Varying Coordinate-conditioned Deconvolution), a scalable deep learning framework that achieves uniform, high-resolution reconstruction across a 6.5 mm FOV. Unlike conventional methods, SV-CoDe employs coordinate-conditioned convolutions to locally adapt reconstruction kernels; this enables patch-based training that decouples parameter count from FOV size. SV-CoDe achieves the best image quality in both simulated and experimental measurements while requiring 10x less model size and 10x less training data than prior baselines. Trained purely on physics-based simulations, the network robustly generalizes to bead phantoms, weakly scattering brain slices, and freely moving C. elegans. SV-CoDe offers a scalable, physics-aware solution for correcting SV blur in compact optical systems and is readily extendable to a broad range of biomedical imaging applications.

[1647] SatFusion: A Unified Framework for Enhancing Remote Sensing Image via Multi-Frame and Multi-Source Image Fusion

Yufei Tong, Guanjie Cheng, Peihan Wu, Feiyi Chen, Xinkui Zhao, Shuiguang Deng

Main category: eess.IV

TL;DR: SatFusion: A unified framework for remote sensing image enhancement via joint multi-frame and multi-source fusion, combining multi-frame super-resolution with pansharpening techniques.

Details

Motivation: Remote sensing imaging faces hardware constraints and physical limitations, making high-quality image acquisition challenging. Existing approaches like multi-frame super-resolution (MFSR) and pansharpening are studied in isolation, each with limitations: MFSR lacks high-resolution structural priors, while pansharpening depends on upsampled images and is sensitive to noise/misalignment. With Satellite IoT development, leveraging large numbers of low-quality complementary images becomes increasingly important.

Method: Proposes SatFusion framework with two modules: 1) Multi-Frame Image Fusion (MFIF) extracts high-resolution semantic features from multiple low-resolution multispectral frames, 2) Multi-Source Image Fusion (MSIF) integrates fine-grained structural information from high-resolution panchromatic images with implicit pixel-level alignment. SatFusion* variant adds panchromatic-guided mechanism to multi-frame fusion stage, combining structure-aware feature embedding with transformer-based adaptive aggregation for spatially adaptive feature selection.

Result: Extensive experiments on WorldStrat, WV3, QB, and GF2 datasets demonstrate consistent outperformance over existing approaches in reconstruction quality, robustness, and generalizability.

Conclusion: SatFusion provides a unified framework that effectively combines multi-frame and multi-source fusion for remote sensing image enhancement, addressing limitations of isolated approaches and leveraging complementary information from multiple sources.

Abstract: Remote sensing (RS) imaging is constrained by hardware cost and physical limitations, making high-quality image acquisition challenging and motivating image fusion for quality enhancement. Multi-frame super-resolution (MFSR) and Pansharpening exploit complementary information from multiple frames and multiple sources, respectively, but are usually studied in isolation: MFSR lacks high-resolution structural priors for fine-grained texture recovery, while Pansharpening depends on upsampled multispectral images and is sensitive to noise and misalignment. With the rapid development of the Satellite Internet of Things (Sat-IoT), effectively leveraging large numbers of low-quality yet information-complementary images has become increasingly important. To this end, we propose SatFusion, a unified framework for enhancing RS images via joint multi-frame and multi-source fusion. SatFusion employs a Multi-Frame Image Fusion (MFIF) module to extract high-resolution semantic features from multiple low-resolution multispectral frames, and integrates fine-grained structural information from a high-resolution panchromatic image through a Multi-Source Image Fusion (MSIF) module, enabling robust feature integration with implicit pixel-level alignment. To further mitigate the lack of structural priors in multi-frame fusion, we introduce SatFusion*, which incorporates a panchromatic-guided mechanism into the multi-frame fusion stage. By combining structure-aware feature embedding with transformer-based adaptive aggregation, SatFusion* enables spatially adaptive selection of multi-frame features and strengthens the coupling between multi-frame and multi-source representations. Extensive experiments on the WorldStrat, WV3, QB, and GF2 datasets demonstrate that our methods consistently outperform existing approaches in terms of reconstruction quality, robustness, and generalizability.

[1648] A texture-based framework for foundational ultrasound models

Tal Grutman, Carmel Shinar, Tali Ilovitsh

Main category: eess.IV

TL;DR: TUSA is a self-supervised learning framework that reformulates ultrasound analysis as a texture problem, integrating ultrasound physics into foundation models to improve medical imaging performance.

Details

Motivation: Ultrasound images have unique acoustic properties that differ from natural images, causing standard foundation models to underperform. Current models trained on ultrasound data lack integration of ultrasound physics, necessitating domain-specific approaches.

Method: Proposed Texture Ultrasound Semantic Analysis (TUSA) reformulates self-supervised learning as texture analysis. Uses contrastive methods to extract domain-specific representations from B-mode images. Trained on combination of open-source, simulated, and in vivo data.

Result: TUSA outperforms larger foundation models on downstream tasks: COVID detection (70%), spinal hematoma (100%), vitreous hemorrhage (97%). Better correlation with quantitative parameters: liver steatosis (r=0.83), ejection fraction (r=0.63), oxygen saturation (r=0.38).

Conclusion: Integrating ultrasound physics into learning frameworks via texture analysis improves model generalizability for medical imaging tasks. TUSA demonstrates superior performance over standard foundation models on diverse clinical applications.

Abstract: Ultrasound is the most widely used medical imaging modality, yet the images it produces are fundamentally unique, arising from tissue-dependent scattering, reflection, and speed-of-sound variations that produce a constrained set of characteristic textures that differ markedly from natural-image statistics. These acoustically driven patterns make ultrasound challenging for algorithms originally designed for natural images. To bridge this gap, the field has increasingly turned to foundation models, hoping to leverage their generalization capabilities. However, these models often falter in ultrasound applications because they are not designed for ultrasound physics, they are merely trained on ultrasound data. Therefore, it is essential to integrate ultrasound-specific domain knowledge into established learning frameworks. We achieve this by reformulating self-supervised learning as a texture-analysis problem, introducing texture ultrasound semantic analysis (TUSA). Using TUSA, models learn to leverage highly scalable contrastive methods to extract true domain-specific representations directly from simple B-mode images. We train a TUSA model on a combination of open-source, simulated, and in vivo data. The latent space is compared to several larger foundation models, demonstrating that our approach gives TUSA models better generalizability for difficult downstream tasks on unique online datasets as well as a clinical eye dataset collected for this study. Our model achieves higher accuracy in detecting COVID (70%), spinal hematoma (100%) and vitreous hemorrhage (97%) and correlates more closely with quantitative parameters like liver steatosis (r = 0.83), ejection fraction (r = 0.63), and oxygen saturation (r = 0.38). We open-source the model weights and training script: https://github.com/talg2324/tusa

[1649] Generative Video Compression: Towards 0.01% Compression Rate for Video Transmission

Xiangyu Chen, Jixiang Luo, Jingyu Xu, Fangqiu Yi, Chi Zhang, Xuelong Li

Main category: eess.IV

TL;DR: GVC is a novel video compression framework using generative video models to achieve extreme compression rates (as low as 0.02%) by shifting computational burden from transmission to receiver-side inference.

Details

Motivation: The paper addresses the challenge of achieving extreme video compression rates (as low as 0.01%) for bandwidth-constrained environments like emergency rescue, remote surveillance, and mobile edge computing, while maintaining perceptual quality.

Method: GVC leverages modern generative video models to encode videos into extremely compact representations, then uses powerful generative priors at the receiver to synthesize high-quality video from minimal transmitted information. It introduces a compression-computation trade-off strategy for practical deployment on consumer-grade GPUs.

Result: The framework achieves compression rates as low as 0.02% in some cases, demonstrating viability for extreme compression while maintaining perceptual quality. It enables practical deployment through efficient computation strategies.

Conclusion: GVC offers a new effective, efficient, scalable, and practical video communication paradigm for bandwidth- and resource-constrained environments by redefining compression limits through generative models.

Abstract: Whether a video can be compressed at an extreme compression rate as low as 0.01%? To this end, we achieve the compression rate as 0.02% at some cases by introducing Generative Video Compression (GVC), a new framework that redefines the limits of video compression by leveraging modern generative video models to achieve extreme compression rates while preserving a perception-centric, task-oriented communication paradigm, corresponding to Level C of the Shannon-Weaver model. Besides, How we trade computation for compression rate or bandwidth? GVC answers this question by shifting the burden from transmission to inference: it encodes video into extremely compact representations and delegates content reconstruction to the receiver, where powerful generative priors synthesize high-quality video from minimal transmitted information. Is GVC practical and deployable? To ensure practical deployment, we propose a compression-computation trade-off strategy, enabling fast inference on consume-grade GPUs. Within the AI Flow framework, GVC opens new possibility for video communication in bandwidth- and resource-constrained environments such as emergency rescue, remote surveillance, and mobile edge computing. Through empirical validation, we demonstrate that GVC offers a viable path toward a new effective, efficient, scalable, and practical video communication paradigm.

[1650] MarkCleaner: High-Fidelity Watermark Removal via Imperceptible Micro-Geometric Perturbation

Xiaoxi Kong, Jieyu Yuan, Pengdi Chen, Yuanlin Zhang, Chongyi Li, Bin Li

Main category: eess.IV

TL;DR: MarkCleaner is a watermark removal framework that uses micro-geometric perturbations to break semantic watermarks while preserving image content, achieving real-time performance with high visual fidelity.

Details

Motivation: Semantic watermarks are robust against conventional attacks but vulnerable to micro-geometric perturbations that break phase alignment. Current regeneration-based methods cause semantic drift, so a new approach is needed to remove watermarks without altering image content.

Method: MarkCleaner uses micro-geometry-perturbed supervision to separate semantic content from spatial alignment. It employs a mask-guided encoder for spatial representations and a 2D Gaussian Splatting-based decoder to parameterize geometric perturbations while preserving semantics.

Result: Extensive experiments show MarkCleaner achieves superior watermark removal effectiveness and visual fidelity compared to existing methods, while enabling efficient real-time inference.

Conclusion: Micro-geometric perturbations effectively break semantic watermarks, and MarkCleaner provides a practical solution for watermark removal without semantic drift, with real-time performance capabilities.

Abstract: Semantic watermarks exhibit strong robustness against conventional image-space attacks. In this work, we show that such robustness does not survive under micro-geometric perturbations: spatial displacements can remove watermarks by breaking the phase alignment. Motivated by this observation, we introduce MarkCleaner, a watermark removal framework that avoids semantic drift caused by regeneration-based watermark removal. Specifically, MarkCleaner is trained with micro-geometry-perturbed supervision, which encourages the model to separate semantic content from strict spatial alignment and enables robust reconstruction under subtle geometric displacements. The framework adopts a mask-guided encoder that learns explicit spatial representations and a 2D Gaussian Splatting-based decoder that explicitly parameterizes geometric perturbations while preserving semantic content. Extensive experiments demonstrate that MarkCleaner achieves superior performance in both watermark removal effectiveness and visual fidelity, while enabling efficient real-time inference. Our code will be made available upon acceptance.

[1651] Edge-Aligned Initialization of Kernels for Steered Mixture-of-Experts

Martin Determann, Elvira Fleig

Main category: eess.IV

TL;DR: Proposes edge-based initialization for Steered Mixture-of-Experts (SMoE) image modeling using Canny edge detection to reduce computational cost and improve convergence.

Details

Motivation: SMoE is powerful for spatial-domain image modeling but suffers from computationally intensive gradient-based optimization requiring per-image parameter estimation. Current initialization strategies directly affect convergence and reconstruction quality, creating a need for more efficient initialization methods.

Method: Uses Canny edge detection to extract sparse image contours, then deterministically infers kernel positions and orientations. A separate approach enables direct estimation of initial expert coefficients, reducing both memory consumption and computational cost.

Result: Achieves good reconstruction quality while significantly reducing the need for stochastic optimization through the edge-based initialization scheme.

Conclusion: The proposed edge-based initialization scheme makes SMoE more practical by reducing computational burden while maintaining reconstruction quality, addressing a key barrier to practical adoption.

Abstract: Steered Mixture-of-Experts (SMoE) has recently emerged as a powerful framework for spatial-domain image modeling, enabling high-fidelity image representation using a remarkably small number of parameters. Its ability to steer kernel-based experts toward structural image features has led to successful applications in image compression, denoising, super-resolution, and light field processing. However, practical adoption is hindered by the reliance on gradient-based optimization to estimate model parameters on a per-image basis - a process that is computationally intensive and difficult to scale. Initialization strategies for SMoE are an essential component that directly affects convergence and reconstruction quality. In this paper, we propose a novel, edge-based initialization scheme that achieves good reconstruction qualities while reducing the need for stochastic optimization significantly. Through a method that leverages Canny edge detection to extract a sparse set of image contours, kernel positions and orientations are deterministically inferred. A separate approach enables the direct estimation of initial expert coefficients. This initialization reduces both memory consumption and computational cost.

[1652] Streamlined Hybrid Annotation Framework using Scalable Codestream for Bandwidth-Restricted UAV Object Detection

Karim El Khoury, Tiffanie Godelaine, Simon Delvaux, Sebastien Lugan, Benoit Macq

Main category: eess.IV

TL;DR: A hybrid annotation framework using JPEG 2000 compression and deep learning for faster emergency UAV image analysis under bandwidth constraints.

Details

Motivation: Emergency response missions rely on fast visual information relay via UAVs, but bandwidth limitations delay data transmission and decision-making in critical situations.

Method: Uses JPEG 2000 compression with a fine-tuned deep learning network for initial low-resolution annotation, then selectively enhances resolution in critical areas using JPEG 2000’s scalable codestream for human expert review.

Result: The proposed hybrid framework reduces response time by a factor of 34 compared to baseline approaches in emergency situations.

Conclusion: The framework effectively addresses bandwidth limitations in emergency UAV operations by combining automated detection with selective human-in-the-loop annotation.

Abstract: Emergency response missions depend on the fast relay of visual information, a task to which unmanned aerial vehicles are well adapted. However, the effective use of unmanned aerial vehicles is often compromised by bandwidth limitations that impede fast data transmission, thereby delaying the quick decision-making necessary in emergency situations. To address these challenges, this paper presents a streamlined hybrid annotation framework that utilizes the JPEG 2000 compression algorithm to facilitate object detection under limited bandwidth. The proposed framework employs a fine-tuned deep learning network for initial image annotation at lower resolutions and uses JPEG 2000’s scalable codestream to selectively enhance the image resolution in critical areas that require human expert annotation. We show that our proposed hybrid framework reduces the response time by a factor of 34 in emergency situations compared to a baseline approach.

[1653] Future frame prediction in chest and liver cine MRI using the PCA respiratory motion model: comparing transformers and dynamically trained recurrent neural networks

Michel Pohl, Mitsuru Uesaka, Hiroyuki Takahashi, Kazuyuki Demachi, Ritu Bhusal Chhatkuli

Main category: eess.IV

TL;DR: Paper investigates frame forecasting in chest and liver cine MRI for radiotherapy compensation, comparing RNNs with online learning and transformers for predicting respiratory motion patterns.

Details

Motivation: Respiratory motion causes target-location uncertainties in thoraco-abdominal tumor radiotherapy due to treatment-system latency. Need to forecast future frames in cine MRI to compensate for these delays and improve treatment accuracy.

Method: Uses PCA to decompose optical-flow fields into static deformations and time-dependent weights. Compares linear filters, population/sequence-specific encoder-only transformers, and RNNs trained with various online learning algorithms (RTRL, UORO, DNI, SnAp-1) for forecasting low-dimensional motion weights.

Result: Linear regression performed best at short horizons (1.3mm error at 0.32s), while RTRL and SnAp-1 outperformed others at medium-to-long horizons (errors below 1.4mm and 2.8mm on different datasets). Transformers were competitive for low-to-medium horizons but limited by data scarcity and domain shift.

Conclusion: Online learning RNNs (RTRL, SnAp-1) show promise for respiratory motion forecasting in radiotherapy applications, outperforming transformers in medium-to-long horizons. The approach enables adaptation to changing respiratory patterns via on-the-fly parameter updates.

Abstract: Respiratory motion complicates accurate irradiation of thoraco-abdominal tumors in radiotherapy, as treatment-system latency entails target-location uncertainties. This work addresses frame forecasting in chest and liver cine MRI to compensate for such delays. We investigate RNNs trained with online learning algorithms, enabling adaptation to changing respiratory patterns via on-the-fly parameter updates, and transformers, increasingly common in time series forecasting for their ability to capture long-term dependencies. Experiments were conducted using 12 sagittal thoracic and upper-abdominal cine-MRI sequences from ETH Zürich and OvGU. PCA decomposes the Lucas-Kanade optical-flow field into static deformations and low-dimensional time-dependent weights. We compare various methods forecasting the latter: linear filters, population and sequence-specific encoder-only transformers, and RNNs trained with real-time recurrent learning (RTRL), unbiased online recurrent optimization, decoupled neural interfaces, and sparse one-step approximation (SnAp-1). Predicted displacements were used to warp the reference frame and generate future images. Prediction accuracy decreased with the horizon h. Linear regression performed best at short horizons (1.3mm geometrical error at h=0.32s, ETH Zürich data), while RTRL and SnAp-1 outperformed the other algorithms at medium-to-long horizons, with geometrical errors below 1.4mm and 2.8mm on the sequences from ETH Zürich and OvGU (the latter featuring higher motion variability, noise, and lower contrast), respectively. The sequence-specific transformer was competitive for low-to-medium horizons, but transformers remained overall limited by data scarcity and domain shift between datasets. Predicted frames visually resembled the ground truth, with notable errors occurring near the diaphragm at end-inspiration and regions affected by out-of-plane motion.

[1654] Scalable dataset acquisition for data-driven lensless imaging

Clara S. Hung, Leyla A. Kabuli, Vasilisa Ponomarenko, Laura Waller

Main category: eess.IV

TL;DR: Open-access 25,000 image dataset for lensless imaging with synchronized multi-camera acquisition and computational ground truth registration.

Details

Motivation: Data-driven developments in lensless imaging require large datasets, but existing datasets are limited. There's a need for comprehensive datasets captured under controlled conditions with ground truth registration.

Method: Developed a data acquisition pipeline with multiple synchronized lensless imaging systems capturing images in parallel under identical conditions. Created open-source camera synchronization code and reproducible hardware setup.

Result: Released an open-access dataset of 25,000 images from two lensless imagers with paired computational ground truth registration. The system enables reproducible data collection for machine learning applications.

Conclusion: The dataset and acquisition pipeline provide valuable resources for data-driven developments in lensless imaging, including machine learning-based reconstruction algorithms and end-to-end system design.

Abstract: Data-driven developments in lensless imaging, such as machine learning-based reconstruction algorithms, require large datasets. In this work, we introduce a data acquisition pipeline that can capture from multiple lensless imaging systems in parallel, under the same imaging conditions, and paired with computational ground truth registration. We provide an open-access 25,000 image dataset with two lensless imagers, a reproducible hardware setup, and open-source camera synchronization code. Experimental datasets from our system can enable data-driven developments in lensless imaging, such as machine learning-based reconstruction algorithms and end-to-end system design.

[1655] From Slices to Structures: Unsupervised 3D Reconstruction of Female Pelvic Anatomy from Freehand Transvaginal Ultrasound

Max Krähenmann, Sergio Tascon-Morales, Fabian Laumer, Julia E. Vogt, Ece Ozkan

Main category: eess.IV

TL;DR: TVGS: Unsupervised 3D reconstruction from freehand 2D transvaginal ultrasound sweeps using Gaussian Splatting adaptation with slice-aware rasterizer and joint pose-structure optimization.

Details

Motivation: Volumetric ultrasound could improve diagnostic accuracy but adoption is limited by specialized hardware and restrictive acquisition protocols. Current methods require external tracking or learned pose estimators.

Method: Adapts Gaussian Splatting to ultrasound domain with slice-aware differentiable rasterizer tailored to ultrasound physics/geometry. Models anatomy as anisotropic 3D Gaussians optimized from image-level supervision. Joint optimization refines slice poses alongside anatomical structure for robustness against irregular probe motion.

Result: Creates compact, flexible, memory-efficient volumetric representation capturing anatomical detail with high spatial fidelity. Demonstrates accurate 3D reconstruction from 2D ultrasound through purely computational means.

Conclusion: Offers scalable alternative to conventional 3D ultrasound systems, enabling new opportunities for AI-assisted analysis and diagnosis without specialized hardware.

Abstract: Volumetric ultrasound has the potential to significantly improve diagnostic accuracy and clinical decision-making, yet its widespread adoption remains limited by dependence on specialized hardware and restrictive acquisition protocols. In this work, we present a novel unsupervised framework for reconstructing 3D anatomical structures from freehand 2D transvaginal ultrasound sweeps, without requiring external tracking or learned pose estimators. Our method, TVGS, adapts the principles of Gaussian Splatting to the domain of ultrasound, introducing a slice-aware, differentiable rasterizer tailored to the unique physics and geometry of ultrasound imaging. We model anatomy as a collection of anisotropic 3D Gaussians and optimize their parameters directly from image-level supervision. To ensure robustness against irregular probe motion, we introduce a joint optimization scheme that refines slice poses alongside anatomical structure. The result is a compact, flexible, and memory-efficient volumetric representation that captures anatomical detail with high spatial fidelity. This work demonstrates that accurate 3D reconstruction from 2D ultrasound images can be achieved through purely computational means, offering a scalable alternative to conventional 3D systems and enabling new opportunities for AI-assisted analysis and diagnosis.

[1656] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

Yan Chen, Yi Wen, Wei Li, Junchao Liu, Yong Guo, Jie Hu, Xinghao Chen

Main category: eess.IV

TL;DR: RDDM is an end-to-end diffusion model that restores photo-realistic images directly from sensor RAW data, bypassing conventional ISP pipelines for higher fidelity restoration.

Details

Motivation: Current sRGB-domain diffusion models face a dilemma between high fidelity and image generation, processing lossy sRGB inputs while ignoring accessible RAW data from edge devices, leading to suboptimal performance.

Method: Proposes RAW-domain VAE (RVAE) to handle domain distribution issues, configurable multi-bayer LoRA module for diverse RAW Bayer patterns, and a scalable data synthesis pipeline for training on RAW LQ-HQ pairs from existing sRGB datasets.

Result: RDDM demonstrates superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts in extensive experiments.

Conclusion: Direct RAW domain restoration with RDDM outperforms conventional two-stage ISP->IR pipelines, offering better image restoration quality by leveraging sensor RAW data directly.

Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and image generation. These models process lossy sRGB inputs and neglect the accessibility of the sensor RAW images in many scenarios, e.g., in image and video capturing in edge devices, resulting in sub-optimal performance. RDDM obviates this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP)->Image Restoration (IR) pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts many challenges. To this end, we propose: (1) a RAW-domain VAE (RVAE), encoding sensor RAW and decoding it into an enhanced linear domain image, to solve the out-of-distribution (OOD) issues between the different domain distributions; (2) a configurable multi-bayer (CMB) LoRA module, adapting diverse RAW Bayer patterns such as RGGB, BGGR, etc. To compensate for the deficiency in the dataset, we develop a scalable data synthesis pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts. Codes will be publicly available at https://github.com/YanCHEN-fr/RDDM.

[1657] AI-Based Stroke Rehabilitation Domiciliary Assessment System with ST_GCN Attention

Suhyeon Lim, Ye-eun Kim, Andrew J. Choi

Main category: eess.IV

TL;DR: Home-based stroke rehabilitation system using RGB-D cameras and wearables with RAST-G@ deep learning model for movement assessment and feedback.

Details

Motivation: Stroke recovery requires continuous rehabilitation integrated into daily living, but current systems lack effective home-based solutions with quantitative assessment and feedback.

Method: System includes: (1) hardware with RGB-D camera and wearables, (2) mobile app for guidance, (3) AI server with RAST-G@ model (spatio-temporal graph CNN + transformer attention) for movement assessment. Uses NRC dataset of 10 ADL and 5 ROM activities with therapist scores.

Result: RAST-G@ outperforms baselines on KIMORE and NRC datasets in MAD, RMSE, and MAPE metrics. System provides patient-centered assessment and monitoring feedback.

Conclusion: Proposed system offers scalable, quantitative, and consistent home-based rehabilitation assessment for stroke recovery.

Abstract: Effective stroke recovery requires continuous rehabilitation integrated with daily living. To support this need, we propose a home-based rehabilitation exercise and feedback system. The system consists of (1) hardware setup with RGB-D camera and wearable sensors to capture stroke movements, (2) a mobile application for exercise guidance, and (3) an AI server for assessment and feedback. When a stroke user exercises following the application guidance, the system records skeleton sequences, which are then assessed by the deep learning model, RAST-G@ (Rehabilitation Assessment Spatio-Temporal Graph ATtention). The model employs a spatio-temporal graph convolutional network to extract skeletal features and integrates transformer-based temporal attention to figure out action quality. For system implementation, we constructed the NRC dataset, include 10 upper-limb activities of daily living (ADL) and 5 range-of-motion (ROM) collected from stroke and non-disabled participants, with Score annotations provided by licensed physiotherapists. Results on the KIMORE and NRC datasets show that RAST-G@ improves over baseline in terms of MAD, RMSE, and MAPE. Furthermore, the system provides user feedback that combines patient-centered assessment and monitoring. The results demonstrate that the proposed system offers a scalable approach for quantitative and consistent domiciliary rehabilitation assessment.

[1658] Tumor-anchored deep feature random forests for out-of-distribution detection in lung cancer segmentation

Aneesh Rangnekar, Harini Veeraraghavan

Main category: eess.IV

TL;DR: RF-Deep: A lightweight random forests-based OOD detection framework for tumor segmentation in CT scans that uses deep features from pretrained transformers to detect out-of-distribution inputs without increasing model complexity.

Details

Motivation: Current tumor segmentation models for CT scans are vulnerable to out-of-distribution inputs, producing confidently incorrect segmentations that pose clinical risks. Existing OOD detection methods have limitations: logit-based approaches suffer from task-specific biases, while architectural modifications increase computational costs.

Method: RF-Deep is a post-hoc, plug-and-play framework that uses random forests for OOD detection. It extracts hierarchical features from pretrained-then-finetuned transformer backbones, focusing on multiple regions of interest anchored to predicted tumor segmentations. The approach requires limited outlier exposure and works with various network architectures.

Result: RF-Deep achieved AUROC > 93.50 for challenging near-OOD datasets (pulmonary embolism, negative COVID-19) and near-perfect detection (AUROC > 99.00) for far-OOD datasets (kidney cancer, healthy pancreas). It outperformed logit-based and radiomics approaches and maintained consistent performance across networks with different depths and pretraining strategies.

Conclusion: RF-Deep provides an effective, lightweight, architecture-agnostic solution for OOD detection in medical image segmentation, enhancing the reliability of tumor segmentation from CT volumes without increasing computational complexity.

Abstract: Accurate segmentation of cancerous lesions from 3D computed tomography (CT) scans is essential for automated treatment planning and response assessment. However, even state-of-the-art models combining self-supervised learning (SSL) pretrained transformers with convolutional decoders are susceptible to out-of-distribution (OOD) inputs, generating confidently incorrect tumor segmentations, posing risks to safe clinical deployment. Existing logit-based methods suffer from task-specific model biases, while architectural enhancements to explicitly detect OOD increase parameters and computational costs. Hence, we introduce a lightweight, plug-and-play post-hoc random forests-based OOD detection framework called RF-Deep that leverages deep features with limited outlier exposure. RF-Deep enhances generalization to imaging variations by repurposing the hierarchical features from the pretrained-then-finetuned backbone, providing task-relevant OOD detection by extracting the features from multiple regions of interest anchored to the predicted tumor segmentations. We compared RF-Deep against existing OOD detection methods using 2,056 CT scans across near-OOD (pulmonary embolism, negative COVID-19) and far-OOD (kidney cancer, healthy pancreas) datasets. RF-Deep achieved AUROC > 93.50 for the challenging near-OOD datasets and near-perfect detection (AUROC > 99.00) for the far-OOD datasets, substantially outperforming logit-based and radiomics approaches. RF-Deep maintained consistent performance across networks of different depths and pretraining strategies, demonstrating its effectiveness as a lightweight, architecture-agnostic approach to enhance the reliability of tumor segmentation from CT volumes.

[1659] Reinforced Rate Control for Neural Video Compression via Inter-Frame Rate-Distortion Awareness

Wuyang Cong, Junqi Shi, Lizhong Wang, Weijing Shi, Ming Lu, Hao Chen, Zhan Ma

Main category: eess.IV

TL;DR: RL-based rate control framework for neural video compression that jointly optimizes bitrate allocation and coding parameters through frame-by-frame sequential decisions, achieving better rate-distortion performance and bitrate adherence.

Details

Motivation: Existing neural video compression methods have superior compression efficiency but struggle with effective rate control due to complex temporal dependencies. Current rate control schemes focus on distortion interactions but overlook inter-frame rate dependencies, leading to suboptimal bitrate allocation and cascading parameter decisions.

Method: Proposes a reinforcement learning-based rate control framework that formulates the task as a frame-by-frame sequential decision process. An RL agent observes spatiotemporal states and selects coding parameters to optimize long-term reward reflecting rate-distortion performance and bitrate adherence. The approach jointly determines bitrate allocation and coding parameters in a single step, independent of GOP structure.

Result: Extensive experiments across diverse NVC architectures show the method reduces average relative bitrate error to 1.20% and achieves up to 13.45% bitrate savings at typical GOP sizes, outperforming existing approaches. The framework also demonstrates improved robustness to content variation and bandwidth fluctuations with lower coding overhead.

Conclusion: The RL-based rate control framework effectively addresses rate control challenges in neural video compression by jointly optimizing bitrate allocation and coding parameters, making it highly suitable for practical deployment with superior performance and robustness.

Abstract: Neural video compression (NVC) has demonstrated superior compression efficiency, yet effective rate control remains a significant challenge due to complex temporal dependencies. Existing rate control schemes typically leverage frame content to capture distortion interactions, overlooking inter-frame rate dependencies arising from shifts in per-frame coding parameters. This often leads to suboptimal bitrate allocation and cascading parameter decisions. To address this, we propose a reinforcement-learning (RL)-based rate control framework that formulates the task as a frame-by-frame sequential decision process. At each frame, an RL agent observes a spatiotemporal state and selects coding parameters to optimize a long-term reward that reflects rate-distortion (R-D) performance and bitrate adherence. Unlike prior methods, our approach jointly determines bitrate allocation and coding parameters in a single step, independent of group of pictures (GOP) structure. Extensive experiments across diverse NVC architectures show that our method reduces the average relative bitrate error to 1.20% and achieves up to 13.45% bitrate savings at typical GOP sizes, outperforming existing approaches. In addition, our framework demonstrates improved robustness to content variation and bandwidth fluctuations with lower coding overhead, making it highly suitable for practical deployment.

Editor’s Picks

[1] MTAVG-Bench: A Comprehensive Benchmark for Evaluating Multi-Talker Dialogue-Centric Audio-Video Generation

[2] Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning

[3] LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild

Today’s Research Highlights

Table of Contents

cs.CL

[1] PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering

[2] Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA

[3] G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models

[4] PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems

[5] SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

[6] Construct, Align, and Reason: Large Ontology Models for Enterprise Knowledge Management

[7] Reversible Diffusion Decoding for Diffusion Language Models

[8] DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

[9] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

[10] Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

[11] Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

[12] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

[13] A Baseline Multimodal Approach to Emotion Recognition in Conversations

[14] Detecting AI-Generated Content in Academic Peer Reviews

[15] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

[16] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

[17] DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models

[18] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

[19] Segment-Level Attribution for Selective Learning of Long Reasoning Traces

[20] When Agents “Misremember” Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

[21] What Matters to an LLM? Behavioral and Computational Evidences from Summarization

[22] Words that make SENSE: Sensorimotor Norms in Learned Lexical Token Representations

[23] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation

[24] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas

[25] Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design

[26] Reasoning by Commented Code for Table Question Answering

[27] A Hierarchical and Attentional Analysis of Argument Structure Constructions in BERT Using Naturalistic Corpora

[28] Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

[29] The French Drama Revolution: Political Economy and Literary Production, 1700-1900

[30] Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR

[31] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling

[32] Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars

[33] Transformer-Based Model for Multilingual Hope Speech Detection

[34] Language Family Matters: Evaluating LLM-Based ASR Across Linguistic Boundaries

[35] Jailbreaking LLMs via Calibration

[36] Formal Semantic Control over Language Models

[37] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

[38] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation

[39] EchoReview: Learning Peer Review from the Echoes of Scientific Citations

[40] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement

[41] CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

[42] Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

[43] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting

[44] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

[45] APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards

[46] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs

[47] Eliciting Trustworthiness Priors of Large Language Models via Economic Games

[48] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models

[49] HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference

[50] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

[51] Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation

[52] Unifying Adversarial Robustness and Training Across Text Scoring Models

[53] ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople

[54] EffGen: Enabling Small Language Models as Capable Autonomous Agents

[55] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts

[56] Neural FOXP2 – Language Specific Neuron Steering for Targeted Language Improvement in LLMs

[57] Verification Required: The Impact of Information Credibility on AI Persuasion

[58] Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals

[59] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA

[60] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

[61] Sparse Reward Subsystem in Large Language Models

[62] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

[63] Reliable Use of Lemmas via Eligibility Reasoning and Section$-$Aware Reinforcement Learning

[64] Distilling Token-Trained Models into Byte-Level Models

[65] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

[66] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents

[67] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

[68] From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization

[69] What If We Allocate Test-Time Compute Adaptively?

[70] Logic-Oriented Retriever Enhancement via Contrastive Learning

[71] Tendem: A Hybrid AI+Human Platform

[72] Long-range Modeling and Processing of Multimodal Event Sequences

[73] Don’t Judge a Book by its Cover: Testing LLMs’ Robustness Under Logical Obfuscation