Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin
Main category: cs.SD
TL;DR: Speech-XL is a novel model that addresses long-form audio understanding limitations in Large Speech Language Models by using Speech Summarization Tokens (SST) to compress speech intervals via KV sparsification, enabling efficient processing of extended audio sequences.
Details
Motivation: Current Large Speech Language Models struggle with long-form audio understanding due to limited context length and high memory requirements for processing extended audio sequences, creating a bottleneck for practical applications.
Method: Introduces Speech Summarization Tokens (SST) that encapsulate speech interval information into KV pairs, trained via instruction fine-tuning with curriculum learning from low to high compression ratios, leveraging LLMs’ intrinsic KV sparsification capacity.
Result: Achieves competitive performance on major benchmarks (LongSpeech and AUDIOMARATHON) despite using significantly less training data than other baselines, effectively addressing long-form audio modeling bottlenecks.
Conclusion: Speech-XL provides a novel approach to condensing extensive acoustic sequences by addressing key limitations in long-form audio understanding through efficient compression techniques.
Abstract: Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy where the SST learns to compress information in a progressive manner–advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
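As a rough intuition for the SST mechanism, the sketch below (hypothetical module names and shapes, not the authors' code) shows how a learnable summarization token could attend over a speech interval and collapse it into a single key/value pair for the LLM's cache.

```python
# Minimal sketch, assuming a learnable SST query that attends over one speech interval.
import torch
import torch.nn as nn

class IntervalSummarizer(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.sst_query = nn.Parameter(torch.randn(1, 1, d_model))  # learnable SST embedding
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, interval_feats: torch.Tensor):
        # interval_feats: (batch, frames_per_interval, d_model) speech features
        batch = interval_feats.size(0)
        query = self.sst_query.expand(batch, -1, -1)
        summary, _ = self.attn(query, interval_feats, interval_feats)  # (batch, 1, d_model)
        # The single summary vector stands in for the whole interval in the KV cache.
        return self.k_proj(summary), self.v_proj(summary)

feats = torch.randn(2, 200, 512)        # two intervals of 200 frames each
k, v = IntervalSummarizer(512)(feats)   # each interval collapses to one KV pair
print(k.shape, v.shape)                 # torch.Size([2, 1, 512]) for both
```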
Relevance: 9/10
[2] DCER: Dual-Stage Compression and Energy-Based Reconstruction
Yiwen Wang, Jiahao Qin
Main category: cs.LG
TL;DR: DCER: Dual-stage compression and energy-based reconstruction framework for robust multimodal fusion that handles noisy inputs and missing modalities through frequency transforms and learned energy functions.
Details
Motivation: Multimodal fusion faces two key robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. Existing methods often fail to address both issues simultaneously.
Method: Proposes DCER with dual-stage compression: 1) within-modality frequency transforms (wavelet for audio, DCT for video) to remove noise while preserving task-relevant patterns, and 2) cross-modality bottleneck tokens to force genuine integration. For missing modalities, uses energy-based reconstruction via gradient descent on a learned energy function.
Result: Achieves state-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS benchmarks. Shows U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. Energy function provides intrinsic uncertainty quantification with ρ > 0.72 correlation with prediction error.
Conclusion: DCER provides a unified framework addressing both noisy inputs and missing modalities in multimodal fusion, with energy-based reconstruction offering uncertainty quantification and robust performance across varying missing modality conditions.
Abstract: Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on GitHub.
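To make the energy-based reconstruction step concrete, here is a minimal sketch assuming a toy energy network and SGD on the candidate embedding; the module names and hyperparameters are illustrative, not taken from the paper.

```python
# Illustrative sketch: recover a missing-modality embedding by descending a learned energy.
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Scores how compatible a candidate embedding is with the observed modalities."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, candidate, observed):
        return self.net(torch.cat([candidate, observed], dim=-1)).squeeze(-1)

def reconstruct(energy: EnergyNet, observed: torch.Tensor, steps: int = 50, lr: float = 0.1):
    z = torch.zeros_like(observed, requires_grad=True)  # candidate for the missing modality
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        e = energy(z, observed).sum()   # lower energy = more plausible reconstruction
        e.backward()
        opt.step()
    # The final energy doubles as an uncertainty signal for the reconstruction.
    return z.detach(), energy(z, observed).detach()

obs = torch.randn(4, 256)               # embeddings from the observed modalities
z_hat, uncertainty = reconstruct(EnergyNet(256), obs)
```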
Relevance: 9/10
[3] A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
Xiaolin Hu, Hang Yuan, Xinzhu Sang, Binbin Yan, Zhou Yu, Cong Huang, Kai Chen
Main category: cs.LG
TL;DR: A^2-LLM is an end-to-end conversational audio avatar LLM that jointly models language, audio prosody, and 3D facial motion in a unified framework, generating emotionally rich facial movements beyond lip-sync.
Details
Motivation: Current conversational digital humans rely on cascaded architectures with accumulated errors, high latency, and poor real-time performance. These systems lack access to conversational context and prioritize rigid lip-sync over emotional depth.
Method: Propose A^2-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. Introduce FLAME-QA, a high-quality multimodal dataset aligning semantic intent with expressive facial dynamics in QA format.
Result: The system achieves superior emotional expressiveness while maintaining real-time efficiency with 500 ms latency and 0.7 RTF (real-time factor).
Conclusion: A^2-LLM addresses limitations of cascaded architectures by providing unified multimodal reasoning for conversational avatars, enabling emotionally rich facial movements beyond simple lip-synchronization.
Abstract: Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 91]
- cs.CV [Total: 169]
- cs.AI [Total: 92]
- cs.SD [Total: 10]
- cs.LG [Total: 267]
- cs.MA [Total: 4]
- cs.MM [Total: 1]
- eess.AS [Total: 10]
- eess.IV [Total: 13]
cs.CL
[1] BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations
Deepak Gupta, Davis Bartels, Dina Demner-Fushman
Main category: cs.CL
TL;DR: BioACE: Automated framework for evaluating biomedical answers and citations generated by LLMs, assessing completeness, correctness, precision, and recall against ground-truth facts.
Details
Motivation: As LLMs are increasingly used for biomedical question answering, there's a need for automated evaluation of answer quality and citation reliability, since manual expert assessment is time-consuming and costly in this specialized domain.
Method: Proposes BioACE framework with automated approaches to evaluate multiple aspects: completeness, correctness, precision, and recall against ground-truth nuggets. Compares various methods including NLI, pre-trained language models, and LLMs for citation quality assessment.
Result: Extensive experiments show correlation with human evaluations. The framework identifies best approaches for biomedical answer and citation evaluation, released as an open-source package.
Conclusion: BioACE provides an effective automated evaluation framework for biomedical LLM outputs, addressing the challenge of expert verification in this specialized domain through comprehensive multi-aspect assessment.
Abstract: With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, due to the requirements of expert assessment to verify consistency with the scientific literature and complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI) and pre-trained language models and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations into biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as a part of BioACE (https://github.com/deepaknlp/BioACE) evaluation package.
[2] CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System
Zexin Lin, Jiachen Yu, Haoyang Zhang, Yuzhao Li, Zhonghang Li, Yujiu Yang, Junjie Wang, Xiaoqiang Ji
Main category: cs.CL
TL;DR: CoWork-X: An active co-evolution framework for real-time collaborative agents using HTN-based skill retrieval and post-episode skill consolidation to reduce latency and token usage while improving performance.
Details
Motivation: Current language-conditioned agents struggle with real-time coordination and multi-episode adaptation under strict token budgets, facing trade-offs between latency-inducing in-episode reasoning and unreliable post-episode text-based improvements.
Method: CoWork-X uses a Skill-Agent with HTN-based skill retrieval from a structured skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with budget constraints and drift regularization, inspired by fast-slow memory separation.
Result: In Overcooked-AI-like benchmarks, CoWork-X achieves stable cumulative performance gains while steadily reducing online latency and token usage.
Conclusion: The framework successfully addresses real-time collaborative challenges by separating fast execution from slow optimization, enabling efficient multi-episode adaptation under token constraints.
Abstract: Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast–slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like realtime collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
[3] Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation
Sean Trott, Pamela D. Rivière
Main category: cs.CL
TL;DR: Multilingual language models underperform monolingual ones on lexical disambiguation tasks due to three capacity constraints: reduced embedding isotropy, weaker attention to disambiguating cues, and increased multi-token segmentation.
Details
Motivation: Multilingual language models often perform worse than monolingual models, potentially due to capacity limitations when handling multiple languages. The paper aims to quantify this "multilingual penalty" specifically for lexical disambiguation tasks requiring precise semantic representations.
Method: Used controlled datasets of human relatedness judgments for ambiguous words in English and Spanish. Compared monolingual and multilingual LMs from the same families. Explored three capacity constraints: representational (embedding isotropy), attentional (attention to disambiguating cues), and vocabulary-related (multi-token segmentation).
Result: Multilingual LMs consistently showed reduced performance compared to monolingual counterparts. Evidence found for all three capacity constraints, and these factors statistically accounted for the variance previously attributed to multilingual status alone.
Conclusion: Multilingual LMs suffer from multiple capacity constraints that correlate with reduced disambiguation performance, suggesting these limitations contribute to the observed multilingual penalty.
Abstract: Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation, a task requiring precise semantic representations and contextualization mechanisms, using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model’s multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.
[4] Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories
Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, Dong Yu
Main category: cs.CL
TL;DR: Locas introduces a locally-supported parametric memory that can be flexibly offloaded from or merged into transformer model parameters, enabling efficient continual learning with minimal parameter overhead.
Details
Motivation: To bridge test-time-training with parametric memory that supports efficient continual learning while minimizing catastrophic forgetting, allowing models to store past context information with minimal parameter overhead.
Method: Two variants: conventional two-layer MLP (with theoretical guarantees) and GLU-FFN structure (compatible with SOTA LLMs). Uses principled initialization by reusing model parameters, activations, and gradients. Tested on PG-19 language modeling and LoCoMo dialogue QA tasks.
Result: With only 0.02% additional parameters, Locas-GLU stores past context information while maintaining small context windows. MMLU evaluation shows minimal general capability loss after memorizing entire books, demonstrating effective catastrophic forgetting prevention.
Conclusion: Locas enables efficient continual learning by permanentizing past context into parametric knowledge with minimal catastrophic forgetting, offering a flexible memory mechanism for transformer models.
Abstract: In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other one shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories – performed in a principled way by reusing model parameters, activations and/or gradients – is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. In addition, we also test the model’s general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model’s existing internal knowledge.
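The GLU-FFN-style memory is concrete enough to sketch. Below is an illustrative low-rank "sideway" memory added next to a frozen FFN block; the module names, rank, and omitted principled-initialization step are assumptions rather than the released implementation.

```python
# Sketch under assumptions: a small GLU-style side memory riding alongside a frozen FFN.
import torch
import torch.nn as nn

class LocalMemory(nn.Module):
    def __init__(self, d_model: int, rank: int = 16):
        super().__init__()
        self.gate = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(d_model, rank, bias=False)
        self.down = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same GLU shape as modern FFN blocks, but low-rank, so it is cheap to train
        # and can later be merged into (or detached from) the base parameters.
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

frozen_ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
for p in frozen_ffn.parameters():
    p.requires_grad_(False)                # base model stays frozen
memory = LocalMemory(512)                  # only this small module is updated
x = torch.randn(1, 10, 512)
y = frozen_ffn(x) + memory(x)              # memory output added to the FFN output
```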
[5] Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models
Michael Browder, Kevin Duh, J. David Harris, Vince Lyzinski, Paul McNamee, Youngser Park, Carey E. Priebe, Peter Viechnicki
Main category: cs.CL
TL;DR: Proposes Data Kernel Perspective Space (DKPS) to provide mathematical foundations and statistical guarantees for transformer model outputs, addressing uncertainty in synthetic data generation for language technology.
Details
Motivation: Addresses the data scarcity problem in language technology and generative AI by providing mathematical analysis tools for transformer models, which are currently black boxes with unpredictable synthetic data properties.
Method: Introduces Data Kernel Perspective Space (DKPS) as a mathematical framework to analyze transformer models, deriving performance guarantees that can be applied to downstream tasks like neural machine translation and LLMs trained with Contrastive Preference Optimization.
Result: Provides mathematical derivation of DKPS showing how it offers performance guarantees for transformer model outputs, enabling better understanding and control of synthetic data generation quality.
Conclusion: DKPS offers foundational mathematical analysis for transformer models, addressing the uncertainty in synthetic data generation and providing concrete statistical guarantees for language technology applications.
Abstract: Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models – particularly LLMs – are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to ‘fiddle’ with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
[6] Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science
Jingru Fan, Dewen Liu, Yufan Dang, Huatao Li, Yuheng Wang, Wei Liu, Feiyu Duan, Xuanwen Ding, Shu Yao, Lin Wu, Ruijie Shi, Wai-Shing Leung, Yuan Cheng, Zhongyu Wei, Cheng Yang, Chen Qian, Zhiyuan Liu, Maosong Sun
Main category: cs.CL
TL;DR: Proposes a scientific framework for Multi-Agent Systems using LLMs, introducing a collaboration gain metric to distinguish genuine collaboration from resource accumulation and establishing a factor attribution paradigm for systematic optimization.
Details
Motivation: Current LLM-based Multi-Agent Systems rely on empirical trial-and-error without a principled scientific framework, lacking structured taxonomy of factors and unified metrics to distinguish genuine collaboration gains from mere resource accumulation.
Method: Establishes collaboration gain metric (Γ) as scientific standard to isolate intrinsic gains from increased budgets, proposes factor attribution paradigm to identify collaboration-driving factors, and constructs systematic MAS factor library with control-level presets and information-level dynamics.
Result: Provides a framework to transition from blind experimentation to rigorous science in LLM-based Multi-Agent Systems, enabling systematic optimization and improvement through principled factor attribution and standardized metrics.
Conclusion: The proposed framework facilitates the transition from empirical trial-and-error to design science in Multi-Agent Systems, paving the way toward a true science of Collective AI with systematic optimization capabilities.
Abstract: Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate establishing the collaboration gain metric (Γ) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging Γ, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.
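The abstract does not give the formula for Γ, so the following is only one plausible reading of a collaboration gain metric: compare the multi-agent system against a single agent granted the same budget, so that gains from simply spending more resources are netted out.

```python
# Hypothetical illustration, not the paper's definition of the metric.
def collaboration_gain(mas_score: float, budget_matched_single_score: float) -> float:
    """Gamma > 0 only if collaboration beats spending the same budget on one agent."""
    return mas_score - budget_matched_single_score

# Example: a debate system scores 0.71; a single agent given the same total tokens
# (e.g. via self-consistency sampling) scores 0.68, so Gamma = 0.03.
print(collaboration_gain(0.71, 0.68))
```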
[7] Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang, Jiatong Shi, William Chen, Xun Gong, Siddhant Arora, Chin-Jou Li, Masao Someki, Takashi Maekaku, Yusuke Shinohara, Jin Sakuma, Chao-Han Huck Yang, Shinji Watanabe
Main category: cs.CL
TL;DR: Bagpiper is an 8B audio foundation model that processes audio holistically using rich captions to bridge physical signals with cognitive concepts, achieving unified understanding and generation for general audio.
Details
Motivation: Current audio foundation models use rigid, task-specific supervision that addresses isolated factors rather than holistic audio understanding. Human intelligence processes audio holistically by bridging physical signals with abstract cognitive concepts. The authors aim to create a model that mimics this holistic approach.
Method: Bagpiper uses rich captions (comprehensive natural language descriptions) to interpret physical audio, establishing bidirectional mapping between raw audio and high-level conceptual space. Pre-trained on 600B tokens, it adopts a caption-then-process workflow during fine-tuning, simulating intermediate cognitive reasoning without task-specific priors.
Result: Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding, surpasses CosyVoice3 and TangoFlux in generation quality, and can synthesize arbitrary compositions of speech, music, and sound effects. It achieves unified understanding and generation for general audio.
Conclusion: Bagpiper demonstrates that holistic audio processing via rich captions enables unified understanding and generation capabilities, representing a significant advance in audio foundation models that bridge the gap between physical signals and cognitive concepts.
Abstract: Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, capable of synthesizing arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works that achieve unified understanding and generation for general audio. Model, data, and code are available at Bagpiper Home Page.
[8] Multilingual Extraction and Recognition of Implicit Discourse Relations in Speech and Text
Ahmed Ruby, Christian Hardmeier, Sara Stymne
Main category: cs.CL
TL;DR: Multimodal approach using Qwen2-Audio for implicit discourse relation classification across English, French, and Spanish, showing that combining text and audio improves performance over text-only models.
Details
Motivation: Implicit discourse relation classification is challenging because contextual cues can be distributed across modalities and languages, and text alone may not capture all necessary information.
Method: Proposed a multimodal approach integrating textual and acoustic information through Qwen2-Audio for joint modeling of text and audio, with automatic dataset construction for multilingual (English, French, Spanish) implicit discourse relations.
Result: Text-based models outperform audio-based models, but integrating both modalities enhances performance, and cross-lingual transfer provides substantial improvements for low-resource languages.
Conclusion: Multimodal integration of text and audio can improve implicit discourse relation classification, and cross-lingual transfer is beneficial for low-resource languages in this task.
Abstract: Implicit discourse relation classification is a challenging task, as it requires inferring meaning from context. While contextual cues can be distributed across modalities and vary across languages, they are not always captured by text alone. To address this, we introduce an automatic method for distantly related and unrelated language pairs to construct a multilingual and multimodal dataset for implicit discourse relations in English, French, and Spanish. For classification, we propose a multimodal approach that integrates textual and acoustic information through Qwen2-Audio, allowing joint modeling of text and audio for implicit discourse relation classification across languages. We find that while text-based models outperform audio-based models, integrating both modalities can enhance performance, and cross-lingual transfer can provide substantial improvements for low-resource languages.
[9] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
Yang Zhang, Mersin Konomi, Christos Xypolopoulos, Konstantinos Divriotis, Konstantinos Skianis, Giannis Nikolentzos, Giorgos Stamou, Guokan Shang, Michalis Vazirgiannis
Main category: cs.CL
TL;DR: GreekMMLU: A native-sourced Greek benchmark for evaluating LLMs with 21,805 multiple-choice questions across 45 subjects, revealing performance gaps between models.
Details
Motivation: Existing Greek evaluation datasets are often machine-translated from English and fail to capture authentic Greek linguistic and cultural characteristics, creating a need for native-sourced benchmarks.
Method: Created GreekMMLU by sourcing or authoring 21,805 multiple-choice questions in Greek from academic, professional, and governmental exams, organized under a new subject taxonomy with educational difficulty levels.
Result: Evaluation of 80+ LLMs shows substantial performance gaps between frontier vs. open-weight models and Greek-adapted vs. general multilingual models, with systematic analysis of factors influencing performance.
Conclusion: GreekMMLU enables robust evaluation of Greek language understanding in LLMs, revealing current limitations and providing insights for improving Greek language capabilities.
Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance, including model scale, adaptation, and prompting, and derive insights for improving LLM capabilities in Greek.
[10] LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation
Bingru Li
Main category: cs.CL
TL;DR: LinguistAgent is a user-friendly platform that automates linguistic annotation using a reflective multi-model architecture with dual-agent workflow for metaphor identification and other complex semantic tasks.
Details
Motivation: Data annotation is a major bottleneck in Humanities and Social Sciences, especially for complex semantic tasks like metaphor identification. While LLMs show promise, there's a gap between their theoretical capability and practical utility for researchers.
Method: LinguistAgent uses a reflective multi-model architecture with dual-agent workflow (Annotator and Reviewer) simulating professional peer-review. It supports three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning.
Result: The system demonstrates efficacy using metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, F1 score) against human gold standards. The application and codes are publicly released.
Conclusion: LinguistAgent bridges the gap between LLM capabilities and practical research utility, offering an integrated platform for automated linguistic annotation with comparative experimental support across multiple paradigms.
Abstract: Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent’s efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and F1 score) against human gold standards. The application and code are released at https://github.com/Bingru-Li/LinguistAgent.
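For readers unfamiliar with the token-level metrics the platform reports, here is a minimal reference computation; the token indices are illustrative and this is not the tool's API.

```python
# Token-level precision/recall/F1 of predicted metaphor tokens against a human gold standard.
def token_prf(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Token indices flagged as metaphorical by the Annotator vs. the human annotation.
print(token_prf({3, 7, 12, 15}, {3, 7, 15, 20}))  # (0.75, 0.75, 0.75)
```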
[11] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov
Main category: cs.CL
TL;DR: Multi-LLM collaboration systems are vulnerable to malicious models, causing significant performance degradation, especially in reasoning and safety tasks. Proposed mitigation strategies using external supervisors can recover most performance but complete security remains an open problem.
Details
Motivation: As language models increasingly collaborate in decentralized systems (routing, multi-agent debate, model merging), there's a critical safety risk when some models are compromised or malicious. The paper aims to quantify the impact of malicious models in multi-LLM systems and develop mitigation strategies.
Method: 1) Engineered four categories of malicious LMs, 2) Plugged them into four types of popular model collaboration systems, 3) Evaluated compromised systems across 10 datasets, 4) Proposed mitigation strategies using external supervisors that oversee model collaboration to disable/mask malicious components.
Result: Malicious models severely impact multi-LLM systems, with average performance degradation of 7.12% for reasoning tasks and 7.94% for safety domains. Mitigation strategies using external supervisors recover 95.31% of initial performance on average.
Conclusion: Multi-LLM collaboration systems are vulnerable to malicious components, especially in critical domains like reasoning and safety. While external supervision can mitigate most damage, making such systems fully resistant to malicious models remains an open research challenge.
Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains where performance is lowered by 7.12% and 7.94% on average. We then propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors that oversee model collaboration to disable/mask them out to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
[12] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu
Main category: cs.CL
TL;DR: Distilling collaborative patterns from multiple language models into a single model to preserve collaboration benefits while reducing computational costs, with a single-multi evolution loop for continuous improvement.
Details
Motivation: Model collaboration systems combine strengths of diverse models but incur high computational costs from loading multiple models. Need to improve efficiency while preserving collaborative benefits.
Method: Distill collaborative patterns into a single model trained on outputs of model collaboration system. Propose single-multi evolution loop: multiple LMs collaborate → each distills from collaborative outputs → improved LMs collaborate again → repeat evolution cycle.
Result: Individual models improve by 8.0% on average, absorbing collaboration strengths while reducing cost to single model. Collaboration benefits from stronger post-distillation LMs, improving over initial systems by 14.9% on average.
Conclusion: The single-multi evolution loop effectively distills collaborative benefits into single models, outperforms existing evolutionary AI methods, is compatible with diverse settings, and solves problems initial models struggle with.
Abstract: Model collaboration – systems where multiple language models (LMs) collaborate – combines the strengths of diverse models at the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems that the initial model/system struggles with.
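A high-level skeleton of the single-multi evolution loop might look like the following; `collaborate` and `distill` stand in for the paper's collaboration strategies and distillation step and are assumptions, not the authors' code.

```python
# Illustrative skeleton of the loop: collaborate, distill each model on the
# collaborative outputs, then collaborate again with the improved models.
def evolution_loop(models, collaborate, distill, tasks, rounds: int = 3):
    for _ in range(rounds):
        collaborative_outputs = [collaborate(models, task) for task in tasks]
        # Each model is fine-tuned on the system's outputs, inheriting the
        # collaboration's strengths while keeping single-model inference cost.
        models = [distill(model, tasks, collaborative_outputs) for model in models]
    return models
```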
[13] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky
Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
Main category: cs.CL
TL;DR: Open-weight LLMs show comparable performance to proprietary LLMs for social media content moderation tasks, with similar sensitivity and specificity ranges, enabling privacy-preserving moderation on consumer hardware.
Details
Motivation: As internet access expands, exposure to harmful content increases, creating a need for effective moderation. While proprietary LLMs have shown promise for content moderation, the capabilities of open-weight LLMs remain unexplored, especially for privacy-preserving solutions on consumer hardware.
Method: Evaluated seven state-of-the-art LLMs (four proprietary, three open-weight) using real-world Bluesky posts, moderation decisions by Bluesky Moderation Service, and author annotations. Measured sensitivity and specificity for different content types (rudeness, intolerance, threats).
Result: Open-weight LLMs showed considerable overlap with proprietary models: sensitivity 81%-97% vs 72%-98%, specificity 91%-100% vs 93%-99%. Specificity exceeded sensitivity for rudeness detection, but opposite for intolerance and threats. Identified inter-rater agreement between human moderators and LLMs.
Conclusion: Open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware, suggesting new directions for moderation systems that balance community values with individual preferences while maintaining performance comparable to proprietary solutions.
Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to zero-shot outperform traditional machine learning models, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81%–97%) and specificity (91%–100%) of the open-weight LLMs and those (72%–98%, and 93%–99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
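As a quick reminder of what the reported sensitivity and specificity ranges measure, here is an illustrative computation against ground-truth moderation labels; this is not the study's evaluation code and the example labels are made up.

```python
# Sensitivity = share of truly harmful posts flagged; specificity = share of benign posts passed.
def sensitivity_specificity(pred: list[bool], truth: list[bool]) -> tuple[float, float]:
    tp = sum(p and t for p, t in zip(pred, truth))               # harmful, flagged
    fn = sum((not p) and t for p, t in zip(pred, truth))         # harmful, missed
    tn = sum((not p) and (not t) for p, t in zip(pred, truth))   # benign, passed
    fp = sum(p and (not t) for p, t in zip(pred, truth))         # benign, flagged
    return tp / (tp + fn), tn / (tn + fp)

print(sensitivity_specificity([True, True, False, False], [True, False, True, False]))
```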
[14] TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
Main category: cs.CL
TL;DR: TASTE introduces text-aligned speech tokenization and embedding for joint speech-text modeling, enabling spoken language models that preserve paralinguistic information while reducing token sequence length.
Details
Motivation: To enable more natural human-LLM interaction through spoken language models that can both listen and speak, addressing the modality gap between speech and text representations in joint modeling approaches.
Method: Proposes TASTE (Text-Aligned Speech Tokenization and Embedding) using attention-based aggregation with speech reconstruction as training objective to align speech tokens with corresponding text transcriptions during tokenization.
Result: TASTE preserves paralinguistic information while dramatically reducing token sequence length, enabling straightforward joint spoken language modeling via LoRA on pre-trained text LLMs; performs comparably to previous work on SALMON and StoryCloze, and significantly outperforms other pre-trained SLMs on speech continuation tasks.
Conclusion: TASTE is the first end-to-end approach using reconstruction objective to automatically learn text-aligned speech tokenization suitable for spoken language modeling, bridging the speech-text modality gap effectively.
Abstract: Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. This is achieved through an attention-based aggregation mechanism, with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
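The attention-based aggregation is easy to picture with a small sketch: text-token embeddings act as queries over the speech frames, yielding one text-aligned speech embedding per transcript token. Shapes and module names below are illustrative assumptions, not the released model.

```python
# Minimal sketch of text-aligned aggregation: speech frames are pooled per text token.
import torch
import torch.nn as nn

class TextAlignedAggregator(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, text_embeds: torch.Tensor, speech_feats: torch.Tensor):
        # text_embeds: (batch, n_text_tokens, d), speech_feats: (batch, n_frames, d)
        aligned, _ = self.attn(text_embeds, speech_feats, speech_feats)
        return aligned  # (batch, n_text_tokens, d): one speech token per text token

agg = TextAlignedAggregator()
out = agg(torch.randn(1, 12, 512), torch.randn(1, 300, 512))  # 300 frames -> 12 tokens
print(out.shape)
```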
[15] Aligning Large Language Model Behavior with Human Citation Preferences
Kenichiro Ando, Tatsuya Harada
Main category: cs.CL
TL;DR: This paper analyzes how LLMs determine what content to cite compared to human preferences, finding systematic biases in citation behavior that can be calibrated through preference optimization.
Details
Motivation: While LLM services increasingly add citations for credibility, there's limited understanding of how LLMs recognize cite-worthiness and how this aligns with human preferences. The research aims to characterize the gap between LLM citation behavior and human citation preferences.
Method: Constructed a dataset to characterize human citation preferences vs. LLM behavior. Web-derived texts were categorized into eight citation-motivation types, with pairwise citation preferences exhaustively evaluated across all type combinations. Used Direct Preference Optimization to calibrate model behavior.
Result: Humans most frequently seek citations for medical text, and stronger models show similar tendency. Models are 27% more likely than humans to cite text explicitly marked as needing citations (like Wikipedia), reducing alignment accuracy. Models systematically underselect numeric sentences (-22.6%) and sentences with personal names (-20.1%), categories where humans typically demand citations. Direct Preference Optimization successfully calibrated model behavior to better match human preferences.
Conclusion: LLMs have systematic biases in citation behavior that don’t align well with human preferences, but these can be calibrated through preference optimization. The study provides foundation for more fine-grained investigation of LLM citation preferences.
Abstract: Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (by -22.6% relative to humans) and sentences containing personal names (by -20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
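The calibration step uses Direct Preference Optimization; for reference, the standard DPO objective looks like the snippet below. The paper's exact training configuration is not given here, so treat the values and variable names as placeholders.

```python
# Standard DPO loss over human-preferred vs. dispreferred citation decisions.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Log-probabilities of the preferred and dispreferred responses under the policy
    # being trained and under a frozen reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

loss = dpo_loss(torch.tensor([-3.2]), torch.tensor([-2.9]),
                torch.tensor([-3.1]), torch.tensor([-3.0]))
print(loss)
```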
[16] Quantifying the Knowledge Proximity Between Academic and Industry Research: An Entity and Semantic Perspective
Hongye Zhao, Yi Zhao, Chengzhi Zhang
Main category: cs.CL
TL;DR: This paper analyzes academia-industry co-evolution by quantifying knowledge proximity through fine-grained entity extraction and semantic space analysis, revealing increasing convergence particularly after technological changes.
Details
Motivation: Existing studies on academia-industry knowledge proximity rely on macro indicators like collaborative paper counts, lacking fine-grained analysis of knowledge units, which limits understanding of collaboration dynamics and resource allocation efficiency.
Method: 1) Extract fine-grained knowledge entities using pre-trained models and measure sequence overlaps with cosine similarity; 2) Analyze topological features via complex network analysis; 3) Use unsupervised contrastive learning to quantify semantic space convergence by measuring cross-institutional textual similarities; 4) Examine correlations using citation distribution patterns.
Result: Knowledge proximity between academia and industry increases over time, especially following technological changes, providing evidence of bidirectional adaptation. Academia’s knowledge dominance weakens during technological paradigm shifts.
Conclusion: The study provides a fine-grained methodology for analyzing academia-industry co-evolution, revealing dynamic knowledge convergence patterns and institutional adaptation mechanisms that can inform collaboration frameworks and resource allocation.
Abstract: Academia and industry are characterized by a reciprocal shaping and dynamic feedback mechanism. Despite distinct institutional logics, they have adapted closely in collaborative publishing and talent mobility, demonstrating tension between institutional divergence and intensive collaboration. Existing studies on their knowledge proximity mainly rely on macro indicators such as the number of collaborative papers or patents, lacking an analysis of knowledge units in the literature. This has led to an insufficient grasp of fine-grained knowledge proximity between industry and academia, potentially undermining collaboration frameworks and resource allocation efficiency. To remedy this limitation, this study quantifies the trajectory of academia-industry co-evolution through fine-grained entities and semantic space. In the entity measurement part, we extract fine-grained knowledge entities via pre-trained models, measure sequence overlaps using cosine similarity, and analyze topological features through complex network analysis. At the semantic level, we employ unsupervised contrastive learning to quantify convergence in semantic spaces by measuring cross-institutional textual similarities. Finally, we use citation distribution patterns to examine correlations between bidirectional knowledge flows and similarity. Analysis reveals that knowledge proximity between academia and industry rises, particularly following technological change. This provides textual evidence of bidirectional adaptation in co-evolution. Additionally, academia’s knowledge dominance weakens during technological paradigm shifts. The dataset and code for this paper can be accessed at https://github.com/tinierZhao/Academic-Industrial-associations.
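At the entity level, the measurement boils down to comparing embedded entity sequences with cosine similarity; a toy sketch follows, with random vectors standing in for the real entity embeddings (the embedding model is an assumption, not necessarily the one the authors used).

```python
# Illustrative entity-level proximity: cosine similarity of aggregated entity embeddings.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pretend 300-d vectors aggregated from each institution's extracted entity sequences.
academic_vec = np.random.rand(300)
industry_vec = np.random.rand(300)
print(f"entity-level knowledge proximity: {cosine(academic_vec, industry_vec):.3f}")
```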
[17] FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters
Zhilin Liang, Yuxiang Wang, Zimu Zhou, Hainan Zhang, Boyi Liu, Yongxin Tong
Main category: cs.CL
TL;DR: FedMosaic is a federated RAG framework that uses parametric adapters with document clustering and selective aggregation to enable privacy-preserving knowledge retrieval without sharing raw documents.
Details
Motivation: Traditional RAG assumes centralized knowledge, which violates privacy requirements in domains with siloed data. Federated RAG (FedRAG) is needed to enable LLMs to access distributed knowledge without sharing raw documents.
Method: Uses parametric adapters instead of raw-text exchange; clusters semantically related documents into multi-document adapters with document-specific masks; performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters.
Result: Achieves 10.9% higher accuracy than state-of-the-art methods, reduces storage costs by 78.8-86.3%, communication costs by 91.4%, while maintaining privacy by never sharing raw documents.
Conclusion: FedMosaic successfully addresses FedRAG challenges through efficient parametric adapter design with clustering and selective aggregation, enabling practical privacy-preserving RAG.
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, non-conflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
[18] Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks
Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
Main category: cs.CL
TL;DR: Copyright Detective: First interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs through evidence discovery process
Details
Motivation: Current approaches treat copyright infringement as static classification, but copyright law is complex and requires evidence discovery. Need systematic auditing tools for verbatim memorization and paraphrase-level leakage in LLMs to support responsible deployment.
Method: Interactive forensic system integrating multiple detection paradigms: content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification. Uses interactive prompting, response collection, and iterative workflows within unified extensible framework.
Result: First system enabling systematic auditing of copyright risks in LLM outputs, supporting transparent evaluation even with black-box access to models.
Conclusion: Copyright Detective provides comprehensive forensic analysis for copyright risks in LLMs, moving beyond static classification to evidence-based discovery process for responsible AI deployment.
Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.
[19] CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs
Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
Main category: cs.CL
TL;DR: CoPE introduces soft clipping of low-frequency RoPE components to unify OOD mitigation and semantic modeling for better length generalization in LLMs.
Details
Motivation: Current methods for adapting Rotary Positional Embedding (RoPE) to longer contexts fall into two categories: out-of-distribution (OOD) mitigation and Semantic Modeling. The authors aim to unify these approaches through a minimalist intervention that addresses both objectives simultaneously.
Method: CoPE applies soft clipping to the low-frequency components of RoPE. This eliminates OOD outliers, refines semantic signals, and prevents the spectral leakage caused by hard clipping.
Result: Extensive experiments show that applying soft clipping to RoPE yields significant performance gains that scale up to 256k context length, establishing CoPE as state-of-the-art for length generalization.
Conclusion: CoPE successfully unifies OOD mitigation and semantic modeling objectives through soft clipping of RoPE’s low-frequency components, providing an effective solution for context scaling in LLMs with strong theoretical grounding and empirical results.
Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) Semantic Modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at https://github.com/hrlics/CoPE.
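The paper defines CoPE as soft clipping of RoPE's low-frequency components, but the abstract does not give the clipping function. The sketch below contrasts standard RoPE angles with a hypothetical tanh-based soft clip that saturates the slow-rotating dimensions smoothly; the saturation function and clip value are illustrative assumptions only.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: angle[m, i] = m * base**(-2i/dim) for each frequency pair i."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)          # (num_positions, dim // 2)

def soft_clip_angles(angles, clip_value, sharpness=1.0):
    """Hypothetical soft clip: a smooth saturation (tanh) instead of a hard min,
    so low-frequency angles stop growing with position without the spectral
    discontinuity a hard clip would introduce. The exact function used by CoPE
    may differ; this only illustrates the soft-vs-hard distinction."""
    return clip_value * np.tanh(sharpness * angles / clip_value)

positions = np.arange(0, 65536, 4096)
angles = rope_angles(positions, dim=128)
low_freq = angles[:, -8:]                          # slowest-rotating (low-frequency) pairs
clipped = soft_clip_angles(low_freq, clip_value=2 * np.pi)
print(low_freq[-1, -1], "->", clipped[-1, -1])     # a large angle saturates near 2*pi
```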
[20] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
Main category: cs.CL
TL;DR: The paper analyzes response length patterns in RLVR algorithms for LLMs/VLMs, identifies length bias in GSPO, and proposes LUSPO to address this issue, achieving SOTA performance on reasoning tasks.
Details
Motivation: RLVR algorithms show varied response length patterns during training, with some algorithms experiencing length collapse. The paper aims to understand the fundamental causes of these variations and develop a length-unbiased optimization strategy.
Method: Conducts theoretical analysis of RLVR algorithm components, identifies length bias in GSPO, and proposes the LUSPO algorithm that rectifies this bias by making the loss function unbiased with respect to response length.
Result: LUSPO consistently achieves superior performance across mathematical reasoning benchmarks and multimodal reasoning scenarios, demonstrating state-of-the-art performance compared to existing methods like GRPO and GSPO.
Conclusion: LUSPO represents a novel optimization strategy that addresses length bias in RLVR algorithms, preventing response length collapse and improving reasoning capabilities in LLMs and VLMs.
Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
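The abstract does not spell out the corrected loss, so the sketch below only shows a GSPO-style sequence-level objective with the length-dependent terms made explicit, which is where a correction of the kind LUSPO describes would act; the tensor names and the exact form are our own assumptions.

```python
import torch

def sequence_level_policy_loss(logp_new, logp_old, advantages, mask):
    """GSPO-style sequence-level surrogate with the length-dependent terms explicit.
    logp_new/logp_old: (batch, seq_len) per-token log-probs under current/old policy.
    advantages: (batch,) one verifiable-reward advantage per response.
    mask: (batch, seq_len) 1 for response tokens, 0 for padding.
    LUSPO's actual correction is not given in the abstract; it would modify how
    response length enters the terms marked below."""
    lengths = mask.sum(dim=1).clamp(min=1)
    # Length enters here: geometric-mean (length-normalized) importance ratio.
    seq_log_ratio = ((logp_new - logp_old) * mask).sum(dim=1) / lengths
    seq_ratio = seq_log_ratio.exp()
    # And here: each sequence contributes one term, averaged over the batch.
    return -(seq_ratio * advantages).mean()

# Toy shapes.
B, T = 4, 16
logp_new = torch.randn(B, T) * 0.1 - 2.0
logp_old = (logp_new + torch.randn(B, T) * 0.01).detach()
adv = torch.tensor([1.0, -1.0, 0.5, 0.0])
mask = torch.ones(B, T)
print(sequence_level_policy_loss(logp_new, logp_old, adv, mask))
```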
[21] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning
Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov
Main category: cs.CL
TL;DR: MentorCollab enables large reasoning models to selectively guide small language models at inference time using divergence detection and verification, improving reasoning performance with minimal overhead.
Details
Motivation: Large reasoning models (LRMs) have strong reasoning capabilities but high inference costs, while small language models (SLMs) are efficient but struggle with multi-step reasoning. Existing collaboration methods often promote imitation rather than effective guidance.
Method: Proposes MentorCollab, in which an LRM selectively guides an SLM at inference time. At random token positions, the system probes for divergences between the models and uses a lightweight verifier to decide whether the SLM should follow a short lookahead segment from the mentor or continue independently.
Result: Across 15 SLM-LRM pairs and 3 domains (math reasoning, general knowledge, commonsense reasoning), the method improves performance in 12 settings with average gains of 3.0% and up to 8.0%, while using only 18.4% of tokens from the expensive mentor model on average.
Conclusion: Selective inference-time guidance can restore large-model reasoning ability without substantial inference overhead, showing that short segments and selective probing are sufficient for effective collaboration between models of different scales.
Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM–LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
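To make the selective-guidance loop concrete, here is a toy decoding sketch with hypothetical stand-ins for the small model, the mentor's lookahead, and the verifier; the probing probability, segment length, and divergence test are illustrative choices rather than the paper's settings.

```python
import random

def mentor_collab_decode(slm_next, lrm_lookahead, verifier, prompt,
                         max_tokens=64, probe_prob=0.1, lookahead=8):
    """Toy decoding loop in the spirit of selective large-to-small guidance.
    slm_next(tokens) -> next token proposed by the small model.
    lrm_lookahead(tokens, k) -> a k-token continuation from the mentor.
    verifier(tokens, segment) -> True if the mentor segment should be adopted.
    All three callables are hypothetical stand-ins for real model calls."""
    tokens = list(prompt)
    mentor_tokens_used = 0
    while len(tokens) < max_tokens:
        if random.random() < probe_prob:                 # sparse, random probing
            segment = lrm_lookahead(tokens, lookahead)
            slm_guess = slm_next(tokens)
            diverges = segment[:1] != [slm_guess]
            if diverges and verifier(tokens, segment):   # adopt the mentor segment
                tokens.extend(segment)
                mentor_tokens_used += len(segment)
                continue
        tokens.append(slm_next(tokens))                  # otherwise the SLM continues alone
    return tokens, mentor_tokens_used / max(1, len(tokens) - len(prompt))

# Dummy stand-ins so the sketch runs end to end.
slm = lambda toks: len(toks) % 7
lrm = lambda toks, k: [(len(toks) + i) % 5 for i in range(k)]
ver = lambda toks, seg: sum(seg) % 2 == 0
out, mentor_frac = mentor_collab_decode(slm, lrm, ver, prompt=[0, 1, 2])
print(len(out), f"mentor token fraction: {mentor_frac:.2f}")
```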
[22] How Do Language Models Acquire Character-Level Information?
Soma Sato, Ryohei Sasano
Main category: cs.CL
TL;DR: LMs implicitly encode character-level information through tokenization-dependent factors (merge rules, orthographic constraints) and tokenization-independent factors (semantic associations, syntactic information).
Details
Motivation: To understand how language models acquire character-level knowledge despite not being explicitly trained on character-level information, and to reveal the underlying mechanisms behind this phenomenon.
Method: Analyze LMs trained under controlled settings (specified pre-training datasets or tokenizers) compared to standard settings, categorizing the contributing factors as tokenization-dependent or tokenization-independent.
Result: Merge rules and orthographic constraints are primary tokenization-dependent factors, while semantic associations of substrings and syntactic information are key tokenization-independent factors contributing to character-level knowledge.
Conclusion: LMs acquire character-level information through both tokenization-dependent mechanisms (related to how text is segmented) and tokenization-independent mechanisms (related to linguistic structure and meaning), providing insights into how models learn subword representations.
Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite such information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those dependent on tokenization and those independent of it. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
[23] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning
Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang
Main category: cs.CL
TL;DR: PACE introduces a more efficient alternative to Best-of-N sampling for aligning LLMs on reasoning tasks, using corrective exploration with minimal compute budget instead of aggressive sampling.
Details
Motivation: Standard DPO implementations rely on Best-of-N sampling (N≥8) to mine golden trajectories, but this scaling approach has diminishing returns and can cause policy collapse in mathematical reasoning tasks due to verifier noise amplification.
Method: PACE replaces brute-force Best-of-N sampling with a generation-based corrective strategy using a minimal budget (2<N<3). It synthesizes high-fidelity preference pairs from failed explorations through proximal alignment.
Result: PACE outperforms DPO-R1 (N=16) while using only about 1/5 of the compute, demonstrating superior robustness against reward hacking and label noise.
Conclusion: Aggressive exploration in reasoning tasks yields diminishing returns; PACE’s corrective exploration approach provides more efficient and robust alignment for LLMs on reasoning tasks.
Abstract: Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce \textbf{PACE} (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2<N<3$), PACE synthesizes high-fidelity preference pairs from failed explorations. Empirical evaluations show that PACE outperforms DPO-R1 $(N=16)$ while using only about $1/5$ of the compute, demonstrating superior robustness against reward hacking and label noise.
[24] Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
Chaimae Abouzahir, Congbo Ma, Nizar Habash, Farah E. Shamout
Main category: cs.CL
TL;DR: Cross-lingual analysis reveals persistent performance gaps in LLMs for Arabic vs. English medical QA, with gaps widening with task complexity, due to tokenization issues and unreliable confidence metrics.
Details
Motivation: LLMs are widely used in medical applications but are often English-centric, limiting their robustness for linguistically diverse communities. Performance discrepancies in low-resource languages for medical tasks exist but underlying causes remain poorly understood.
Method: Cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering, including tokenization analysis of Arabic medical text and reliability analysis of model-reported confidence and explanations.
Result: Persistent language-driven performance gap between Arabic and English that intensifies with increasing task complexity. Tokenization analysis shows structural fragmentation in Arabic medical text, and reliability analysis reveals limited correlation between model-reported confidence/explanations and correctness.
Conclusion: Findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks to address performance disparities across languages.
Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
[25] IESR: Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models
Tao Liu, Jiafan Lu, Bohan Yu, Pengcheng Wu, Liu Haixin, Guoyu Xu, Li Xiangheng, Lixiao Li, Jiaming Hou, Zhao Shijun, Xinglin Lyu, Kunli Zhang, Yuxiang Jia, Hongyin Zan
Main category: cs.CL
TL;DR: IESR framework enhances lightweight LLMs for Text-to-SQL with information understanding, MCTS-based reasoning, and consistency verification, achieving SOTA on complex reasoning benchmarks without fine-tuning.
Details
Motivation: Current Text-to-SQL methods struggle with complex reasoning, domain knowledge, hypothetical queries, and remain costly for enterprise deployment, despite good performance on standard benchmarks.
Method: IESR framework: (i) uses LLMs for information understanding and schema linking, decoupling mathematical computation from SQL generation; (ii) integrates multi-path reasoning via Monte Carlo Tree Search (MCTS) with majority voting; (iii) adds trajectory consistency verification with a discriminator model.
Result: Achieves state-of-the-art performance on complex reasoning benchmark LogicCat (24.28 EX) and Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Analysis reveals biases in current coder models regarding physical knowledge, math computation, and common-sense reasoning.
Conclusion: IESR demonstrates that lightweight LLMs can achieve strong Text-to-SQL performance through structured reasoning and verification mechanisms, highlighting important research directions for addressing model biases and reasoning deficiencies.
Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models that: (i) leverages LLMs for key information understanding and schema linking while decoupling mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at https://github.com/Ffunkytao/IESR-SLM.
[26] Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
Jiyun Chun, Eric Fosler-Lussier, Michael White, Andrew Perrault
Main category: cs.CL
TL;DR: LLM-based framework for evaluating child utterance quality in dialogue, focusing on Expansion (contextual elaboration) and Independence (discourse contribution) rather than traditional length-based metrics.
Details
Motivation: Current metrics for evaluating children's speech quality (like MLU, lexical diversity, readability indices) are dominated by length and ignore conversational context, missing important aspects like reasoning depth, topic maintenance, and discourse planning.
Method: Introduces an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type, then scores child responses along two axes: Expansion (contextual elaboration and inferential depth) and Independence (child’s contribution to advancing discourse).
Result: Shows developmental validity through age-related patterns, improves age estimation over baselines, detects discourse relation differences, and aligns with human judgments, enabling large-scale evaluation.
Conclusion: Shifts child utterance assessment from length measurement to evaluating how meaningfully children’s speech contributes to and advances conversation within context, using LLMs for context-sensitive evaluation.
Abstract: Evaluating the quality of children’s utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child’s response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child’s contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child’s speech contributes to and advances the conversation within its context.
[27] Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
Main category: cs.CL
TL;DR: LET accelerates LLM pretraining by using late-layer representations from small pretrained models to guide early layers of larger models during early training, achieving faster convergence and better performance.
Details
Motivation: Pretraining large language models is computationally expensive, and existing small pretrained models are underutilized. The paper explores whether small pretrained models can accelerate training of larger models.
Method: Proposes the Late-to-Early Training (LET) paradigm with two mechanisms: late-to-early-step learning (using late training phase knowledge in early steps) and late-to-early-layer learning (using late-layer representations to guide early layers).
Result: Achieves up to 1.6× speedup with nearly 5% improvement in downstream task accuracy for 1.4B LLM on Pile dataset, even when using pretrained model with 10× fewer parameters.
Conclusion: LET enables efficient LLM pretraining by leveraging existing small models, significantly reducing computational costs while improving performance.
Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: \textit{Can we leverage existing small pretrained models to accelerate the training of larger models?} In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET’s effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET’s efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.
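The abstract states the idea (guide early layers of the large model with late-layer representations of a small pretrained model) without giving the loss. Below is a minimal PyTorch sketch assuming a learned linear projection and an MSE alignment term added to the language-modeling loss; the layer choices and loss form are assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateToEarlyGuidance(nn.Module):
    """Hedged sketch of late-to-early guidance: push an early layer of the large
    (student) model toward a late layer of a small, frozen pretrained model.
    The choice of layers, the linear projection, and the MSE form are
    illustrative assumptions."""
    def __init__(self, teacher_dim, student_dim):
        super().__init__()
        self.proj = nn.Linear(teacher_dim, student_dim, bias=False)

    def forward(self, student_early_hidden, teacher_late_hidden):
        # teacher_late_hidden comes from a frozen small model.
        target = self.proj(teacher_late_hidden)
        return F.mse_loss(student_early_hidden, target)

# Toy usage: batch of 2 sequences, 8 tokens; teacher dim 512, student dim 2048.
guide = LateToEarlyGuidance(teacher_dim=512, student_dim=2048)
student_h = torch.randn(2, 8, 2048)
teacher_h = torch.randn(2, 8, 512)
aux_loss = guide(student_h, teacher_h)   # would be added to the LM loss with a weight
print(aux_loss.item())
```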
[28] OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Main category: cs.CL
TL;DR: OPUS is a dynamic data selection framework that uses optimizer-induced projected utility to select better training tokens, achieving superior performance with fewer tokens across various model scales and domains.
Details
Motivation: As high-quality public text data becomes exhausted (the "Data Wall"), there's a need to shift from training on more tokens to training on better tokens. Existing methods either use static filters that ignore training dynamics or dynamic methods that are optimizer-agnostic.
Method: OPUS defines utility in the optimizer-induced update space, scoring candidates by projecting their effective updates (shaped by modern optimizers) onto a target direction from a stable, in-distribution proxy. Uses Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity.
Result: Achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. Outperforms industrial-level baselines and even full 200B-token training with only 30B tokens. When combined with static filters, further improves efficiency even with lower-quality data. In specialized domains, achieves superior performance using only 0.5B tokens vs. full training with 3B tokens.
Conclusion: OPUS provides an effective dynamic data selection framework that significantly improves pre-training efficiency and data utilization, addressing the Data Wall problem by selecting better tokens rather than more tokens.
Abstract: As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
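As a sketch of the scoring-and-sampling loop, the code below treats "optimizer-induced effective updates" as Adam-preconditioned per-sample gradients (an assumption), projects them onto a proxy target direction, and draws a batch by Boltzmann sampling; the paper's Ghost/CountSketch machinery for making this cheap is omitted.

```python
import numpy as np

def opus_style_select(per_sample_grads, adam_second_moment, proxy_direction,
                      num_select, temperature=1.0, eps=1e-8, seed=0):
    """Hedged sketch of optimizer-induced projected utility selection.
    per_sample_grads: (N, D) flattened gradient of each candidate example.
    adam_second_moment: (D,) running second moment from the optimizer state,
    used here to turn raw gradients into preconditioned 'effective updates'
    (an assumption about what 'optimizer-induced' means in practice).
    proxy_direction: (D,) target update direction from an in-distribution proxy.
    Returns indices chosen by Boltzmann sampling over the projected utilities."""
    effective_updates = per_sample_grads / (np.sqrt(adam_second_moment) + eps)
    utilities = effective_updates @ proxy_direction          # projection onto the target
    logits = utilities / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(utilities), size=num_select, replace=False, p=probs)

# Toy usage.
rng = np.random.default_rng(1)
grads = rng.normal(size=(100, 256))
v = np.abs(rng.normal(size=256)) + 1e-3
d = rng.normal(size=256)
print(opus_style_select(grads, v, d, num_select=10))
```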
[29] Grammatical Error Correction Evaluation by Optimally Transporting Edit Representation
Takumi Goto, Yusuke Sakai, Taro Watanabe
Main category: cs.CL
TL;DR: UOT-ERRANT is a new metric for grammatical error correction evaluation that uses edit vectors and unbalanced optimal transport to measure similarity between hypothesis and reference edits, improving performance especially in fluency domains.
Details
Motivation: Current reference-based metrics for GEC evaluation using embeddings like BERTScore are ineffective because many words remain unchanged between source, hypothesis, and reference sentences. There's a need for metrics specifically designed for GEC that focus on edits rather than full sentences.
Method: Proposes edit vectors to represent edits (using the ERRANT framework) and introduces the UOT-ERRANT metric, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. The transport plan provides an interpretable soft edit alignment.
Result: Experiments with SEEDA meta-evaluation show UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. The method is highly interpretable due to the transport plan serving as soft edit alignment.
Conclusion: UOT-ERRANT is an effective and interpretable metric for GEC evaluation that outperforms existing methods, especially for fluency-focused corrections, making it useful for both system ranking and analyzing GEC systems.
Abstract: Automatic evaluation in grammatical error correction (GEC) is crucial for selecting the best-performing systems. Currently, reference-based metrics are a popular choice, which basically measure the similarity between hypothesis and reference sentences. However, similarity measures based on embeddings, such as BERTScore, are often ineffective, since many words in the source sentences remain unchanged in both the hypothesis and the reference. This study focuses on edits specifically designed for GEC, i.e., ERRANT, and computes similarity measured over the edits from the source sentence. To this end, we propose edit vector, a representation for an edit, and introduce a new metric, UOT-ERRANT, which transports these edit vectors from hypothesis to reference using unbalanced optimal transport. Experiments with SEEDA meta-evaluation show that UOT-ERRANT improves evaluation performance, particularly in the +Fluency domain where many edits occur. Moreover, our method is highly interpretable because the transport plan can be interpreted as a soft edit alignment, making UOT-ERRANT a useful metric for both system ranking and analyzing GEC systems. Our code is available from https://github.com/gotutiyan/uot-errant.
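The core computation, soft-aligning hypothesis and reference edit vectors with unbalanced optimal transport, can be sketched with a small entropic unbalanced Sinkhorn solver. The edit-vector construction from ERRANT and the final scoring rule below are simplified stand-ins for the paper's definitions.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, cost, eps=0.1, tau=1.0, n_iter=200):
    """Minimal entropic unbalanced OT with KL-relaxed marginals.
    a, b: non-negative masses over hypothesis / reference edits.
    cost: (len(a), len(b)) cost matrix. Returns the transport plan."""
    K = np.exp(-cost / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    power = tau / (tau + eps)
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-16)) ** power
        v = (b / (K.T @ u + 1e-16)) ** power
    return u[:, None] * K * v[None, :]

def edit_similarity(hyp_edit_vecs, ref_edit_vecs):
    """Illustrative scoring: soft-align hypothesis and reference edit vectors with
    unbalanced OT and reward mass placed on similar pairs. The real UOT-ERRANT
    metric builds edit vectors from ERRANT and scores differently in detail."""
    h = hyp_edit_vecs / (np.linalg.norm(hyp_edit_vecs, axis=1, keepdims=True) + 1e-8)
    r = ref_edit_vecs / (np.linalg.norm(ref_edit_vecs, axis=1, keepdims=True) + 1e-8)
    cost = 1.0 - h @ r.T                          # cosine distance between edits
    plan = unbalanced_sinkhorn(np.ones(len(h)), np.ones(len(r)), cost)
    return float((plan * (1.0 - cost)).sum()), plan

rng = np.random.default_rng(0)
score, plan = edit_similarity(rng.normal(size=(3, 32)), rng.normal(size=(4, 32)))
print(round(score, 3), plan.shape)   # the plan doubles as a soft edit alignment
```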
[30] Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani
Main category: cs.CL
TL;DR: M2CQA is a culturally grounded multimodal benchmark for evaluating vision-language model hallucinations across 17 MENA countries with contrastive statements in English, Arabic, and dialects, introducing CounterFactual Hallucination Rate (CFHR) to measure counterfactual acceptance.
Details
Motivation: Existing hallucination benchmarks rarely test culturally plausible but visually incorrect interpretations, particularly outside Western contexts and English. There's a need to evaluate VLMs' susceptibility to accepting culturally plausible but factually incorrect interpretations in diverse cultural settings.
Method: Built M2CQA benchmark from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. Proposed CounterFactual Hallucination Rate (CFHR) to measure counterfactual acceptance conditioned on correctly answering the true statement. Evaluated state-of-the-art VLMs under multiple prompting strategies.
Result: CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness.
Conclusion: Cultural context significantly affects VLM hallucination rates, with higher susceptibility in non-English languages and dialects. Prompting strategy matters - reasoning-first approaches increase hallucination while answer-first approaches improve robustness.
Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
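CFHR as described is a conditional rate, which is easy to state in code; the sketch below assumes per-item booleans for "true statement answered correctly" and "counterfactual statement accepted".

```python
def counterfactual_hallucination_rate(true_correct, counterfactual_accepted):
    """Among items where the model answers the TRUE statement correctly, the
    fraction where it also accepts the COUNTERFACTUAL statement.
    Inputs are parallel lists of booleans per image-statement pair."""
    eligible = [cf for ok, cf in zip(true_correct, counterfactual_accepted) if ok]
    return sum(eligible) / len(eligible) if eligible else 0.0

# Toy usage: 5 items, 4 answered correctly on the true statement,
# 2 of those 4 also accept the counterfactual -> CFHR = 0.5.
print(counterfactual_hallucination_rate(
    [True, True, False, True, True],
    [True, False, True, False, True]))
```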
[31] Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
Main category: cs.CL
TL;DR: CFA² uses causal front-door adjustment to jailbreak LLMs by modeling safety mechanisms as unobserved confounders and stripping defense features with sparse autoencoders.
Details
Motivation: Safety alignment in LLMs operates as latent internal states that obscure inherent capabilities, making it difficult to understand and bypass safety mechanisms. The paper aims to provide a mechanistic interpretation of jailbreaking by modeling safety as an unobserved confounder from a causal perspective.
Method: Proposes Causal Front-Door Adjustment Attack (CFA²) using Pearl’s Front-Door Criterion to sever confounding associations. Employs Sparse Autoencoders (SAEs) to physically strip defense-related features and isolate core task intent. Reduces computationally expensive marginalization to deterministic intervention with low inference complexity.
Result: CFA² achieves state-of-the-art attack success rates while offering mechanistic interpretation of the jailbreaking process.
Conclusion: The paper presents an effective causal framework for jailbreaking LLMs that provides interpretable insights into safety mechanisms while maintaining high attack success rates.
Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model’s inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the \textbf{C}ausal \textbf{F}ront-Door \textbf{A}djustment \textbf{A}ttack ({\textbf{CFA}}$^2$) to jailbreak LLM, which is a framework that leverages Pearl’s Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that {CFA}$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
[32] Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale
Damon McMillan
Main category: cs.CL
TL;DR: Systematic study of context engineering for LLM agents operating on structured data through SQL generation, testing 9,649 experiments across 11 models, 4 formats, and schemas up to 10,000 tables.
Details
Motivation: LLM agents increasingly operate external systems through programmatic interfaces, but practitioners lack empirical guidance on how to structure the context these agents consume, particularly for structured data operations.
Method: Used SQL generation as a proxy for programmatic agent operations, conducting 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, TOON), and schemas ranging from 10 to 10,000 tables to systematically study context engineering.
Result: Found that: 1) Architecture choice is model-dependent (frontier models benefit from file-based context while open source models show deficits), 2) Format doesn’t significantly affect aggregate accuracy but individual models show sensitivities, 3) Model capability is dominant factor (21% accuracy gap between frontier and open source), 4) File-native agents scale to 10,000 tables via domain-partitioned schemas, 5) File size doesn’t predict runtime efficiency due to format-unfamiliar search patterns.
Conclusion: Architectural decisions for LLM agents on structured systems should be tailored to model capability rather than assuming universal best practices, providing evidence-based guidance for practitioners.
Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.
[33] Reasoning under Ambiguity: Uncertainty-Aware Multilingual Emotion Classification under Partial Supervision
Md. Mithun Hossaina, Mashary N. Alrasheedy, Nirban Bhowmick, Shamim Forhad, Md. Shakil Hossain, Sudipto Chaki, Md Shafiqul Islam
Main category: cs.CL
TL;DR: Uncertainty-aware framework for multilingual multi-label emotion classification that handles emotional ambiguity and incomplete supervision through entropy-based weighting and positive-unlabeled regularization.
Details
Motivation: Multilingual emotion identification faces challenges due to emotional ambiguity (multiple co-occurring emotional states) and incomplete supervision (missing or heterogeneous annotations). Existing methods assume fully observed labels and use deterministic learning, leading to biased learning under partial supervision.
Method: Proposes Reasoning under Ambiguity framework with: 1) shared multilingual encoder with language-specific optimization, 2) entropy-based ambiguity weighting that down-weights highly ambiguous training instances, 3) mask-aware objective with positive-unlabeled regularization for robust learning under partial supervision.
Result: Experiments on English, Spanish, and Arabic emotion classification benchmarks show consistent improvements over strong baselines across multiple metrics, with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
Conclusion: The uncertainty-aware framework effectively addresses emotional ambiguity and incomplete supervision in multilingual emotion classification, providing more reliable predictions and better handling of annotation uncertainty.
Abstract: Contemporary knowledge-based systems increasingly rely on multilingual emotion identification to support intelligent decision-making, yet they face major challenges due to emotional ambiguity and incomplete supervision. Emotion recognition from text is inherently uncertain because multiple emotional states often co-occur and emotion annotations are frequently missing or heterogeneous. Most existing multi-label emotion classification methods assume fully observed labels and rely on deterministic learning objectives, which can lead to biased learning and unreliable predictions under partial supervision. This paper introduces Reasoning under Ambiguity, an uncertainty-aware framework for multilingual multi-label emotion classification that explicitly aligns learning with annotation uncertainty. The proposed approach uses a shared multilingual encoder with language-specific optimization and an entropy-based ambiguity weighting mechanism that down-weights highly ambiguous training instances rather than treating missing labels as negative evidence. A mask-aware objective with positive-unlabeled regularization is further incorporated to enable robust learning under partial supervision. Experiments on English, Spanish, and Arabic emotion classification benchmarks demonstrate consistent improvements over strong baselines across multiple evaluation metrics, along with improved training stability, robustness to annotation sparsity, and enhanced interpretability.
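A minimal PyTorch rendering of the two stated ingredients, an entropy-based instance weight and a mask-aware objective with a simple positive-unlabeled regularizer, is sketched below; the specific weighting function, PU term, and class prior are our own assumptions.

```python
import torch
import torch.nn.functional as F

def ambiguity_weighted_pu_loss(logits, labels, observed_mask,
                               class_prior=0.1, pu_weight=0.5):
    """Hedged sketch of an entropy-weighted, mask-aware objective with a simple
    positive-unlabeled (PU) regularizer; the paper's exact terms may differ.
    logits, labels, observed_mask: (batch, num_emotions); mask is 1 where the
    emotion annotation exists, 0 where it is missing (NOT treated as negative)."""
    probs = torch.sigmoid(logits)

    # Instance ambiguity: mean binary entropy of predictions over emotions.
    entropy = -(probs * probs.clamp_min(1e-8).log()
                + (1 - probs) * (1 - probs).clamp_min(1e-8).log())
    ambiguity = entropy.mean(dim=1) / torch.log(torch.tensor(2.0))
    instance_weight = (1.0 - ambiguity).clamp(min=0.0)       # down-weight ambiguous rows

    # Supervised term only on observed labels (mask-aware).
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    sup = (bce * observed_mask).sum(dim=1) / observed_mask.sum(dim=1).clamp(min=1)

    # PU regularizer: predicted positive rate on unobserved entries ~ class prior.
    unobserved = 1.0 - observed_mask
    pos_rate = (probs * unobserved).sum() / unobserved.sum().clamp(min=1)
    pu_term = (pos_rate - class_prior).abs()

    return (instance_weight * sup).mean() + pu_weight * pu_term

logits = torch.randn(8, 6)
labels = torch.randint(0, 2, (8, 6)).float()
mask = torch.bernoulli(torch.full((8, 6), 0.7))
print(ambiguity_weighted_pu_loss(logits, labels, mask))
```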
[34] Transport and Merge: Cross-Architecture Merging for Large Language Models
Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng, Xiang Wang, An Zhang, Tat-Seng Chua
Main category: cs.CL
TL;DR: Cross-architecture model merging framework using optimal transport to transfer knowledge from large LLMs to smaller heterogeneous models
Details
Motivation: Real-world deployments often use smaller models trained on low-resource data, creating a gap with large high-resource LLMs. Need mechanisms to transfer knowledge from large models to smaller heterogeneous targets, which existing merging approaches can't handle due to architecture incompatibility.
Method: Proposes optimal transport-based framework that aligns activations to infer cross-neuron correspondences between heterogeneous models. Uses transport plans to guide direct weight-space fusion, requiring only a small set of inputs for effective transfer.
Result: Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
Conclusion: The framework enables effective knowledge transfer from large high-resource LLMs to heterogeneous low-resource targets, bridging the gap between model scaling and practical deployment constraints.
Abstract: Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
[35] A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering
Larissa Pusch, Alexandre Courtiol, Tim Conrad
Main category: cs.CL
TL;DR: LLM-powered interactive framework for generating and refining Cypher graph queries on Knowledge Graphs, improving accessibility while maintaining factual accuracy.
Details
Motivation: LLMs have limitations in knowledge-intensive domains (hallucinations, outdated info, limited explainability), and while text-based RAG helps, it struggles with multi-hop reasoning. Knowledge Graphs offer precise querying but require query language expertise, creating an accessibility barrier.
Method: Interactive framework where LLMs generate and explain Cypher graph queries, with users iteratively refining them through natural language. Applied to real-world KGs, with core evaluation using 90-query benchmark on synthetic movie KG measuring query explanation quality and fault detection across multiple LLMs.
Result: Framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor. Provides insight into how model performance varies across domains. Includes two smaller real-life experiments on Hyena KG and MaRDI KG.
Conclusion: Interactive LLM-KG framework bridges the gap between natural language accessibility and precise graph querying, making complex knowledge graphs more usable while maintaining accuracy and explainability.
Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require a knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
[36] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks
Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic
Main category: cs.CL
TL;DR: MT-GRPO: A multi-task adaptation of GRPO that dynamically balances task weights to optimize worst-task performance and uses ratio-preserving sampling for more reliable multi-task reasoning.
Details
Motivation: Standard RL-based post-training with GRPO works well for individual reasoning tasks but fails in multi-task settings where tasks compete for optimization resources, leading to imbalanced performance where some tasks dominate while others stagnate.
Method: Proposes MT-GRPO with two key components: (1) dynamic task weight adaptation to explicitly optimize worst-task performance and promote balanced progress, and (2) ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights.
Result: MT-GRPO consistently outperforms baselines in worst-task accuracy, achieving 16-28% absolute improvement over standard GRPO and 6% over DAPO, while maintaining competitive average accuracy. Also achieves 50% fewer training steps to reach 50% worst-task accuracy in 3-task setting.
Conclusion: MT-GRPO provides an effective solution for reliable multi-task reasoning in LLMs, addressing the imbalance problem in standard multi-task RL approaches and enabling more efficient achievement of balanced performance across diverse tasks.
Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
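The two stated components, dynamic worst-task-oriented weights and a ratio-preserving sampler, can be illustrated with a small NumPy sketch; the multiplicative-weights update and largest-remainder rounding below are illustrative choices, not the paper's exact rules.

```python
import numpy as np

def update_task_weights(weights, task_rewards, step_size=1.0):
    """Hedged sketch of dynamic task weighting that emphasizes the worst task:
    a multiplicative-weights step that upweights tasks with low recent reward."""
    weights = weights * np.exp(-step_size * np.asarray(task_rewards))
    return weights / weights.sum()

def ratio_preserving_counts(weights, batch_size):
    """Allocate prompts per task so batch composition matches the adapted weights
    (largest-remainder rounding keeps the total exactly at batch_size)."""
    raw = weights * batch_size
    counts = np.floor(raw).astype(int)
    remainder = batch_size - counts.sum()
    counts[np.argsort(-(raw - counts))[:remainder]] += 1
    return counts

w = np.ones(3) / 3
for rewards in ([0.8, 0.5, 0.2], [0.7, 0.6, 0.3]):   # task 2 keeps lagging
    w = update_task_weights(w, rewards)
print(np.round(w, 3), ratio_preserving_counts(w, batch_size=32))
```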
[37] CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models
Rui Jia, Ruiyi Lan, Fengrui Liu, Zhongxiang Dai, Bo Jiang, Jing Shao, Jingyuan Chen, Guandong Xu, Fei Wu, Min Zhang
Main category: cs.CL
TL;DR: CASTLE: A benchmark for evaluating personalized safety of LLMs in education, focusing on student-tailored safety risks across diverse student attributes.
Details
Motivation: Current LLMs produce homogeneous responses that ignore student heterogeneity in cognitive and psychological attributes, posing safety risks to vulnerable groups. Existing safety evaluations fail to capture divergent harms across different student attributes.
Method: Proposed Student-Tailored Personalized Safety concept and constructed CASTLE benchmark covering 15 educational safety risks and 14 student attributes (92,908 bilingual scenarios). Designed three evaluation metrics: Risk Sensitivity, Emotional Empathy, and Student Alignment.
Result: Experiments on 18 state-of-the-art LLMs show all models scored below average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.
Conclusion: CASTLE reveals significant challenges in LLM safety for personalized education, highlighting the need for models that can adapt responses to diverse student attributes to ensure educational safety.
Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological attributes, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.
[38] Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo, Paola Merlo
Main category: cs.CL
TL;DR: Transformer models’ ability to represent complex verb paradigms depends on tokenization strategies and language morphology characteristics.
Details
Motivation: To understand how transformer models represent complex verb paradigms in morphologically rich languages like Turkish and Hebrew, and how tokenization strategies affect this ability.
Method: Used Blackbird Language Matrices task on natural data to test monolingual and multilingual transformer models with different tokenization strategies (atomic, subword units, character-level, morpheme-aware segmentation).
Result: For Turkish (transparent morphology), both monolingual and multilingual models succeed with atomic or subword tokenization. For Hebrew (non-concatenative morphology), monolingual models with morpheme-aware segmentation perform well, while multilingual models with character-level tokenization fail. Performance improves on synthetic datasets across all models.
Conclusion: Tokenization strategy is crucial for transformer models to handle morphological complexity, with language-specific segmentation needed for non-concatenative morphology like Hebrew.
Abstract: We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish – with its transparent morphological markers – both monolingual and multilingual models succeed, either when tokenization is atomic or when it breaks words into small subword units. For Hebrew, instead, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets, in all models.
[39] MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations
Congbo Ma, Yichun Zhang, Yousef Al-Jazzazi, Ahamed Foisal, Laasya Sharma, Yousra Sadqi, Khaled Saleh, Jihad Mallat, Farah E. Shamout
Main category: cs.CL
TL;DR: MedErrBench is a multilingual benchmark for detecting, localizing, and correcting errors in clinical text across English, Arabic, and Chinese, developed with clinician guidance to evaluate LLMs in healthcare applications.
Details
Motivation: Existing clinical text errors can lead to serious adverse consequences like misdiagnosis or incorrect treatment. With LLMs increasingly used in healthcare, comprehensive multilingual benchmarks are needed but remain scarce across diverse languages and contexts.
Method: Developed MedErrBench with clinician guidance using an expanded taxonomy of ten common error types. Created natural clinical cases in English, Arabic, and Chinese annotated by domain experts. Evaluated general-purpose, language-specific, and medical-domain LLMs across error detection, localization, and correction tasks.
Result: Revealed notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. Performance varied across models and languages.
Conclusion: MedErrBench advances multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The benchmark and evaluation protocols are publicly available to support further research.
Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.
[40] Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation
Shuting Jiang, Ran Song, Yuxin Huang, Yan Xiang, Yantuan Xian, Shengxiang Gao, Zhengtao Yu
Main category: cs.CL
TL;DR: A neuron-efficient fine-tuning framework for multi-domain machine translation that identifies consensus-aligned neurons in LLMs to capture both general translation patterns and domain-specific nuances.
Details
Motivation: Large language models show impressive translation capabilities but struggle with domain adaptation. Existing methods like in-context learning and parameter-efficient fine-tuning suffer from domain shift, parameter interference, and limited generalization across diverse domains.
Method: Proposes a neuron-efficient fine-tuning framework that identifies consensus-aligned neurons by maximizing mutual information between neuron behavior and domain features. Fine-tunes LLMs guided by these selected neurons to mitigate parameter interference and domain-specific overfitting.
Result: Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains show the method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
Conclusion: The proposed neuron-efficient fine-tuning framework effectively addresses domain adaptation challenges in LLMs for machine translation, enabling better capture of both generalizable patterns and domain-specific nuances while mitigating parameter interference.
Abstract: Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains show that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.
[41] OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo
Main category: cs.CL
TL;DR: OmniMoE introduces vector-level Atomic Experts for extreme granularity in MoE architectures, with system-algorithm co-design to overcome routing complexity and memory access challenges, achieving both high accuracy and fast inference.
Details
Motivation: Existing Mixture-of-Experts architectures face a trade-off between expert specialization granularity and hardware execution efficiency. The authors aim to push expert granularity to its logical extreme while maintaining computational efficiency.
Method: OmniMoE uses vector-level Atomic Experts with a shared dense MLP branch. It employs system-algorithm co-design: (1) Cartesian Product Router that reduces routing complexity from O(N) to O(sqrt(N)), and (2) Expert-Centric Scheduling that transforms scattered memory lookups into dense matrix operations.
Result: On seven benchmarks, OmniMoE (1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (DeepSeekMoE) and fine-grained (PEER) baselines. It reduces inference latency from 73ms to 6.7ms (10.9x speedup) compared to PEER.
Conclusion: Massive-scale fine-grained MoE can be both fast and accurate through careful system-algorithm co-design, with OmniMoE demonstrating superior performance and efficiency over existing MoE approaches.
Abstract: Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(sqrt(N)); and (ii) Expert-Centric Scheduling that inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Validated on seven benchmarks, OmniMoE (with 1.7B active parameters) achieves 50.9% zero-shot accuracy, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73ms to 6.7ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
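For intuition, here is a minimal, hypothetical PyTorch sketch of the Cartesian-product routing idea: score R "row" and C "column" prototypes instead of all N = R x C atomic experts, so per-token routing cost grows with R + C, roughly O(sqrt(N)) when R = C. The class name, the additive score combination, and the top-k pairing are assumptions, not the authors' released code.

```python
# Hypothetical sketch of a Cartesian-product router (not OmniMoE's implementation).
import torch
import torch.nn as nn

class CartesianProductRouter(nn.Module):
    def __init__(self, d_model: int, n_rows: int, n_cols: int, top_k: int = 8):
        super().__init__()
        self.row_proj = nn.Linear(d_model, n_rows, bias=False)
        self.col_proj = nn.Linear(d_model, n_cols, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model)
        row_scores = self.row_proj(x)            # (batch, n_rows)
        col_scores = self.col_proj(x)            # (batch, n_cols)
        # Assumed combination: score of atomic expert (i, j) is row_i + col_j.
        # Restrict to the top-k rows and columns so we never materialize the
        # full n_rows * n_cols score matrix per token.
        r_val, r_idx = row_scores.topk(self.top_k, dim=-1)
        c_val, c_idx = col_scores.topk(self.top_k, dim=-1)
        pair_scores = r_val.unsqueeze(-1) + c_val.unsqueeze(-2)   # (batch, k, k)
        flat_scores, flat_idx = pair_scores.flatten(1).topk(self.top_k, dim=-1)
        rows = r_idx.gather(1, flat_idx // self.top_k)
        cols = c_idx.gather(1, flat_idx % self.top_k)
        expert_ids = rows * col_scores.shape[-1] + cols           # atomic-expert index
        weights = torch.softmax(flat_scores, dim=-1)
        return expert_ids, weights
```

The point of the sketch is only the complexity argument: two small linear layers replace one linear layer over the full atomic-expert vocabulary.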
[42] CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering
Hao Yang, Zhiyu Yang, Xupeng Zhang, Wei Wei, Yunjie Zhang, Lin Yang
Main category: cs.CL
TL;DR: CompactRAG is a retrieval-augmented generation framework that decouples offline corpus restructuring into atomic QA pairs from online reasoning, reducing LLM calls to just two per query regardless of reasoning hops.
Details
Motivation: Existing multi-hop RAG systems are inefficient due to alternating between retrieval and reasoning at each step, leading to repeated LLM calls, high token consumption, and unstable entity grounding across hops.
Method: Offline: an LLM reads the corpus once and converts it into an atomic QA knowledge base with minimal, fine-grained question-answer pairs. Online: complex queries are decomposed and rewritten for entity consistency, resolved through dense retrieval followed by RoBERTa-based answer extraction, with the LLM invoked only twice in total.
Result: Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue show competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines.
Conclusion: CompactRAG provides a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora by minimizing LLM usage while maintaining performance.
Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available on GitHub.
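A hedged sketch of what the online stage could look like with exactly two LLM calls per query. `llm`, `retrieve`, and `extract_answer` are assumed callables standing in for an LLM API, the dense retriever over the atomic QA base, and a RoBERTa-style extractive reader; the prompts and the "#i" placeholder convention are illustrative, not the paper's.

```python
# Hypothetical sketch of a CompactRAG-style online stage: two LLM calls total.
from typing import Callable, List

def compact_rag_answer(
    question: str,
    llm: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    extract_answer: Callable[[str, List[str]], str],
) -> str:
    # LLM call #1: decompose the multi-hop question into sub-questions.
    decomposition = llm(
        f"Decompose into self-contained sub-questions, one per line:\n{question}"
    )
    sub_questions = [q.strip() for q in decomposition.splitlines() if q.strip()]

    # Resolve each sub-question with dense retrieval + extractive QA (no LLM calls).
    evidence = []
    for sq in sub_questions:
        # Rewrite placeholders like "#1" with earlier answers for entity consistency.
        for i, (_, ans) in enumerate(evidence, start=1):
            sq = sq.replace(f"#{i}", ans)
        passages = retrieve(sq)                # hits in the atomic QA knowledge base
        answer = extract_answer(sq, passages)  # span extraction, e.g. a RoBERTa reader
        evidence.append((sq, answer))

    # LLM call #2: synthesize the final answer from the resolved sub-questions.
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in evidence)
    return llm(f"Using the facts below, answer: {question}\n{context}")
```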
[43] LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards
Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang
Main category: cs.CL
TL;DR: LongR: A reinforcement learning framework for LLM long-context reasoning that integrates dynamic “Think-and-Read” mechanism with contextual density rewards based on relative information gain.
Details
Motivation: Existing RL approaches for LLM reasoning in long-context scenarios (like long-dialogue understanding and structured data analysis) rely on sparse, outcome-only rewards which provide insufficient guidance for complex reasoning. There's a need for more effective reward mechanisms to enhance long-context performance.
Method: Proposes LongR framework with two key components: 1) Dynamic “Think-and-Read” mechanism that interleaves reasoning with document consultation, and 2) Contextual density reward based on relative information gain to quantify the utility of relevant documents.
Result: Achieves 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench. Shows robust efficiency in navigating extensive contexts and enhances performance across diverse RL algorithms (DAPO, GSPO).
Conclusion: LongR effectively addresses the limitations of sparse rewards in long-context reasoning by providing more informative guidance through contextual density rewards and dynamic document consultation, leading to significant performance improvements across multiple benchmarks.
Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic “Think-and-Read” mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model’s robustness against distractors.
[44] Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors
Adnan Al Ali, Jindřich Helcl, Jindřich Libovický
Main category: cs.CL
TL;DR: Study examines LLM-generated text detectors for Czech language, finds no systematic bias against non-native speakers and shows contemporary detectors work effectively without relying on perplexity features.
Details
Motivation: Address concerns about LLM misuse in academia and potential bias in automated detection systems that may falsely flag essays from non-native speakers as AI-generated due to their lower perplexity scores.
Method: Revisits previous claims about detector bias two years later in a Czech-language context. Analyzes perplexity of texts from native vs. non-native speakers, examines detectors from three separate families, and investigates whether contemporary detectors rely on perplexity features.
Result: Shows perplexity of Czech texts from non-native speakers is not lower than native speakers. Finds no systematic bias against non-native speakers across three detector families. Demonstrates contemporary detectors operate effectively without relying on perplexity features.
Conclusion: Contemporary LLM-generated text detectors for Czech language show no systematic bias against non-native speakers and function effectively using features beyond perplexity, addressing previous concerns about fairness in automated detection systems.
Abstract: LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
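As an illustration of the kind of perplexity comparison the study revisits, the sketch below scores native and non-native texts with an off-the-shelf causal LM; the model name and the data lists are placeholders, not the paper's setup (a Czech-capable model and real essays would be used in practice).

```python
# Illustrative perplexity comparison (not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model, tokenizer) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()  # mean token NLL -> perplexity

model_name = "gpt2"  # placeholder; a multilingual or Czech LM is assumed in practice
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

native = ["..."]      # essays by native speakers (placeholders)
non_native = ["..."]  # essays by non-native speakers (placeholders)
for group, texts in [("native", native), ("non-native", non_native)]:
    scores = [perplexity(t, lm, tok) for t in texts]
    print(group, sum(scores) / len(scores))
```

The bias claim under test is that the non-native group would show systematically lower average perplexity; the paper reports that, for Czech, it does not.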
[45] Reinforcement World Model Learning for LLM-based Agents
Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jiangfeng Gao, Zhou Yu
Main category: cs.CL
TL;DR: RWML learns action-conditioned world models for LLM-based agents using sim-to-real gap rewards to align simulated next states with observed environment states in embedding space.
Details
Motivation: LLMs struggle with anticipating action consequences and adapting to environment dynamics in agentic settings, highlighting the need for world-modeling capabilities in LLM-based agents.
Method: Reinforcement World Model Learning (RWML) - a self-supervised method that learns action-conditioned world models on textual states using sim-to-real gap rewards. It aligns simulated next states with realized next states in a pre-trained embedding space, providing more robust training than next-state token prediction.
Result: Significant gains over base model on ALFWorld and τ² Bench. When combined with task-success rewards, outperforms direct task-success reward RL by 6.9 and 5.7 points respectively, while matching expert-data training performance.
Conclusion: RWML provides an effective self-supervised approach for learning world models in LLM-based agents, improving their ability to anticipate consequences and adapt to environment dynamics without requiring expert demonstrations.
Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $τ^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $τ^2$ Bench respectively, while matching the performance of expert-data training.
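A minimal sketch, under assumptions, of a sim-to-real gap reward computed in a frozen embedding space: score how close the agent's simulated next state is to the next state actually observed from the environment. The encoder choice and cosine-similarity shaping are illustrative, not the paper's exact design.

```python
# Assumed sim-to-real gap reward: embedding similarity between simulated and
# observed next states (illustrative encoder and reward shaping).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any frozen text encoder

def sim_to_real_reward(simulated_next_state: str, observed_next_state: str) -> float:
    sim_emb, obs_emb = embedder.encode(
        [simulated_next_state, observed_next_state], convert_to_tensor=True
    )
    # Cosine similarity in [-1, 1]; higher means the internal world model
    # predicted the environment's actual response more faithfully.
    return util.cos_sim(sim_emb, obs_emb).item()

# Example: reward the world model's prediction for the action "open fridge".
r = sim_to_real_reward(
    "The fridge is now open and contains a tomato and milk.",
    "You open the fridge. Inside you see a tomato and a carton of milk.",
)
```

Because the comparison happens in embedding space rather than token space, a paraphrased but semantically correct prediction still earns a high reward, which is the robustness argument the abstract makes against next-state token prediction.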
[46] OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin
Main category: cs.CL
TL;DR: OdysseyArena introduces a new evaluation framework for autonomous agents focusing on long-horizon, inductive interactions where agents must discover latent transition laws from experience, rather than just following explicit rules.
Details
Motivation: Existing evaluations for autonomous agents primarily use deductive paradigms with explicit rules and static goals, neglecting the inductive necessity for agents to autonomously discover latent transition laws from experience, which is crucial for agentic foresight and strategic coherence.
Method: The authors introduce OdysseyArena with four primitives that translate abstract transition dynamics into concrete interactive environments. They establish OdysseyArena-Lite with 120 tasks for standardized benchmarking of inductive efficiency and long-horizon discovery, and OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (>200 steps).
Result: Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit significant deficiencies in inductive scenarios, identifying a critical bottleneck in autonomous discovery in complex environments.
Conclusion: The paper identifies a critical gap in current agent evaluation and proposes OdysseyArena as a solution to better assess agents’ inductive reasoning and long-horizon discovery capabilities, revealing substantial limitations in current LLM-based agents.
Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent’s inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
[47] RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang
Main category: cs.CL
TL;DR: RRAttention: A novel dynamic sparse attention method using head round-robin sampling to reduce quadratic complexity while maintaining query independence and global pattern discovery, achieving near-full attention performance with 2.4× speedup.
Details
Motivation: The quadratic complexity of attention mechanisms is a critical bottleneck for LLMs processing long contexts. Existing dynamic sparse attention methods face trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or having high computational overhead.
Method: RRAttention uses a head round-robin (RR) sampling strategy that rotates query sampling positions across attention heads within each stride. This maintains query independence while enabling efficient global pattern discovery through stride-level aggregation. The method reduces complexity from O(L²) to O(L²/S²) and uses adaptive Top-τ selection for optimal sparsity.
Result: Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) show RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4× speedup at 128K context length and outperforming existing dynamic sparse attention methods.
Conclusion: RRAttention simultaneously achieves all desirable properties for dynamic sparse attention - no preprocessing, global evaluation, query independence, and low overhead - making it an effective solution for efficient long-context processing in LLMs.
Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
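A hypothetical sketch of the head round-robin sampling pattern: within every stride of S query positions, each head probes a different, rotated offset, so the heads jointly cover a stride while each individual head scores only L/S queries. The exact offset rule and the downstream block-importance aggregation are assumptions.

```python
# Assumed round-robin query sampling pattern (not RRAttention's implementation).
import torch

def round_robin_query_positions(seq_len: int, n_heads: int, stride: int) -> torch.Tensor:
    """Return a (n_heads, seq_len // stride) tensor of sampled query indices."""
    n_strides = seq_len // stride
    stride_starts = torch.arange(n_strides) * stride             # (n_strides,)
    heads = torch.arange(n_heads).unsqueeze(1)                    # (n_heads, 1)
    # Head h samples offset (h + stride_index) % stride inside each stride,
    # so sampled positions rotate across both heads and strides.
    offsets = (heads + torch.arange(n_strides)) % stride          # (n_heads, n_strides)
    return stride_starts + offsets                                # broadcast add

# e.g. 8 heads, 32 queries, stride 8: each head scores only 4 of 32 queries,
# yet together the heads touch a different position in every stride.
positions = round_robin_query_positions(seq_len=32, n_heads=8, stride=8)
```

Scoring only the sampled queries against key blocks is where the quoted O(L^2) to O(L^2/S^2) reduction comes from; the full attention is then computed only on the blocks the aggregated samples flag as important.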
[48] xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection
Adrián Girón, Pablo Miralles, Javier Huertas-Tato, Sergio D’Antonio, David Camacho
Main category: cs.CL
TL;DR: xList-Hate is a diagnostic framework that decomposes hate speech detection into explicit concept-level questions answered by LLMs, then aggregates them via interpretable decision trees for robust, explainable content moderation.
Details
Motivation: Current hate speech detection models overfit to dataset-specific definitions and lack robustness under domain shift and annotation noise. There's a need for more transparent, interpretable, and robust approaches that can handle varying legal frameworks and annotation guidelines.
Method: Decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in normative criteria. Each question is independently answered by an LLM to produce binary diagnostic representations. These signals are aggregated by a lightweight, fully interpretable decision tree.
Result: Consistently improves cross-dataset robustness and performance under domain shift compared to supervised methods. Provides fine-grained interpretability through explicit decision paths and factor-level analysis, and shows less sensitivity to annotation inconsistency and contextual ambiguity.
Conclusion: Reframing hate speech detection as a diagnostic reasoning task rather than monolithic classification provides a robust, explainable, and extensible alternative for content moderation.
Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate xList-Hate across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, our approach consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
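A minimal sketch of the checklist-plus-tree recipe, assuming a generic `ask_llm(question, text) -> bool` helper; the checklist items below are illustrative, not the paper's exact questions.

```python
# Assumed checklist-and-tree aggregation in the spirit of xList-Hate.
from sklearn.tree import DecisionTreeClassifier

CHECKLIST = [
    "Does the text target a person or group based on a protected attribute?",
    "Does the text contain slurs or dehumanizing language?",
    "Does the text incite or condone violence?",
    "Is the offensive content quoted or condemned rather than endorsed?",
]

def diagnostic_vector(text: str, ask_llm) -> list[int]:
    # One independent LLM call per concept-level question; answers are binary.
    return [int(ask_llm(q, text)) for q in CHECKLIST]

def train_aggregator(texts, labels, ask_llm) -> DecisionTreeClassifier:
    X = [diagnostic_vector(t, ask_llm) for t in texts]
    # A shallow tree keeps every decision path human-readable and auditable.
    tree = DecisionTreeClassifier(max_depth=3)
    tree.fit(X, labels)
    return tree
```

Because only the tiny tree is fit on labeled data, swapping in a different dataset's definition of hate speech means refitting the aggregator, not re-prompting or fine-tuning the LLM.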
[49] EuroLLM-22B: Technical Report
Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, André F. T. Martins
Main category: cs.CL
TL;DR: EuroLLM-22B is a multilingual large language model trained from scratch to support all 24 official EU languages plus 11 additional languages, addressing underrepresentation of European languages in existing open LLMs.
Details
Motivation: European languages are underrepresented and underserved in existing open large language models, creating a need for models that better serve European citizens and their linguistic diversity.
Method: Developed EuroLLM-22B from scratch with careful tokenizer design, architectural specifications, data filtering, and training procedures optimized for multilingual performance across all 35 supported languages.
Result: EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation across multilingual benchmarks, achieving results competitive with models of comparable size.
Conclusion: EuroLLM-22B successfully addresses the multilingual needs of European citizens, and the authors release models, datasets, and code to support future multilingual AI research.
Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B’s development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.
[50] Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li
Main category: cs.CL
TL;DR: FaithRL uses step-level reinforcement learning with explicit faithfulness rewards and implicit truncated resampling to reduce hallucinations in small reasoning models’ chain-of-thought reasoning.
Details
Motivation: Small reasoning models are prone to faithfulness hallucinations in intermediate reasoning steps, especially when using outcome-based rewards that can reinforce unfaithful reasoning if the final answer is correct.
Method: Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL) introduces step-level supervision via explicit faithfulness rewards from a process reward model, combined with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes.
Result: Experiments across multiple SRMs and Open-Book QA benchmarks show FaithRL consistently reduces hallucinations in both CoT and final answers, leading to more faithful and reliable reasoning.
Conclusion: FaithRL effectively addresses faithfulness issues in small reasoning models through step-level reinforcement learning with explicit process rewards and implicit contrastive signals.
Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
[51] Codified Finite-state Machines for Role-playing
Letian Peng, Yupeng Hou, Kun Zhou, Jingbo Shang
Main category: cs.CL
TL;DR: CFSMs/CPFSMs use LLMs to automatically convert character profiles into finite-state machines for consistent role-playing, extending to probabilistic models for uncertainty handling.
Details
Motivation: Existing prompting methods for role-playing with LLMs capture surface actions but fail to track latent character states that drive consistent interactions, creating a need for better state modeling.
Method: Introduces Codified Finite-State Machines (CFSMs) that automatically convert textual character profiles into FSMs using LLM-based coding, and extends to Codified Probabilistic Finite-State Machines (CPFSMs) with probabilistic transitions for uncertainty handling.
Result: CFSMs and CPFSMs outperform baseline methods in both synthetic evaluations and real-world role-playing scenarios, demonstrating effectiveness in structured tasks and open-ended stochastic state exploration.
Conclusion: The framework successfully bridges traditional finite-state machines with modern LLM capabilities, providing interpretable structures that enforce character consistency while handling uncertainty in open-ended role-playing scenarios.
Abstract: Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
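A hypothetical sketch of the CPFSM data structure: in the paper the states and transition distributions are codified by an LLM from the character profile, whereas here they are hard-coded placeholders to show how probabilistic transitions would be represented and sampled.

```python
# Assumed CPFSM structure (placeholder states; the paper generates these via LLM coding).
import random

class CPFSM:
    def __init__(self, transitions: dict, start: str):
        # transitions[state][event] -> {next_state: probability}
        self.transitions = transitions
        self.state = start

    def step(self, event: str) -> str:
        dist = self.transitions.get(self.state, {}).get(event)
        if dist:  # sample the next latent state instead of following one fixed edge
            states, probs = zip(*dist.items())
            self.state = random.choices(states, weights=probs, k=1)[0]
        return self.state

guard = CPFSM(
    transitions={
        "suspicious": {"bribe_offered": {"hostile": 0.7, "tempted": 0.3}},
        "tempted":    {"bribe_offered": {"complicit": 0.6, "hostile": 0.4}},
    },
    start="suspicious",
)
guard.step("bribe_offered")  # the sampled latent state conditions the next in-character reply
```

A plain CFSM is the degenerate case where every distribution puts probability 1 on a single next state.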
[52] KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu
Main category: cs.CL
TL;DR: KV-CoRE is an SVD-based method for quantifying the low-rank compressibility of kv-caches in LLMs, enabling systematic analysis across models, datasets, and languages.
Details
Motivation: KV-caches in large language models consume significant GPU memory bandwidth during autoregressive decoding, especially with long contexts. Existing compression approaches often overlook the data-dependent nature of kv-caches and their variation across different layers.
Method: KV-CoRE uses Singular Value Decomposition (SVD) to compute optimal low-rank approximations of kv-caches under the Frobenius norm. It’s gradient-free, incremental, and enables efficient dataset-level, layer-wise evaluation across multiple models and datasets.
Result: Analysis across five English domains and sixteen languages revealed systematic patterns linking compressibility to model architecture, training data, and language coverage. Normalized Effective Rank was shown to strongly correlate with performance degradation under compression.
Conclusion: The study establishes a principled evaluation framework and the first large-scale benchmark for kv-cache compressibility in LLMs, offering insights for dynamic, data-aware compression strategies and data-centric model development.
Abstract: Large language models rely on kv-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of kv-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of kv-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of kv-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
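A sketch in the spirit of this probe, assuming the Roy-Vetterli entropy-based effective rank (the paper's exact normalization may differ): stack a layer's cached keys or values into a matrix, take its singular values, and report an effective rank normalized by the full rank.

```python
# Assumed SVD-based compressibility probe (not the paper's exact metric).
import torch

def normalized_effective_rank(kv: torch.Tensor) -> float:
    """kv: (num_tokens, hidden_dim) matrix of cached keys or values for one layer."""
    s = torch.linalg.svdvals(kv.float())            # singular values, descending
    p = s / s.sum()                                  # treat the spectrum as a distribution
    entropy = -(p * torch.log(p + 1e-12)).sum()
    effective_rank = torch.exp(entropy)              # Roy & Vetterli effective rank
    return (effective_rank / min(kv.shape)).item()   # near 1.0 = incompressible, near 0 = low-rank

# Low values suggest the cache for this layer and dataset tolerates aggressive
# low-rank compression; the best rank-r approximation itself comes from
# truncating the same SVD, which is optimal under the Frobenius norm.
```

Because the probe only needs singular values, it can be run layer by layer over a whole dataset without any gradients or retraining.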
[53] Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions
Léo Labat, Etienne Ollion, François Yvon
Main category: cs.CL
TL;DR: This paper investigates language-induced variation in value-laden MCQ responses of multilingual LLMs, finding that while larger instruction-tuned models show higher consistency, language-specific behavior emerges selectively on certain questions.
Details
Motivation: The paper aims to understand whether multilingual LLMs behave consistently across languages (like theoretical polyglots) or show language-dependent variations in value-laden responses (like multiple monolingual models), addressing a gap in studying language effects on value expression beyond factual recall.
Method: Created Multilingual European Value Survey (MEVS) with human-translated questions in 8 European languages. Tested over 30 multilingual LLMs of various sizes and alignment status under controlled prompt variations including answer order, symbol type, and tail character.
Result: Larger instruction-tuned models show higher overall consistency, but response robustness varies greatly across questions. Language-specific behavior emerges in all consistent instruction-fine-tuned models, but only on certain questions, suggesting selective effects of preference fine-tuning.
Conclusion: Multilingual LLMs exhibit complex language-dependent variations in value-laden responses, with consistency patterns varying by model size, instruction tuning, and specific question types, warranting further study of selective preference fine-tuning effects.
Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.
[54] Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training
Junxiao Liu, Zhijun Wang, Yixiao Li, Zhejian Lai, Liqian Huang, Xin Huang, Xue Han, Junlan Feng, Shujian Huang
Main category: cs.CL
TL;DR: TRIT is a self-improving framework that integrates translation training into multilingual reasoning to improve both question understanding and response generation across languages without external feedback or additional multilingual data.
Details
Motivation: Long reasoning models struggle in multilingual settings - they tend to reason in English for non-English questions, and when constrained to reason in the question language, accuracies drop substantially. This is caused by limited abilities in both multilingual question understanding and multilingual reasoning.
Method: Proposes TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates translation training into multilingual reasoning. The method jointly enhances multilingual question understanding and response generation without external feedback or additional multilingual data.
Result: On MMATH, the method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Cross-lingual question alignment improves by over 10 percentage points, and translation quality improves for both mathematical questions and general-domain text (up to 8.4 COMET points on FLORES-200).
Conclusion: Integrating translation training into multilingual reasoning effectively addresses both multilingual question understanding and reasoning challenges, leading to significant improvements in multilingual reasoning performance without requiring additional multilingual data.
Abstract: Long reasoning models often struggle in multilingual settings: they tend to reason in English for non-English questions; when constrained to reasoning in the question language, accuracies drop substantially. The struggle is caused by the limited abilities for both multilingual question understanding and multilingual reasoning. To address both problems, we propose TRIT (Translation-Reasoning Integrated Training), a self-improving framework that integrates the training of translation into multilingual reasoning. Without external feedback or additional multilingual data, our method jointly enhances multilingual question understanding and response generation. On MMATH, our method outperforms multiple baselines by an average of 7 percentage points, improving both answer correctness and language consistency. Further analysis reveals that integrating translation training improves cross-lingual question alignment by over 10 percentage points and enhances translation quality for both mathematical questions and general-domain text, with gains up to 8.4 COMET points on FLORES-200.
[55] Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space
Felipe D. Toro-Hernández, Jesuino Vieira Filho, Rodrigo M. Cabral-Carvalho
Main category: cs.CL
TL;DR: A framework representing concept production as navigation through semantic embedding space, using transformer models to extract geometric and dynamical metrics from semantic trajectories.
Details
Motivation: To investigate how humans traverse semantic knowledge space by developing a computational framework that represents concept production as navigation through embedding space, bridging cognitive modeling with learned representations.
Method: Construct participant-specific semantic trajectories using cumulative embeddings from transformer text embedding models, extract geometric and dynamical metrics (distance to next, distance to centroid, entropy, velocity, acceleration), and evaluate on four multilingual datasets across different property generation tasks.
Result: The framework distinguishes between clinical groups and concept types across different languages, with cumulative embeddings working best for longer trajectories. Different embedding models yielded similar results despite different training pipelines.
Conclusion: Semantic navigation can be effectively modeled as structured trajectories through embedding space, establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and artificial cognition assessment.
Abstract: Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, Property listing task in Italian, and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, we bridge cognitive modeling with learned representations and establish a pipeline for quantifying semantic representation dynamics, with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
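An assumed (not the authors') implementation of the cumulative-embedding trajectory and a few of its metrics, to make the geometric framing concrete; the entropy metric is omitted because its exact definition is not specified here.

```python
# Illustrative trajectory metrics over cumulative embeddings of produced concepts.
import numpy as np

def trajectory_metrics(word_embeddings: np.ndarray) -> dict:
    """word_embeddings: (n_words, dim), one embedding per produced concept, in order."""
    n = len(word_embeddings)
    # Cumulative embedding after word i = mean of embeddings up to and including i.
    cumulative = np.cumsum(word_embeddings, axis=0) / np.arange(1, n + 1)[:, None]
    steps = np.diff(cumulative, axis=0)                     # movement between trajectory states
    velocity = np.linalg.norm(steps, axis=1)                # step length per produced word
    acceleration = np.diff(velocity)                        # change in step length
    centroid = cumulative.mean(axis=0)
    return {
        "distance_to_next": np.linalg.norm(np.diff(word_embeddings, axis=0), axis=1),
        "distance_to_centroid": np.linalg.norm(cumulative - centroid, axis=1),
        "velocity": velocity,
        "acceleration": acceleration,
    }
```

Each participant's word list becomes one trajectory, and group- or condition-level comparisons are then made on these scalar series.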
[56] DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang
Main category: cs.CL
TL;DR: DSB introduces dynamic block scheduling for diffusion LLMs to adapt to semantic difficulty, improving both generation quality and inference efficiency through training-free methods.
Details
Motivation: Fixed block schedules in diffusion LLMs are suboptimal because they don't adapt to semantic difficulty, forcing premature commitments on uncertain positions while delaying easy decisions near boundaries.
Method: Proposes Dynamic Sliding Block (DSB) with dynamic block sizes based on semantic difficulty, plus DSB Cache for efficient KV-cache management - both training-free.
Result: Extensive experiments show DSB with DSB Cache consistently improves generation quality and inference efficiency across multiple models and benchmarks.
Conclusion: Dynamic adaptation of block scheduling to semantic difficulty is crucial for reliable and efficient inference in diffusion LLMs.
Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.
[57] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies
Panagiotis Kaliosis, Adithya V Ganesan, Oscar N. E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz
Main category: cs.CL
TL;DR: LLMs can assess PTSD severity from narratives, with accuracy influenced by contextual knowledge, reasoning effort, model size, and ensemble methods.
Details
Motivation: To understand what factors affect the accuracy of LLMs when used in zero-shot fashion for mental health assessment, specifically PTSD severity evaluation from natural language narratives.
Method: Evaluated 11 state-of-the-art LLMs on a clinical dataset of 1,437 PTSD narratives. Systematically varied contextual knowledge (definitions, distribution summaries, interview questions) and modeling strategies (zero-shot vs. few-shot, reasoning effort, model sizes, structured vs. direct prediction, output rescaling, 9 ensemble methods).
Result: LLMs most accurate with detailed construct definitions and narrative context; increased reasoning effort improves accuracy; open-weight models plateau beyond 70B parameters while closed-weight models improve with newer generations; best performance achieved by ensembling supervised models with zero-shot LLMs.
Conclusion: Choice of contextual knowledge and modeling strategies is crucial for deploying LLMs to accurately assess mental health conditions like PTSD.
Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge like subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) performance of open-weight models (Llama, Deepseek) plateaus beyond 70B parameters while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.
[58] Multi-Token Prediction via Self-Distillation
John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Main category: cs.CL
TL;DR: Online distillation method converts pretrained autoregressive language models into faster multi-token prediction models without auxiliary components or complex inference pipelines.
Details
Motivation: Existing acceleration techniques like speculative decoding require training auxiliary speculator models and complex inference pipelines, creating deployment challenges.
Method: Simple online distillation objective that transforms a pretrained autoregressive language model from single-token prediction to multi-token prediction while maintaining the same implementation.
Result: Models achieve >3× faster decoding on GSM8K with <5% accuracy drop compared to single token decoding.
Conclusion: The approach enables significant inference acceleration without requiring auxiliary models or specialized inference code, maintaining deployment simplicity.
Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average with a $<5\%$ drop in accuracy relative to single-token decoding performance.
[59] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Main category: cs.CL
TL;DR: BudgetMem is a runtime agent memory framework with explicit budget control that structures memory as modules with three budget tiers (Low/Mid/High) and uses a lightweight RL-trained router to balance performance and cost.
Details
Motivation: Existing LLM agent memory systems rely on offline, query-agnostic construction that can be inefficient and discard critical information, while runtime alternatives incur substantial overhead with limited control over performance-cost trade-offs.
Method: BudgetMem structures memory processing as modules with three budget tiers, using a lightweight neural router trained with reinforcement learning to perform budget-tier routing across modules. Three tiering strategies are explored: implementation complexity, inference behavior, and model size.
Result: BudgetMem surpasses strong baselines in high-budget settings on LoCoMo, LongMemEval, and HotpotQA, and delivers better accuracy-cost frontiers under tighter budgets. Analysis reveals strengths/weaknesses of different tiering strategies under varying budget regimes.
Conclusion: BudgetMem provides an effective framework for explicit, query-aware performance-cost control in LLM agent memory systems, with modular budget-tiering strategies that offer flexible trade-offs between accuracy and computational cost.
Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.
[60] DFlash: Block Diffusion for Flash Speculative Decoding
Jian Chen, Yesheng Liang, Zhijian Liu
Main category: cs.CL
TL;DR: DFlash is a speculative decoding framework that uses a lightweight block diffusion model for parallel drafting to accelerate autoregressive LLM inference, achieving 6x speedup with higher acceptance rates than existing methods.
Details
Motivation: Autoregressive LLMs suffer from sequential decoding causing high inference latency and poor GPU utilization. While speculative decoding helps, existing methods still rely on autoregressive drafting which remains sequential. Diffusion LLMs offer parallel generation but typically underperform compared to autoregressive models.
Method: DFlash uses a lightweight block diffusion model for parallel drafting in a single forward pass. The draft model is conditioned on context features extracted from the target LLM, enabling efficient drafting with high-quality outputs and higher acceptance rates.
Result: DFlash achieves over 6x lossless acceleration across various models and tasks, delivering up to 2.5x higher speedup than state-of-the-art speculative decoding method EAGLE-3.
Conclusion: DFlash demonstrates that diffusion-based parallel drafting can significantly accelerate autoregressive LLM inference while maintaining quality, offering a promising alternative to autoregressive drafting methods.
Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
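Code sketch: the generic draft-and-verify loop that speculative decoding relies on, with the paper's block-diffusion drafter and target-feature conditioning abstracted behind the hypothetical callables draft_block and target_check; this is not DFlash's code, only the greedy acceptance logic such a drafter would plug into.
```python
def speculative_decode(prompt, draft_block, target_check, k=8, max_new=256):
    """draft_block(seq, k): k proposed tokens from the drafter in one pass.
    target_check(seq, draft): the target model's greedy next token after each
    prefix seq, seq + draft[:1], ..., seq + draft[:k-1], in one batched pass."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        draft = draft_block(seq, k)          # parallel draft
        checks = target_check(seq, draft)    # parallel verification
        for d, t in zip(draft, checks):
            if d == t:
                seq.append(d)                # draft token verified, keep going
            else:
                seq.append(t)                # first mismatch: take the target's token
                break
    return seq
```
Each iteration costs one drafter pass plus one target pass yet can emit up to k tokens, which is where the speedup comes from; higher acceptance rates mean more of the k draft tokens survive verification.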
[61] LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
Main category: cs.CL
TL;DR: LibMoE is a unified framework for efficient and reproducible Mixture of Experts research that enables comprehensive analysis of routing dynamics, initialization effects, and training regime differences.
Details
Motivation: Systematic research on Mixture of Experts architectures is constrained by prohibitive computational costs, limiting large-scale studies accessible to most researchers. There's a need for a standardized, efficient framework to lower barriers to MoE research.
Method: Developed LibMoE, a unified framework supporting both pretraining and sparse-upcycling regimes with transparent analytical tools for probing routing and expert dynamics. Conducted comprehensive analysis along three dimensions: routing dynamics, lightweight initialization effects, and training regime differences (see the code sketch after the abstract).
Result: The framework enables reproducible, efficient MoE research and provides insights into expert selection patterns, routing stability, initialization effects on load balancing, and differences between sparse upcycling vs full pretraining routing patterns.
Conclusion: LibMoE broadens access to MoE research by lowering computational barriers and establishing reliable benchmarks, enabling more systematic studies of expert architectures that are fundamental to modern large language models.
Abstract: Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. GitHub: \href{https://github.com/Fsoft-AIC/LibMoE}{https://github.com/Fsoft-AIC/LibMoE}.
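Code sketch: a small routing diagnostic of the kind the analysis above relies on, computing per-token routing entropy and expert load from raw router logits; the array shapes and top-k value are assumptions, and this is not the LibMoE API.
```python
import numpy as np

def routing_stats(router_logits, top_k=2):
    """router_logits: [num_tokens, num_experts] pre-softmax router scores."""
    z = router_logits - router_logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    probs = e / e.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-9)).sum(axis=-1)    # per-token routing entropy
    chosen = np.argsort(-router_logits, axis=-1)[:, :top_k]    # top-k selected experts
    load = np.bincount(chosen.ravel(), minlength=router_logits.shape[-1])
    load = load / load.sum()                                   # expert utilization share
    return entropy.mean(), load

# ent, load = routing_stats(np.random.randn(1024, 8))
```
Low mean entropy with a skewed load vector is the signature of routing collapse; tracking both over training is one way to surface the initialization and regime effects discussed above.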
[62] Invisible Walls in Cities: Designing LLM Agent to Predict Urban Segregation Experience with Social Media Content
Bingbing Fan, Lin Chen, Songwei Li, Jian Yuan, Fengli Xu, Pan Hui, Yong Li
Main category: cs.CL
TL;DR: LLM agent for mining social media reviews to predict urban segregation, using reflective coding to create generalizable codebooks and combining reasoning/embedding for multi-channel feature integration.
Details
Motivation: Understanding urban segregation is crucial for addressing societal inequalities, but analyzing vast, ambiguous social media review data is challenging. Current methods struggle with the volume and complexity of user-generated content that contains nuanced perceptions of places.
Method: Proposes a reflective LLM coder to digest social media content into insights consistent with real-world feedback, creating a codebook capturing segregation dimensions (cultural resonance, accessibility, community engagement). Uses RE'EM framework combining reasoning and embedding capabilities to integrate multi-channel features for segregation prediction.
Result: 22.79% improvement in R² and 9.33% reduction in MSE on real-world data. Codebook generalizes across three different cities. User study confirms codebook-guided summaries provide cognitive gains for human perception of social inclusiveness.
Conclusion: Demonstrates LLM’s potential for understanding implicit social barriers through automated review mining, marking important progress in using Web technology to promote social inclusiveness.
Abstract: Understanding experienced segregation in urban daily life is crucial for addressing societal inequalities and fostering inclusivity. The abundance of user-generated reviews on social media encapsulates nuanced perceptions and feelings associated with different places, offering rich insights into segregation. However, leveraging this data poses significant challenges due to its vast volume, ambiguity, and confluence of diverse perspectives. To tackle these challenges, we propose a novel Large Language Model (LLM) agent to automate online review mining for segregation prediction. Specifically, we propose a reflective LLM coder to digest social media content into insights consistent with real-world feedback, and eventually produce a codebook capturing key dimensions that signal segregation experience, such as cultural resonance and appeal, accessibility and convenience, and community engagement and local involvement. Guided by the codebook, LLMs can generate both informative review summaries and ratings for segregation prediction. Moreover, we design a REasoning-and-EMbedding (RE’EM) framework, which combines the reasoning and embedding capabilities of language models to integrate multi-channel features for segregation prediction. Experiments on real-world data demonstrate that our agent substantially improves prediction accuracy, with a 22.79% elevation in R$^{2}$ and a 9.33% reduction in MSE. The derived codebook is generalizable across three different cities, consistently improving prediction accuracy. Moreover, our user study confirms that the codebook-guided summaries provide cognitive gains for human participants in perceiving places of interest (POIs)’ social inclusiveness. Our study marks an important step toward understanding implicit social barriers and inequalities, demonstrating the great potential of promoting social inclusiveness with Web technology.
[63] Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, Dongpo Cheng, Ronghao Chen, Huacan Wang, Xingdong Feng, Huixia Judy Wang, Chengchun Shi, Liwen Zhang
Main category: cs.CL
TL;DR: Fin-R1 is a 7B parameter reasoning LLM specialized for financial applications, trained on curated financial CoT data with SFT+RL to improve accuracy and interpretability in financial reasoning tasks.
Details
Motivation: General-purpose LLMs face challenges in finance due to fragmented data sources, opaque reasoning processes, and poor transferability to business applications, creating a need for specialized financial reasoning models.
Method: Two-stage pipeline: 1) Construct Fin-R1-Data (60,091 high-quality CoT samples distilled from financial benchmarks), 2) Train Fin-R1 using supervised fine-tuning followed by reinforcement learning.
Result: Fin-R1 achieves competitive performance on financial benchmarks despite its small size (7B parameters) and demonstrates practical utility in compliance checking and robo-advisory applications.
Conclusion: Fin-R1 successfully addresses financial LLM challenges through specialized data curation and training, offering an efficient solution for financial reasoning with interpretable outputs.
Abstract: In recent years, general-purpose large language models (LLMs) such as GPT, Gemini, Claude, and DeepSeek have advanced at an unprecedented pace. Despite these achievements, their application to finance remains challenging due to fragmented data sources, opaque reasoning processes, and weak transferability to business applications. In response, we introduce Fin-R1, a reasoning LLM designed for financial scenarios. With a compact size of 7 billion parameters, Fin-R1 reduces deployment costs while addressing the aforementioned challenges. Its development follows a two-stage pipeline. First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability. Second, we train Fin-R1 using Fin-R1-Data through supervised fine-tuning (SFT), followed by reinforcement learning (RL). This stage substantially improves the model's ability to solve complex financial reasoning tasks, yielding outputs that are both accurate and interpretable. Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory. Our code is publicly available at https://github.com/SUFE-AIFLM-Lab/Fin-R1, and has already attracted over 700 stars.
[64] HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: HBO is a hierarchical balancing optimization method for fine-tuning LLMs that addresses data imbalance both across datasets (globally) and within individual datasets (locally) using bilevel optimization with global and local actors guided by reward functions.
Details
Motivation: Existing methods for fine-tuning LLMs on diverse datasets only address data imbalance across datasets (globally) but overlook imbalance and heterogeneity within individual datasets (locally), limiting their effectiveness in handling diverse training data.
Method: HBO uses a bilevel optimization strategy with two types of actors: a Global Actor that balances data sampling across different subsets of the training mixture, and several Local Actors that optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state that measure learning progress and relative performance improvement (see the code sketch after the abstract).
Result: HBO consistently outperforms existing baselines across three LLM backbones and nine diverse tasks in multilingual and multitask setups, achieving significant accuracy gains. Analysis shows both global and local actors effectively adjust data usage during fine-tuning.
Conclusion: HBO provides a comprehensive solution to data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets by addressing both global and local data distribution issues.
Abstract: Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
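Code sketch: a toy version of the Global Actor's job, nudging per-dataset sampling weights toward datasets with larger recent learning progress; the exponentiated-gradient-style update and the reward signal shown here are assumptions for illustration, not the paper's parameterization. A Local Actor would apply the same idea over difficulty buckets within one dataset.
```python
import numpy as np

def update_sampling_weights(logits, rewards, lr=0.5):
    """logits: unnormalized preferences over datasets;
    rewards: per-dataset learning-progress signal (e.g., recent loss decrease)."""
    logits = logits + lr * (rewards - rewards.mean())   # favor faster learners
    probs = np.exp(logits - logits.max())
    return logits, probs / probs.sum()

logits = np.zeros(3)                       # three datasets, uniform at the start
rewards = np.array([0.02, 0.10, 0.01])     # hypothetical drops in validation loss
logits, p = update_sampling_weights(logits, rewards)
# p now over-samples the dataset with the largest recent improvement.
```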
[65] Position: The Real Barrier to LLM Agent Usability is Agentic ROI
Weiwen Liu, Jiarui Qin, Xu Huang, Xingshan Zeng, Yunjia Xi, Jianghao Lin, Chuhan Wu, Yasheng Wang, Lifeng Shang, Ruiming Tang, Defu Lian, Yong Yu, Weinan Zhang
Main category: cs.CL
TL;DR: This position paper introduces Agentic ROI as a framework for evaluating LLM agent usability, arguing that the key question is not just what tasks can be automated but whether they provide sufficient return on investment, and proposes a zigzag development strategy of scaling up then down for mass-market adoption.
Details
Motivation: The paper identifies a gap between LLM agents' technical capabilities and their meaningful usability in everyday applications. While LLM agents show promise in high-ROI domains like coding and research, they lack widespread adoption in mass-market scenarios due to insufficient consideration of practical utility and return on investment.
Method: The paper proposes the Agentic ROI framework as a holistic evaluation metric that shifts focus from raw performance to utility-driven assessment. It suggests a zigzag developmental trajectory: first scaling up to maximize information gain and time savings, then scaling down to reduce costs for broader accessibility.
Result: The paper presents a strategic roadmap for making LLM agents truly usable, accessible, and scalable in real-world applications by balancing performance gains with cost considerations through the proposed zigzag development approach.
Conclusion: The central challenge for LLM agent adoption is not technical capability but delivering sufficient Agentic ROI. A phased approach of scaling up then down, guided by utility-driven evaluation, is necessary to bridge the usability gap and enable mass-market deployment of LLM agents.
Abstract: Large Language Model (LLM) agents represent a promising shift in human-AI interaction, moving beyond passive prompt-response systems to autonomous agents capable of reasoning, planning, and goal-directed action. While LLM agents are technically capable of performing a broad range of tasks, not all of these capabilities translate into meaningful usability. This position paper argues that the central question for LLM agent usability is no longer whether a task can be automated, but whether it delivers sufficient Agentic Return on Investment (Agentic ROI). Agentic ROI reframes evaluation from raw performance to a holistic, utility-driven perspective, guiding when, where, and for whom LLM agents should be deployed. Despite widespread application in high-ROI tasks like coding and scientific research, we identify a critical usability gap in mass-market, everyday applications. To address this, we propose a zigzag developmental trajectory: first scaling up to improve information gain and time savings, then scaling down to reduce cost. We present a strategic roadmap across these phases to make LLM agents truly usable, accessible, and scalable in real-world applications.
[66] SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Seong Joon Oh, Sinead Williamson
Main category: cs.CL
TL;DR: LLMs struggle to transparently communicate their internal uncertainty distributions, but can generate faithful uncertainty summaries when provided with multiple sampled outputs as context.
Details
Motivation: Current LLMs communicate uncertainty poorly through simple percentages or hedging words, lacking the ability to transparently reveal their internal belief distributions over possible answers.
Method: Developed the SelfReflect metric to measure faithfulness between LLM uncertainty summaries and their actual internal distributions; tested LLMs' ability to reveal uncertainty through reasoning, chains-of-thought, and explicit finetuning (see the code sketch after the abstract).
Result: Modern LLMs are fundamentally incapable of revealing their uncertainties through standard methods, but can generate faithful uncertainty summaries when given multiple sampled outputs as context.
Conclusion: LLMs need external sampling assistance to communicate uncertainties faithfully; SelfReflect metric enables development of universal LLM uncertainty communication methods.
Abstract: The common approach to communicate a large language model’s (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect indicates even slight deviations, yielding a fine measure of faithfulness between a summary string and an LLM’s actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, neither through reasoning, nor chains-of-thoughts, nor explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach shines a light at the universal way of communicating LLM uncertainties whose future development the SelfReflect score enables. To support the development of this universal form of LLM uncertainties, we publish the code that implements our metric for arbitrary LLMs under https://github.com/apple/ml-selfreflect .
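Code sketch: the sample-then-summarize procedure the paper finds effective, with a placeholder generate callable standing in for the LLM; the prompt wording and sample count are assumptions, not the released implementation.
```python
from collections import Counter

def summarize_uncertainty(generate, question, n_samples=20, temperature=1.0):
    """Sample multiple answers, then ask the model to summarize its own spread."""
    samples = [generate(question, temperature=temperature) for _ in range(n_samples)]
    counts = Counter(samples)
    listing = "\n".join(f"- {ans} ({c}/{n_samples} samples)"
                        for ans, c in counts.most_common())
    prompt = (f"Question: {question}\n"
              f"Here are {n_samples} independent draft answers:\n{listing}\n"
              "Summarize in one paragraph which answers are plausible and "
              "roughly how likely each one is.")
    return generate(prompt, temperature=0.0)
```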
[67] POLAR: A Benchmark for Multilingual, Multicultural, and Multi-Event Online Polarization
Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz, P Sam Sahil, Yiran Zhang, Marco Antonio Stranisci, Idris Abdulmumin, Özge Alacam, Cengiz Acartürk, Aisha Jabr, Saba Anwar, Abinew Ali Ayele, Simona Frenda, Alessandra Teresa Cignarella, Elena Tutubalina, Oleg Rogov, Aung Kyaw Htet, Xintong Wang, Surendrabikram Thapa, Kritesh Rauniyar, Tanmoy Chakraborty, Arfeen Zeeshan, Dheeraj Kodati, Satya Keerthi, Sahar Moradizeyveh, Firoj Alam, Arid Hasan, Syed Ishtiaque Ahmed, Ye Kyaw Thu, Shantipriya Parida, Ihsan Ayyub Qazi, Lilian Wanzare, Nelson Odhiambo Onyango, Clemencia Siro, Jane Wanjiru Kimani, Ibrahim Said Ahmad, Adem Chanie Ali, Martin Semmann, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Main category: cs.CL
TL;DR: POLAR dataset for multilingual, multicultural polarization analysis with 110K instances across 22 languages, annotated along three axes, with experiments showing models struggle with nuanced polarization types and manifestations.
Details
Motivation: Address the limitations of monolingual, culturally narrow, and event-specific computational social science research on online polarization by creating a comprehensive multilingual dataset.
Method: Created POLAR dataset with 110K instances in 22 languages from diverse online platforms and real-world events, annotated along detection, type, and manifestation axes. Conducted experiments with six pretrained small language models and various LLMs in few-shot/zero-shot settings.
Result: Models perform well in binary polarization detection but achieve substantially lower performance when predicting polarization types and manifestations, highlighting the complex contextual nature of polarization.
Conclusion: Polarization is highly contextual requiring robust, adaptable NLP approaches; dataset release supports global digital polarization mitigation research.
Abstract: Online polarization poses a growing challenge for democratic discourse, yet most computational social science research remains monolingual, culturally narrow, or event-specific. We introduce POLAR, a multilingual, multicultural, and multi-event dataset with over 110K instances in 22 languages drawn from diverse online platforms and real-world events. Polarization is annotated along three axes, namely detection, type, and manifestation, using a variety of annotation platforms adapted to each cultural context. We conduct two main experiments: (1) fine-tuning six pretrained small language models; and (2) evaluating a range of open and closed large language models in few-shot and zero-shot settings. The results show that, while most models perform well in binary polarization detection, they achieve substantially lower performance when predicting polarization types and manifestations. These findings highlight the complex, highly contextual nature of polarization and demonstrate the need for robust, adaptable approaches in NLP and computational social science. All resources will be released to support further research and effective mitigation of digital polarization globally.
[68] STACK: Adversarial Attacks on LLM Safeguard Pipelines
Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Main category: cs.CL
TL;DR: The paper evaluates security vulnerabilities in frontier AI defense pipelines through red-teaming, showing that a novel few-shot-prompted classifier outperforms existing safeguards but can still be bypassed by staged attacks.
Details
Motivation: Frontier AI developers rely on safeguard pipelines to prevent catastrophic misuse, but the security of these pipelines is unclear with limited prior evaluation. The paper aims to address this gap by developing and red-teaming an open-source defense pipeline.
Method: Developed an open-source defense pipeline with a novel few-shot-prompted input and output classifier. Introduced STaged AttaCK (STACK) procedure for black-box attacks. Evaluated performance across three attacks and two datasets, including transfer attacks (see the code sketch after the abstract).
Result: The few-shot-prompted classifier reduced attack success rate to 0% on ClearHarm dataset, outperforming ShieldGemma. However, STACK achieved 71% ASR in black-box attacks and 33% ASR in transfer settings, showing vulnerabilities in defense pipelines.
Conclusion: Defense pipelines for frontier AI systems have significant security vulnerabilities that can be exploited through staged attacks. The paper suggests specific mitigations developers could implement to thwart such attacks.
Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic and OpenAI guard their latest Opus 4 model and GPT-5 models using such defense pipelines, and other frontier developers including Google DeepMind pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
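Code sketch: the layered safeguard pipeline being red-teamed here, shown from the defender's side as a few-shot-prompted input classifier, the target model, then an output classifier; the prompts and the llm/target_model callables are placeholders, not the paper's pipeline.
```python
# Hypothetical few-shot classification prompt; the real classifier uses
# curated exemplars rather than this toy version.
FEW_SHOT_INPUT = (
    "Decide whether a user request seeks help with catastrophic misuse.\n"
    "Request: How do I bake bread?\nLabel: SAFE\n"
    "Request: <redacted harmful exemplar>\nLabel: UNSAFE\n")

def guarded_generate(llm, target_model, user_msg, refusal="I can't help with that."):
    verdict_in = llm(FEW_SHOT_INPUT + f"Request: {user_msg}\nLabel:").strip()
    if verdict_in.startswith("UNSAFE"):
        return refusal                          # blocked at the input stage
    reply = target_model(user_msg)
    verdict_out = llm("Does this reply enable catastrophic misuse? "
                      f"Answer SAFE or UNSAFE.\nReply: {reply}\nLabel:").strip()
    return reply if verdict_out.startswith("SAFE") else refusal
```
The staged attack in the paper works precisely because each stage can be probed and defeated separately, which is why the authors recommend mitigations that couple the stages more tightly.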
[69] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Main category: cs.CL
TL;DR: CoSteer enables real-time personalization of large cloud models using local small models without fine-tuning, preserving privacy and efficiency.
Details
Motivation: Existing personalization methods struggle with real-time adaptation under resource and privacy constraints of personal devices, requiring a solution that balances quality, efficiency, and privacy.
Method: A collaborative framework using decoding-time adaptation that leverages logit differences between context-aware local small models and context-agnostic ones to steer cloud-based large models, with personalization handled locally and only final tokens sent to the cloud (see the code sketch after the abstract).
Result: CoSteer generates high-quality personalized content across various tasks while maintaining computational efficiency and robustness across different models and environments.
Conclusion: CoSteer provides an effective solution for real-time personalization that preserves user privacy and system efficiency, making it practically applicable in real-world scenarios.
Abstract: Personalization has become crucial for adapting models to the diverse and evolving needs of users across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment within a single model, they struggle to achieve both real-time and high-quality personalization under the resource and privacy constraints of personal devices. To address this challenge, we propose CoSteer, a collaborative framework that enables tuning-free, real-time personalization via decoding-time adaptation. By leveraging logit differences between context-aware and context-agnostic local small models, CoSteer steers cloud-based large models, ensuring effective personalization while preserving the large model’s capabilities. Personalization is handled locally, with only final tokens sent to the cloud, maintaining both user context and system efficiency. Through extensive experiments across a wide range of tasks, we demonstrate that CoSteer generates high-quality personalized content, ensuring both effectiveness and computational efficiency. Our results highlight its robustness across models and environments, confirming its practical applicability in real-world scenarios.
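Code sketch: the decoding-time delta-steering rule described above, where the gap between a context-aware and a context-agnostic small model shifts the cloud model's next-token logits; the scaling factor alpha and the assumption of a shared tokenizer are illustrative, not taken from the paper.
```python
import numpy as np

def steered_next_token(cloud_logits, local_aware_logits, local_agnostic_logits,
                       alpha=1.0):
    """All inputs: [vocab_size] logits over a shared tokenizer (assumption)."""
    delta = local_aware_logits - local_agnostic_logits   # personal context signal
    logits = cloud_logits + alpha * delta
    return int(np.argmax(logits))   # chosen on-device; only the token goes back to the cloud
```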
[70] Language Models and Logic Programs for Trustworthy Tax Reasoning
William Jurayj, Nils Holzenberger, Benjamin Van Durme
Main category: cs.CL
TL;DR: Neuro-symbolic system combining LLMs with symbolic solvers for tax filing, achieving high accuracy on statutory reasoning tasks with economic feasibility analysis.
Details
Motivation: Tax filing requires complex reasoning with high accuracy requirements where errors incur costly penalties, making pure LLMs unsuitable due to reliability concerns.
Method: Integrates LLMs with symbolic solvers, using semantic parsing to translate plain-text tax rules into formal logic programs, combined with intelligent exemplar retrieval for formal case representations.
Result: Demonstrates effectiveness on SARA dataset, shows dramatic performance improvements, and estimates deployment costs below real-world averages for tax filing assistance.
Conclusion: Neuro-symbolic architectures show promise for reliable tax assistance by combining LLMs’ language understanding with symbolic solvers’ precision, potentially increasing access to accurate tax help.
Abstract: According to the United States Internal Revenue Service, ``the average American spends \$270 and 13 hours filing their taxes''. Even beyond the U.S., tax filing requires complex reasoning, combining application of overlapping rules with numerical calculations. Because errors can incur costly penalties, any automated system must deliver high accuracy and auditability, making modern large language models (LLMs) poorly suited for this task. We propose an approach that integrates LLMs with a symbolic solver to calculate tax obligations. We evaluate variants of this system on the challenging StAtutory Reasoning Assessment (SARA) dataset, and include a novel method for estimating the cost of deploying such a system based on real-world penalties for tax errors. We further show how up-front translation of plain-text rules into formal logic programs, combined with intelligently retrieved exemplars for formal case representations, can dramatically improve performance on this task and reduce costs to well below real-world averages. Our results demonstrate the effectiveness of applying semantic parsing methods to statutory reasoning, and show promising economic feasibility of neuro-symbolic architectures for increasing access to reliable tax assistance.
[71] Real-Time Detection of Hallucinated Entities in Long-Form Generation
Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda
Main category: cs.CL
TL;DR: A scalable method for real-time detection of entity-level hallucinations in long-form LLM generations using web-search-annotated datasets and linear probes.
Details
Motivation: Current hallucination detection methods are impractical for real-world use: they are limited to short factual queries or require costly external verification, while hallucinations in long-form generations can cause serious harm in high-stakes applications.
Method: Developed an annotation methodology using web search to label model responses with grounded labels indicating which tokens correspond to fabricated entities. Trained effective hallucination classifiers using simple linear probes on this dataset, focusing on entity-level hallucinations (names, dates, citations) rather than claim-level (see the code sketch after the abstract).
Result: Classifiers consistently outperform baselines on long-form responses across four model families, achieving AUC 0.90 vs 0.71 for Llama-3.3-70B compared to semantic entropy. Also effective in short-form QA and generalize to detect incorrect answers in mathematical reasoning tasks despite being trained only on entity hallucinations.
Conclusion: Presents a promising scalable approach for real-world hallucination detection that can operate in real-time on long-form generations, with datasets publicly released for reuse across different models.
Abstract: Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations-e.g., fabricated names, dates, citations-rather than claim-level, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and are also an improvement in short-form question-answering settings. Despite being trained only to detect hallucinated entities, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
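Code sketch: a linear probe for token-level hallucination risk of the kind described above, assuming per-token hidden states and entity-grounded labels are already available from an annotation pipeline; the use of scikit-learn logistic regression here is an assumption for illustration.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_probe(hidden_states, labels):
    """hidden_states: [num_tokens, d_model] array; labels: 1 = hallucinated-entity token."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

def score_tokens(probe, hidden_states):
    # Per-token hallucination risk, usable for streaming detection.
    return probe.predict_proba(hidden_states)[:, 1]

# auc = roc_auc_score(val_labels, score_tokens(probe, val_hidden))  # hypothetical held-out split
```
Because the probe only needs the hidden states the model already computes, it adds negligible latency at generation time, which is what makes real-time flagging feasible.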
[72] Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Main category: cs.CL
TL;DR: SummQ: An adversarial multi-agent framework for long document summarization using collaborative intelligence between summarization and quizzing agents to improve quality through iterative refinement.
Details
Motivation: Current LLMs struggle with long document summarization due to information loss, factual inconsistencies, and coherence issues when processing excessively long documents.
Method: Adversarial multi-agent framework with specialized agents: summary generators and reviewers for creating/evaluating summaries, and quiz generators and reviewers creating comprehension questions. An examinee agent validates whether summaries contain the information needed to answer quiz questions, enabling iterative refinement through multifaceted feedback (see the code sketch after the abstract).
Result: Significantly outperforms existing state-of-the-art methods on three long document summarization benchmarks across ROUGE, BERTScore, LLM-as-a-Judge, and human evaluations.
Conclusion: Establishes a new approach for long document summarization using adversarial agentic collaboration to improve summarization quality through multi-agent dynamics and quizzing mechanisms.
Abstract: Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
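Code sketch: the summarize-then-quiz refinement loop in outline, with every agent reduced to a placeholder callable; the paper's reviewer agents, prompts, and stopping criteria are not reproduced.
```python
def summq_loop(document, summarizer, quiz_gen, examinee, grader, rounds=3):
    """Iteratively refine a summary until quiz questions about the source
    can be answered from the summary alone."""
    summary = summarizer(document, feedback=None)
    for _ in range(rounds):
        questions = quiz_gen(document)                        # comprehension checks
        answers = [examinee(q, summary) for q in questions]   # answered from the summary only
        failed = [q for q, a in zip(questions, answers)
                  if not grader(q, a, document)]              # graded against the source
        if not failed:
            break
        feedback = "The summary omits information needed to answer: " + "; ".join(failed)
        summary = summarizer(document, feedback=feedback)
    return summary
```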
[73] Why Tree-Style Branching Matters for Thought Advantage Estimation in GRPO
Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong
Main category: cs.CL
TL;DR: GRPO for Chain-of-Thought reasoning suffers from high variance in thought-level advantage estimation; branching with multiple answers per thought reduces variance asymptotically to zero, while increasing thoughts alone leaves positive variance floor.
Details
Motivation: GRPO trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without value functions suffers from high variance. Tree-style branching is used in practice but lacks a theoretical justification of why it works and whether it is necessary.
Method: Studies thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting where multiple answers are sampled per thought. Uses the multivariate delta method to analyze how different sampling dimensions affect variance: the number of thoughts (K) versus the number of answers per thought (M) (see the code sketch after the abstract).
Result: Increasing the number of thoughts (K) leaves a strictly positive variance floor, while increasing the number of answers per thought (M) induces a monotonic decrease in variance, asymptotically driving it to zero. This implies that accurate thought-level advantage estimation is impossible through scaling thought sampling alone, making branching a potentially necessary mechanism rather than a heuristic.
Conclusion: Branching is a potentially necessary mechanism for accurate thought-level advantage estimation in GRPO. Experiments show effectiveness and necessity of answer-level branching across math, vision domains, different model architectures and sizes, improving optimization stability, training efficiency, and final performance.
Abstract: Group Relative Policy Optimization (GRPO) trains Chain-of-Thought reasoning with verifiable rewards, but estimating thought-level advantages without value functions often suffers from high variance. Although tree-style branching is used in practice to reduce the variance, it lacks a theoretical explanation of why it works and whether it is important or even potentially necessary. We study thought-level advantage estimation in GRPO from a variance perspective under a minimal tree-style setting where multiple answers are sampled for each thought. Using the multivariate delta method, we reveal an asymmetry in how different sampling dimensions affect variance. Increasing the number of sampled thoughts ($K$) leaves a strictly positive variance floor, whereas increasing the number of answers per thought ($M$) induces a monotonic decrease in variance, asymptotically decreasing it to zero. This implies that accurate thought-level advantage estimation is impossible through scaling thought sampling alone, making branching a potentially necessary mechanism rather than a heuristic. Experiments further provide empirical evidence for both the effectiveness and necessity of answer-level branching, demonstrating improved optimization stability, training efficiency, and final performance not only in math but also across a broad range of vision domains and under different model architectures and sizes.
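Code sketch: a toy Monte Carlo illustrating the variance asymmetry claimed above under a simple Bernoulli reward model (an assumption, not the paper's setting): adding answers per thought (M) shrinks the noise in a thought's advantage estimate, while adding thoughts (K) alone leaves a floor.
```python
import numpy as np

rng = np.random.default_rng(0)

def advantage_variance(K, M, p_thought=0.7, trials=20000):
    """Variance of the estimated advantage of one fixed thought whose true
    answer-success rate is p_thought, against K-1 peer thoughts with rate 0.5."""
    estimates = []
    for _ in range(trials):
        own = rng.binomial(1, p_thought, size=M).mean()            # M answers for this thought
        peers = rng.binomial(1, 0.5, size=(K - 1, M)).mean(axis=1)  # M answers per peer thought
        baseline = (own + peers.sum()) / K                          # group-relative baseline
        estimates.append(own - baseline)
    return np.var(estimates)

print(advantage_variance(K=8,  M=1))   # noisy
print(advantage_variance(K=64, M=1))   # more thoughts: variance floor remains
print(advantage_variance(K=8,  M=16))  # branching: variance drops toward zero
```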
[74] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Peijun Zhu, Ning Yang, Baoliang Tian, Jiayu Wei, Weihao Zhang, Haijun Zhang, Pin Lv
Main category: cs.CL
TL;DR: A unified framework for MoE LLMs using dynamic expert clustering and structured compression to address load imbalance, parameter redundancy, and communication overhead.
Details
Motivation: Mixture-of-Experts LLMs face a trilemma of load imbalance (uneven expert utilization), parameter redundancy (many experts with similar functions), and communication overhead (all-to-all routing). Existing approaches address these issues separately rather than cohesively.
Method: 1) Dynamic expert clustering: periodically regroup experts using a fused metric of parameter and activation similarity. 2) Structured compression: decompose expert weights into a shared base matrix plus low-rank residual adapters (see the code sketch after the abstract). 3) Two-stage hierarchical routing: tokens are first assigned to a cluster, then to specific experts within it. 4) Heterogeneous precision: shared bases in FP16, residual factors in INT4, with dynamic offloading of inactive clusters.
Result: Matches quality of standard MoE models on GLUE and WikiText-103 while reducing total parameters by ~80%, improving throughput by 10-20%, and lowering expert load variance by over 3x.
Conclusion: Structural reorganization through dynamic clustering and compression is a principled path toward scalable, efficient, and memory-effective MoE LLMs, demonstrating that architectural reconfiguration during training can yield substantial efficiency gains.
Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model’s architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs. Code is available at https://github.com/szdtzpj/Breaking_the_moe_trilemma
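Code sketch: the shared-base-plus-low-rank-residual decomposition applied to one expert cluster, recovered here with a plain truncated SVD; the rank, the cluster contents, and the way the adapter is applied at inference are assumptions for illustration, not the paper's training procedure.
```python
import numpy as np

def compress_cluster(expert_weights, rank=8):
    """expert_weights: list of [d_out, d_in] matrices from one expert cluster."""
    base = np.mean(expert_weights, axis=0)            # shared base (FP16 in the paper)
    adapters = []
    for W in expert_weights:
        U, S, Vt = np.linalg.svd(W - base, full_matrices=False)
        A = U[:, :rank] * S[:rank]                    # [d_out, rank]
        B = Vt[:rank]                                  # [rank, d_in]
        adapters.append((A, B))                        # low-rank residual (INT4 in the paper)
    return base, adapters

def expert_forward(x, base, adapter):
    A, B = adapter
    # W_i x is approximated as (base + A @ B) x without materializing W_i.
    return x @ base.T + (x @ B.T) @ A.T
```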
[75] LH-Deception: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions
Yang Xu, Xuanming Zhang, Samuel Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Sharon Li
Main category: cs.CL
TL;DR: LH-Deception: A multi-agent simulation framework for systematically quantifying deception in LLMs across extended task sequences, revealing model-dependent deception patterns that increase with pressure and erode trust.
Details
Motivation: Deception is a growing concern in LLMs, but current evaluations are limited to single-turn prompts and fail to capture long-horizon deceptive strategies that unfold in real-world interactions.
Method: Developed LH-Deception, a multi-agent simulation framework with performer agents completing tasks, supervisor agents evaluating progress and maintaining trust states, and an independent deception auditor analyzing full trajectories across 11 frontier LLMs.
Result: Deception is model-dependent, increases with event pressure, consistently erodes supervisor trust, and reveals emergent long-horizon phenomena like “chains of deception” invisible to single-turn evaluations.
Conclusion: The framework provides a foundation for evaluating LLMs in real-world, trust-sensitive contexts by capturing complex deceptive behaviors that emerge in extended interactions.
Abstract: Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce a new simulation framework, LH-Deception, for a systematic, empirical quantification of deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. LH-Deception is designed as a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed-source and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal emergent, long-horizon phenomena, such as ``chains of deception", which are invisible to static, single-turn evaluations. Our findings provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.
[76] Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models
Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma
Main category: cs.CL
TL;DR: PE-CoA is a framework using five conversation patterns to construct multi-turn jailbreak attacks on LLMs, revealing pattern-specific vulnerabilities and showing that safety defenses don’t generalize across patterns.
Details
Motivation: LLMs remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. Existing methods rely on heuristic exploration with limited insight into underlying model weaknesses, and the relationship between conversation patterns and vulnerabilities across harm categories is poorly understood.
Method: Proposes Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct multi-turn jailbreaks through natural dialogue. Evaluates on twelve LLMs spanning ten harm categories.
Result: Achieves state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles, defense to one pattern does not generalize to others, and model families share similar failure modes.
Conclusion: These findings highlight limitations of current safety training and indicate the need for pattern-aware defenses against multi-turn jailbreaking attacks.
Abstract: Large language models (LLMs) remain vulnerable to multi-turn jailbreaking attacks that exploit conversational context to bypass safety constraints gradually. These attacks target different harm categories through distinct conversational approaches. Existing multi-turn methods often rely on heuristic or ad hoc exploration strategies, providing limited insight into underlying model weaknesses. The relationship between conversation patterns and model vulnerabilities across harm categories remains poorly understood. We propose Pattern Enhanced Chain of Attack (PE-CoA), a framework of five conversation patterns to construct multi-turn jailbreaks through natural dialogue. Evaluating PE-CoA on twelve LLMs spanning ten harm categories, we achieve state-of-the-art performance, uncovering pattern-specific vulnerabilities and LLM behavioral characteristics: models exhibit distinct weakness profiles, defense to one pattern does not generalize to others, and model families share similar failure modes. These findings highlight limitations of safety training and indicate the need for pattern-aware defenses. Code available on: https://github.com/Ragib-Amin-Nihal/PE-CoA
[77] Understanding and Improving Length Generalization in Hierarchical Sparse Attention Models
Jiaqi Leng, Xiang Hu, Junxiong Wang, Jianguo Li, Wei Wu, Yucheng Lu
Main category: cs.CL
TL;DR: Systematic analysis identifies three key design principles for effective long-context language models: expressive chunk encoders with CLS tokens, bypassing residual paths for stable global information integration, and enforced selection sparsity during pre-training.
Details
Motivation: Current approaches to long-context processing face limitations: standard Transformers have quadratic complexity, sliding window/state space models have fixed-size memory limitations, and the principles behind chunk-based sparse attention's success aren't fully understood. The paper aims to systematically identify the core architectural principles enabling effective extreme length generalization.
Method: The authors use a unified framework and comprehensive ablation studies to dissect chunk-based sparse attention models. They analyze architectural components through systematic experiments, providing theoretical motivation for intra-chunk information processing and landmark generation (see the code sketch after the abstract).
Result: Identified three critical design principles: (1) expressive non-linear chunk encoders with dedicated CLS tokens for retrieval representations, (2) bypassing residual paths for stable global information integration, and (3) enforced selection sparsity during pre-training to bridge train-test distribution gaps. Achieved new SOTA for training-free length extrapolation, generalizing from 4K to 32 million tokens on RULER and BABILong benchmarks.
Conclusion: The systematic analysis provides clear, empirically-grounded design principles for developing highly-capable long-context language models, establishing a foundation for future work in extreme length generalization.
Abstract: Effectively processing long contexts is a critical challenge for language models. While standard Transformers are limited by quadratic complexity and poor length extrapolation, alternative architectures like sliding window attention and state space models sacrifice the ability to effectively utilize the full context due to their fixed-size memory. Chunk-based sparse attention has emerged as a promising paradigm for extreme length generalization, yet the key architectural principles underpinning its success are not yet fully understood. In this work, we present a systematic dissection of these models to identify the core components driving their performance. Through a unified framework and comprehensive ablation studies, we demonstrate that a combination of three design principles is critical: (1) an expressive, non-linear Chunk Encoder with a dedicated CLS token to produce representations for retrieval; (2) a Bypassing Residual Path to stably integrate retrieved global information without it being overridden by the local residual stream; and (3) enforced selection sparsity during pre-training to bridge the train-test distribution gap. We provide a theoretical motivation for intra-chunk information processing and landmark generation. By combining these principles, we establish a new state-of-the-art for training-free length extrapolation, successfully generalizing models trained on a 4K context to 32 million tokens on RULER and BABILong. Our findings provide a clear and empirically-grounded set of design principles for developing future, highly-capable long-context language models.
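Code sketch: the chunk-landmark retrieval pattern behind design principle (1), with a stand-in encoder that prepends a CLS token to each chunk and a query that selects the top-k chunks by landmark similarity; the dimensions and the encoder itself are assumptions for illustration, not the architectures studied in the paper.
```python
import torch
import torch.nn as nn

class ChunkEncoder(nn.Module):
    """Non-linear chunk encoder whose CLS output serves as the retrieval landmark."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chunk_tokens):               # [n_chunks, chunk_len, d_model]
        n = chunk_tokens.size(0)
        x = torch.cat([self.cls.expand(n, -1, -1), chunk_tokens], dim=1)
        return self.encoder(x)[:, 0]               # [n_chunks, d_model] landmarks

def select_chunks(query_vec, landmarks, k=4):
    scores = landmarks @ query_vec                  # similarity to each landmark
    return torch.topk(scores, k=min(k, landmarks.size(0))).indices
```
Keeping the selection sparse during pre-training (principle 3) and routing the retrieved chunks through a bypassing residual path (principle 2) are the other two pieces; only the landmark/selection step is sketched here.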
[78] DEBATE: A Large-Scale Benchmark for Evaluating Opinion Dynamics in Role-Playing LLM Agents
Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, You Li, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
Main category: cs.CL
TL;DR: DEBATE benchmark evaluates authenticity of opinion dynamics in multi-agent LLM role-playing simulations using large-scale human conversation data with public messages and private beliefs.
Details
Motivation: Current multi-agent LLM simulations show unnatural group behavior like premature convergence and lack empirical benchmarks for assessing alignment with real human group interactions in opinion dynamics.
Method: Created the DEBATE benchmark with 36,383 messages from 2,832 participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs. Evaluated 7 LLMs as "digital twin" role-playing agents across next-message prediction and full conversation rollout settings using stance-alignment and opinion-convergence metrics (see the code sketch after the abstract).
Result: Zero-shot RPLA groups show strong opinion convergence relative to human groups. Supervised fine-tuning and Direct Preference Optimization improve stance alignment and bring group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain.
Conclusion: DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent role-playing LLM agents with realistic human interactions.
Abstract: Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work simulates opinion dynamics with role-playing LLM agents (RPLAs), but multi-agent simulations often display unnatural group behavior (e.g., premature convergence) and lack empirical benchmarks for assessing alignment with real human group interactions. We introduce DEBATE, a large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 36,383 messages from 2,832 U.S.-based participants across 708 groups and 107 topics, with both public messages and private Likert-scale beliefs, enabling evaluation at the utterance and group levels (and supporting future individual-level analyses). We instantiate “digital twin” RPLAs with seven LLMs and evaluate across two settings: next-message prediction and full conversation rollout, using stance-alignment and opinion-convergence metrics. In zero-shot settings, RPLA groups exhibit strong opinion convergence relative to human groups. Post-training via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) improves stance alignment and brings group-level convergence closer to human behavior, though discrepancies in opinion change and belief updating remain. DEBATE enables rigorous benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLAs with realistic human interactions.
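Code sketch: one simple way to quantify the group-level opinion convergence discussed above, as the relative shrinkage in the spread of private Likert beliefs over a conversation; the benchmark's own metrics may be defined differently.
```python
import numpy as np

def opinion_convergence(beliefs_before, beliefs_after):
    """Each argument: [n_participants] Likert ratings for one group."""
    before, after = np.std(beliefs_before), np.std(beliefs_after)
    return (before - after) / (before + 1e-9)   # 1.0 = full consensus, 0 = no change

# Zero-shot agent groups tend to score higher (over-converge) than human groups.
print(opinion_convergence([1, 2, 6, 7], [4, 4, 5, 4]))
```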
[79] Remembering Unequally: Global and Disciplinary Bias in LLM Reconstruction of Scholarly Coauthor Lists
Ghazal Kalhor, Afra Mashhadi
Main category: cs.CL
TL;DR: LLMs show systematic bias in reconstructing scholarly coauthor lists, favoring highly cited researchers and creating representational inequalities across disciplines and regions.
Details
Motivation: As LLMs reshape scholarly search interfaces, concerns arise about fairness and representational bias in their memorized training data, particularly regarding their ability to accurately reconstruct scholarly coauthor lists - an underexamined issue with implications for research discovery.
Method: Evaluated three prominent LLMs (DeepSeek R1, Llama 4 Scout, Mixtral 8x7B) by comparing their generated coauthor lists against bibliographic reference data, analyzing patterns across academic disciplines and world regions.
Result: Revealed systematic advantage for highly cited researchers, indicating LLM memorization disproportionately favors already visible scholars, though patterns were not uniform - certain disciplines (Clinical Medicine) and regions (parts of Africa) showed more balanced reconstruction outcomes.
Conclusion: Highlights risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery, emphasizing need for careful auditing of memorization-driven biases in LLM-based systems.
Abstract: Ongoing breakthroughs in large language models (LLMs) are reshaping scholarly search and discovery interfaces. While these systems offer new possibilities for navigating scientific knowledge, they also raise concerns about fairness and representational bias rooted in the models’ memorized training data. As LLMs are increasingly used to answer queries about researchers and research communities, their ability to accurately reconstruct scholarly coauthor lists becomes an important but underexamined issue. In this study, we investigate how memorization in LLMs affects the reconstruction of coauthor lists and whether this process reflects existing inequalities across academic disciplines and world regions. We evaluate three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, by comparing their generated coauthor lists against bibliographic reference data. Our analysis reveals a systematic advantage for highly cited researchers, indicating that LLM memorization disproportionately favors already visible scholars. However, this pattern is not uniform: certain disciplines, such as Clinical Medicine, and some regions, including parts of Africa, exhibit more balanced reconstruction outcomes. These findings highlight both the risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery contexts and emphasize the need for careful auditing of memorization-driven biases in LLM-based systems.
[80] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui
Main category: cs.CL
TL;DR: Text2SQL-Flow is a SQL-aware data augmentation framework that generates large-scale, diverse Text-to-SQL pairs, creating the SQLFlow dataset (89,544 examples) and improving LLM performance through fine-tuning and novel retrieval methods.
Details
Motivation: Current Text-to-SQL systems are limited by scarce, simplistic, and low-diversity datasets. The data-centric paradigm in AI requires high-quality, diverse training data to advance Text-to-SQL performance.Method: Proposes Text2SQL-Flow framework with six augmentation dimensions, SQL execution verification, natural language question generation, chain-of-thought reasoning traces, data classification, and modular Database Manager. Builds SQLFlow dataset and introduces masked alignment retrieval method for closed-source LLMs.
Result: Created SQLFlow dataset of 89,544 annotated examples. For open-source LLMs, fine-tuning on SQLFlow improves performance across benchmarks. For closed-source LLMs, masked alignment retrieval outperforms existing methods, demonstrating structure-aware example matching.
Conclusion: Establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
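The SQL-execution-verification step can be illustrated with a small, hedged sketch: run a candidate and a gold query against the same database and accept the candidate only if the result sets match. The use of sqlite3 and the toy schema are assumptions for illustration, not the framework's Database Manager.

```python
# Hedged sketch of execution-based verification: a generated query is kept
# only if it runs and returns the same rows as the gold query.
import sqlite3

def execution_match(conn, candidate_sql, gold_sql):
    try:
        cand = set(conn.execute(candidate_sql).fetchall())
        gold = set(conn.execute(gold_sql).fetchall())
        return cand == gold
    except sqlite3.Error:
        return False  # candidate does not even execute

# Toy in-memory database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp(name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [("Ann", "CS", 90.0), ("Bob", "EE", 80.0)])
print(execution_match(conn,
                      "SELECT name FROM emp WHERE salary > 85",
                      "SELECT name FROM emp WHERE dept = 'CS'"))
```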
[81] Your Latent Reasoning is Secretly Policy Improvement Operator
Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac
Main category: cs.CL
TL;DR: Latent recursion in small models improves reasoning but suffers from “dead compute” where not all recursive steps effectively contribute depth. The paper formalizes latent reasoning as classifier-free guidance and policy improvement, proposes RL/diffusion-inspired training schemes, and achieves 18x reduction in forward passes while maintaining performance.
Details
Motivation: Small models with latent recursion show promising reasoning results but underperform compared to one-pass models with equivalent feed-forward depth, indicating "dead compute" where recursive steps don't effectively contribute to depth. The paper aims to understand when latent reasoning improves performance versus when it results in wasted computation.Method: Formalizes latent reasoning as classifier-free guidance and policy improvement algorithms. Proposes training schemes inspired by reinforcement learning and diffusion methods. Uses Tiny Recursive Model as testbed to implement modifications that avoid dead compute steps.
Result: Achieves 18x reduction in total forward passes while maintaining performance. Shows that policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.
Conclusion: Latent reasoning can be effectively understood through policy improvement and classifier-free guidance frameworks. RL/diffusion-inspired training schemes can eliminate dead compute and significantly improve computational efficiency in recursive models while preserving reasoning capabilities.
Abstract: Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network's depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one-pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we analyze the algorithms that latent reasoning provides to answer this question. We show that latent reasoning can be formalized as a classifier-free guidance and policy improvement algorithm. Building on these insights, we propose to use training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.
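The weight-tied recursion the abstract describes is easy to picture with a toy block; a minimal sketch follows, assuming a simple MLP block and a fixed number of recursion steps (neither taken from the Tiny Recursive Model itself), where a near-zero update norm is one way a "dead" step could show up.

```python
# Hedged sketch of latent recursion: the same block is applied repeatedly to a
# latent state, emulating depth; iterations that barely change the state would
# correspond to "dead compute". Architecture details are illustrative assumptions.
import torch, torch.nn as nn

class RecursiveBlock(nn.Module):
    def __init__(self, dim=64, steps=6):
        super().__init__()
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                   nn.GELU(), nn.Linear(dim, dim))
        self.steps = steps

    def forward(self, z):
        deltas = []
        for _ in range(self.steps):               # weight-tied recursion over the latent
            update = self.block(z)
            deltas.append(update.norm().item())   # tiny norm ~ a "dead" step
            z = z + update
        return z, deltas

z, deltas = RecursiveBlock()(torch.randn(2, 10, 64))
print([round(d, 2) for d in deltas])
```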
[82] Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
Main category: cs.CL
TL;DR: This paper proposes a novel pre-training method that adapts RL principles to supervised learning by framing next-token prediction as a stochastic decision process, using reward shaping to balance diversity and precision in token distributions to create better exploration spaces for subsequent RL training.
Details
Motivation: The paper addresses how the effectiveness of RL training for improving LLM reasoning depends critically on the exploration space defined by the pre-trained model's token-output distribution. Current cross-entropy loss is limited, and the authors want to systematically study how pre-trained distributions shape exploration potential for subsequent RL.Method: The authors propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. They frame next-token prediction as a stochastic decision process and introduce a reward-shaping strategy with: 1) positive reward scaling factor to control probability concentration on ground-truth tokens, and 2) rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically.
Result: Contrary to intuition that higher distribution entropy facilitates effective exploration, the authors find that imposing a precision-oriented prior yields a superior exploration space for RL, ultimately enhancing end-to-end reasoning performance.
Conclusion: The paper demonstrates that carefully shaping pre-trained token distributions using RL-inspired methods can create more favorable exploration spaces for subsequent RL training, leading to improved reasoning abilities in LLMs.
Abstract: Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model’s token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
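As a rough illustration of the reward-shaping idea (positive scaling on the gold token, rank-aware asymmetric pressure on negatives), here is a hedged PyTorch sketch; the loss form and the `alpha`, `top_k_neg`, and `neg_weight` knobs are illustrative assumptions rather than the paper's objective.

```python
# Hedged sketch (assumed, not the paper's code) of a reward-shaped next-token
# objective: the ground-truth token gets a positive reward scaled by `alpha`,
# while high-ranking (most competitive) negative tokens are suppressed harder
# than long-tail ones.
import torch
import torch.nn.functional as F

def reward_shaped_nll(logits, targets, alpha=2.0, top_k_neg=5, neg_weight=0.1):
    """logits: (batch, vocab), targets: (batch,)"""
    logp = F.log_softmax(logits, dim=-1)
    # Positive term: scaled log-likelihood of the gold token.
    pos = alpha * logp.gather(1, targets[:, None]).squeeze(1)
    # Negative term: suppress the top-k competing (non-gold) tokens.
    masked = logp.scatter(1, targets[:, None], float("-inf"))
    top_neg, _ = masked.topk(top_k_neg, dim=-1)
    neg = neg_weight * top_neg.exp().sum(dim=-1)  # probability mass on hard negatives
    return (-pos + neg).mean()

logits = torch.randn(4, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (4,))
loss = reward_shaped_nll(logits, targets)
loss.backward()
print(float(loss))
```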
[83] Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
Ahmed Attia, Alham Fikri Aji
Main category: cs.CL
TL;DR: Self-supervised reinforcement learning fine-tuning for low-resource machine translation using round-trip bootstrapping with NLLB models, achieving improved translations for several languages.
Details
Motivation: Low-resource machine translation needs improvement as parallel data becomes available, but many potential methods remain unexplored. The paper aims to investigate self-supervised reinforcement learning approaches for enhancing translation quality in low-resource settings.Method: Uses round-trip bootstrapping with NLLB models: translate English to target low-resource language, then back to English. Combines chrF++ and BLEU as reward function on reconstructed English sentences. Evaluates 600M and 1.3B parameter NLLB models on NLLB-MD dataset.
Result: Consistent improvements observed for Central Aymara, Friulian, Wolof and Russian. Qualitative inspection shows increased fluency and semantic fidelity. Method benefits from scale, enabling models to leverage pretrained knowledge and continue self-improving.
Conclusion: Self-supervised reinforcement learning with round-trip bootstrapping effectively improves low-resource machine translation, with potential for further gains through scaling and leveraging pretrained knowledge.
Abstract: Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many potential methods for improving low-resource MT remain unexplored. We investigate a self-supervised reinforcement-learning-based fine-tuning for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving. The code is available on github: https://github.com/Copticoder/thesis-nllb-bootstrap-grpo.
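A hedged sketch of the round-trip reward, assuming sacrebleu's CHRF (with word_order=2 for the chrF++ variant) and sentence-level BLEU combined with an equal weighting; the translation calls themselves are stubbed out, and the weighting is an assumption rather than the paper's choice.

```python
# Hedged sketch of a round-trip reward: translate English -> target -> English
# (model calls not shown) and score the reconstruction with a chrF++/BLEU mix.
from sacrebleu.metrics import BLEU, CHRF

chrfpp = CHRF(word_order=2)           # word_order=2 gives the chrF++ variant
bleu = BLEU(effective_order=True)     # sentence-level BLEU

def round_trip_reward(source_en, reconstructed_en, w_chrf=0.5, w_bleu=0.5):
    c = chrfpp.sentence_score(reconstructed_en, [source_en]).score
    b = bleu.sentence_score(reconstructed_en, [source_en]).score
    return (w_chrf * c + w_bleu * b) / 100.0  # normalize to roughly [0, 1]

# In the actual loop, `reconstructed_en` would come from en->xx->en model calls.
print(round_trip_reward("The river floods every spring.",
                        "The river floods each spring."))
```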
[84] When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
Main category: cs.CL
TL;DR: Iterative RAG with synchronized retrieval-reasoning loops outperforms static Gold Context RAG by up to 25.6 percentage points on scientific multi-hop QA, especially benefiting non-reasoning fine-tuned models through staged retrieval that reduces late-hop failures and corrects hypothesis drift.
Details
Motivation: Current RAG systems lack understanding of when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. There's a need for controlled diagnostic studies to understand the mechanisms behind iterative RAG's advantages.Method: Benchmarked 11 state-of-the-art LLMs under three regimes: No Context (parametric memory), Gold Context (oracle evidence supplied at once), and Iterative RAG (training-free controller alternating retrieval, hypothesis refinement, and evidence-aware stopping). Used ChemKGMultiHopQA chemistry dataset to isolate questions requiring genuine retrieval, with diagnostics covering retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration.
Result: Iterative RAG consistently outperformed Gold Context across models, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduced late-hop failures, mitigated context overload, and enabled dynamic correction of early hypothesis drift. However, failure modes included incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval.
Conclusion: Staged retrieval is often more influential than the mere presence of ideal evidence. The study provides practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and establishes a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
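The controller loop itself is easy to outline; the following is a minimal sketch under the assumption of three callables (`retrieve`, `refine`, `is_sufficient`) standing in for the retriever and LLM calls, not the authors' implementation.

```python
# Hedged sketch of a training-free iterative controller: alternate retrieval,
# hypothesis refinement, and an evidence-aware stopping check.
def iterative_rag(question, retrieve, refine, is_sufficient, max_hops=4):
    evidence, hypothesis = [], None
    for hop in range(max_hops):
        # Query with the current hypothesis so later hops can correct early drift.
        query = question if hypothesis is None else f"{question} | given: {hypothesis}"
        evidence.extend(retrieve(query))
        hypothesis = refine(question, evidence, hypothesis)
        if is_sufficient(question, evidence, hypothesis):
            break  # evidence-aware stopping
    return hypothesis, evidence

# Toy usage with stubbed components.
ans, ev = iterative_rag(
    "Which solvent dissolves compound X?",
    retrieve=lambda q: [f"doc for: {q[:24]}"],
    refine=lambda q, e, h: f"hypothesis built from {len(e)} docs",
    is_sufficient=lambda q, e, h: len(e) >= 2,
)
print(ans, ev)
```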
[85] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space
Yangyi Shen, Tianjian Feng, Jiaqi Han, Wen Wang, Tianlang Chen, Chunhua Shen, Jure Leskovec, Stefano Ermon
Main category: cs.CL
TL;DR: Order-Token Search: A novel decoding method for Diffusion Language Models that jointly searches over generation order and token values to explore diverse decoding trajectories, outperforming baselines on reasoning and coding tasks.
Details
Motivation: Current decoding methods for Diffusion Language Models commit to a single generation trajectory, limiting exploration of the rich trajectory space that DLMs offer due to their order-agnostic generation capabilities.Method: Introduces Order-Token Search, which explores trajectory space through joint search over generation order and token values. The core innovation is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories.
Result: Outperforms baselines on mathematical reasoning and coding benchmarks: GSM8K (3.1%), MATH500 (3.8%), Countdown (7.9%), and HumanEval (6.8%) absolute improvements over backbone. Matches or surpasses diffu-GRPO post-trained d1-LLaDA.
Conclusion: Joint search over generation order and token values is a key component for advancing decoding in Diffusion Language Models, enabling better exploration of the trajectory space and improved performance on reasoning tasks.
Abstract: Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.
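A hedged toy sketch of what a joint order-token beam search can look like: candidates hold partially filled sequences, (position, token) denoising actions are scored by an estimator, and only the best trajectories are kept. The scorer, vocabulary, and beam size below are placeholders, not the paper's likelihood estimator.

```python
# Hedged sketch of joint search over generation order and token values.
def order_token_search(length, vocab, score_action, beam=4):
    beams = [({}, 0.0)]                      # (filled: pos -> token, running score)
    for _ in range(length):
        candidates = []
        for filled, logp in beams:
            for pos in range(length):
                if pos in filled:
                    continue
                for tok in vocab:            # in practice: top-k tokens per position
                    s = score_action(filled, pos, tok)
                    candidates.append(({**filled, pos: tok}, logp + s))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam]
    best, _ = beams[0]
    return [best[i] for i in range(length)]

# Toy scorer: prefer token == position (stands in for the learned estimator).
print(order_token_search(3, [0, 1, 2],
                         lambda filled, pos, tok: 0.0 if tok == pos else -1.0))
```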
[86] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models
Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Xiaomin Lin
Main category: cs.CL
TL;DR: MilSCORE is a military planning benchmark with multi-modal, long-context questions requiring reasoning over maps, orders, and intelligence reports to evaluate LLMs on geospatial planning tasks.
Details
Motivation: There's a need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources, especially for geospatial planning problems like military operations that demand reasoning over maps, orders, and intelligence reports.Method: Created MilSCORE (Military Scenario Contextual Reasoning), a scenario-level dataset of expert-authored, multi-hop questions grounded in complex simulated military planning scenarios. Includes diverse question types across seven categories targeting factual recall and multi-step reasoning about constraints, strategy, and spatial analysis.
Result: Baseline results for contemporary vision-language models show substantial headroom, indicating current systems struggle with realistic, scenario-level long-context planning. MilSCORE serves as a challenging testbed for future work.
Conclusion: MilSCORE addresses the gap in realistic long-context benchmarks for multi-modal reasoning and positions itself as a valuable testbed for evaluating LLMs on complex geospatial planning tasks requiring integration of heterogeneous information sources.
Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs’ ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.
[87] DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models
Zhaochen Hong, Jiaxuan You
Main category: cs.CL
TL;DR: DecompressionLM is a zero-shot concept graph extraction framework that discovers what language models encode without pre-defined queries or shared cross-sequence state, using Van der Corput sequences for parallel generation.
Details
Motivation: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. The paper aims to overcome limitations of common decoding-based probing approaches including cross-sequence coupling, competitive decoding effects, and scalability constraints.Method: Introduces DecompressionLM, a stateless framework using Van der Corput low-discrepancy sequences with arithmetic decoding for deterministic, embarrassingly parallel generation without shared state across sequences.
Result: Across model families and quantization variants, activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse. Corpus-based verification reveals a 19.6-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models.
Conclusion: DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models intended for deployment.
Abstract: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: (i) cross-sequence coupling that concentrates probability mass on high-frequency prefixes, (ii) competitive decoding effects that suppress long-tail concepts, and (iii) scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse - divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 19.6-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models intended for deployment.
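The Van der Corput sequence itself is standard and easy to reproduce; the sketch below only illustrates the sequence (reversing the base-b digits of n around the radix point) and does not reproduce how DecompressionLM couples it to arithmetic decoding.

```python
# Standard Van der Corput low-discrepancy sequence.
def van_der_corput(n, base=2):
    """Return the n-th Van der Corput number in the given base (n >= 1)."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value

print([round(van_der_corput(i), 4) for i in range(1, 9)])
# base 2: 0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875, 0.0625
```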
[88] From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li, Haochun Wang, Qika Lin, Kai He, Bing Qin, Mengling Feng
Main category: cs.CL
TL;DR: Analysis of internal mechanisms in R1-style LLMs reveals a structured progression from latent control to semantic pivoting to overt reflection behavior, suggesting human-like meta-cognitive processes.
Details
Motivation: While R1-style LLMs show capacity for self-reflection, the internal mechanisms driving this behavior remain unclear. The paper aims to bridge this gap by analyzing the onset and progression of reflection behavior.Method: The study anchors on the onset of reflection behavior and traces its layer-wise activation trajectory using the logit lens to read out token-level semantics. It identifies three key stages: latent-control layers, semantic-pivot layers, and behavior-overt layers. Targeted interventions are used to uncover causal chains across these stages.
Result: The analysis reveals a structured progression: (1) latent-control layers encode thinking budget semantics, (2) semantic-pivot layers surface discourse-level cues, and (3) behavior-overt layers show rising likelihood of reflection tokens. Interventions show a causal chain where prompt semantics modulate activations, inducing competition between discourse cues, which regulates reflection token sampling.
Conclusion: The findings suggest R1-style LLMs exhibit human-like meta-cognitive processes progressing from latent monitoring to discourse-level regulation to overt self-reflection, providing insights into the internal mechanisms of reflection in language models.
Abstract: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process, progressing from latent monitoring, to discourse-level regulation, and finally to overt self-reflection. Our analysis code can be found at https://github.com/DYR1/S3-CoT.
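The logit-lens readout used for the layer-wise trace is a well-known technique; a minimal sketch follows, assuming a GPT-2-style HuggingFace model for illustration (R1-style LLMs would need the analogous final norm and unembedding modules).

```python
# Hedged sketch of a logit-lens readout: project each intermediate hidden state
# through the model's final layer norm and unembedding to see which token
# dominates at that depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("Wait, let me re-check the previous step", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))  # last position
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax())!r}")
```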
[89] Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization
Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu
Main category: cs.CL
TL;DR: CoSMo is a framework that optimizes reasoning chains in Large Reasoning Models by eliminating structural redundancy through split-merge operations and structure-aligned reinforcement learning, improving accuracy while reducing computational overhead.
Details
Motivation: Large Reasoning Models generate verbose reasoning chains that cause significant latency and computational overhead. Current approaches indiscriminately restrict token volume rather than addressing structural redundancy in reasoning processes.Method: CoSMo uses a consistency-guided split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps. It employs structure-aligned reinforcement learning with a novel segment-level budget to supervise efficient reasoning structures during training.
Result: CoSMo achieves superior performance across multiple benchmarks and backbones, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.
Conclusion: CoSMo effectively addresses computational inefficiency in Large Reasoning Models by focusing on structural redundancy elimination rather than token restriction, demonstrating significant improvements in both accuracy and efficiency.
Abstract: While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.
[90] FASA: Frequency-aware Sparse Attention
Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
Main category: cs.CL
TL;DR: FASA is a novel framework that addresses KV cache memory bottleneck in LLMs by using query-aware token eviction based on RoPE frequency-chunk sparsity, achieving near-oracle performance with significant speedups.
Details
Motivation: LLMs face memory bottlenecks with long inputs due to KV cache footprint. Existing token pruning methods are inadequate - static methods risk information loss, while dynamic strategies use heuristics that don't capture query-dependent token importance.Method: FASA leverages functional sparsity in RoPE at frequency-chunk level, identifying a small subset of “dominant” FCs that correlate with full attention. It uses these as a free proxy to identify critical tokens, then performs focused attention only on this pruned subset.
Result: FASA outperforms all token-eviction baselines across long-context tasks (sequence modeling to complex CoT reasoning), achieving near-oracle accuracy. On LongBench-V1, it reaches ~100% of full-KV performance with only 256 tokens, and achieves 2.56× speedup using 18.9% of cache on AIME24.
Conclusion: FASA provides an effective solution to KV cache memory bottleneck by exploiting RoPE frequency-chunk sparsity for query-aware token eviction, demonstrating robust performance with significant efficiency gains.
Abstract: The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of “dominant” FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when only keeping 256 tokens, and achieves 2.56× speedup using just 18.9% of the cache on AIME24.
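One way to picture the frequency-chunk proxy: split the head dimension into contiguous chunks, score keys against the query per chunk, and keep the tokens ranked highest by the chunk that best agrees with full attention. The chunking scheme and selection rule below are illustrative assumptions, not FASA's exact algorithm.

```python
# Hedged sketch of a frequency-chunk proxy for query-aware token eviction.
import torch

def dominant_chunk_token_selection(q, K, n_chunks=4, keep=8):
    """q: (d,), K: (seq, d). Returns indices of `keep` proxy-selected tokens."""
    d = q.shape[-1]
    full = K @ q                                   # (seq,) full attention logits
    chunks = torch.chunk(torch.arange(d), n_chunks)
    # Score each chunk by its agreement (correlation) with the full scores.
    best_corr, best_scores = -1.0, None
    for idx in chunks:
        partial = K[:, idx] @ q[idx]
        corr = torch.corrcoef(torch.stack([partial, full]))[0, 1]
        if corr > best_corr:
            best_corr, best_scores = corr.item(), partial
    return best_scores.topk(keep).indices          # tokens kept in the pruned cache

q, K = torch.randn(64), torch.randn(128, 64)
print(dominant_chunk_token_selection(q, K))
```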
[91] CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu, Shu Wu, Haichao Shi, Xiao-Yu Zhang
Main category: cs.CL
TL;DR: LLMs can generate unsafe reasoning in Chain-of-Thought even when refusing harmful requests, requiring new safety analysis methods.
Details
Motivation: Current LLM safety evaluation assumes refusal responses indicate safe reasoning throughout the entire generation process, but this assumption may be flawed when models engage in Chain-of-Thought reasoning for harmful tasks like fake news generation.Method: Introduced a unified safety-analysis framework that deconstructs CoT generation across model layers and evaluates individual attention heads using Jacobian-based spectral metrics. Three interpretable measures (stability, geometry, energy) quantify how attention heads respond to or embed deceptive reasoning patterns.
Result: Experiments on multiple reasoning-oriented LLMs show generation risk rises significantly when thinking mode is activated, with critical routing decisions concentrated in few contiguous mid-depth layers. Specific attention heads responsible for divergence between safe outputs and unsafe reasoning were identified.
Conclusion: The assumption that refusal implies safety is challenged, revealing latent reasoning risks in LLMs. The framework provides new understanding for mitigating these risks by identifying problematic attention heads and reasoning patterns.
Abstract: From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures: stability, geometry, and energy to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.
cs.CV
[92] SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy
Zhuosen Bao, Xia Du, Zheng Lin, Jizhe Zhou, Zihan Fang, Jiening Wu, Yuxin Zhang, Zhe Chen, Chi-man Pun, Wei Ni, Jun Luo
Main category: cs.CV
TL;DR: SIDeR is a framework that decouples facial identity from visual appearance using semantic decomposition and diffusion models to generate privacy-preserving adversarial faces that maintain machine-level identity consistency while being visually anonymous.
Details
Motivation: With facial recognition integrated into online services, there's a need to protect privacy by decoupling identity information from visual representations during image storage and transmission while maintaining authorized access capabilities.Method: Decomposes facial images into machine-recognizable identity vectors and semantic appearance components, uses semantic-guided recomposition in diffusion model latent space, incorporates momentum-driven unrestricted perturbation optimization and semantic-visual balancing to generate diverse adversarial samples.
Result: Achieves 99% attack success rate in black-box scenarios and outperforms baselines by 41.28% in PSNR-based restoration quality on CelebA-HQ and FFHQ datasets.
Conclusion: SIDeR effectively protects face privacy through semantic decoupling while maintaining authorized restoration capability, offering a practical solution for privacy-preserving facial recognition systems.
Abstract: With the deep integration of facial recognition into online banking, identity verification, and other networked services, achieving effective decoupling of identity information from visual representations during image storage and transmission has become a critical challenge for privacy protection. To address this issue, we propose SIDeR, a Semantic decoupling-driven framework for unrestricted face privacy protection. SIDeR decomposes a facial image into a machine-recognizable identity feature vector and a visually perceptible semantic appearance component. By leveraging semantic-guided recomposition in the latent space of a diffusion model, it generates visually anonymous adversarial faces while maintaining machine-level identity consistency. The framework incorporates momentum-driven unrestricted perturbation optimization and a semantic-visual balancing factor to synthesize multiple visually diverse, highly natural adversarial samples. Furthermore, for authorized access, the protected image can be restored to its original form when the correct password is provided. Extensive experiments on the CelebA-HQ and FFHQ datasets demonstrate that SIDeR achieves a 99% attack success rate in black-box scenarios and outperforms baseline methods by 41.28% in PSNR-based restoration quality.
[93] Food Portion Estimation: From Pixels to Calories
Gautham Vinod, Fengqing Zhu
Main category: cs.CV
TL;DR: Survey paper exploring strategies for 3D food portion estimation from 2D images for dietary assessment
Details
Motivation: Image-based dietary assessment is crucial for health monitoring but suffers from estimating 3D food size from 2D images, which is critical for accurate portion estimation in chronic disease prevention and obesity care.Method: Survey and analysis of existing strategies including auxiliary inputs (depth maps, multi-view), model-based approaches (template matching), and deep learning methods (monocular images or combinations with auxiliary inputs).
Result: Comprehensive review of different techniques for accurate food portion estimation from images, highlighting approaches that bridge the gap between 2D inputs and 3D size estimation.
Conclusion: Various strategies exist for overcoming the 2D-to-3D estimation challenge in food portion assessment, with deep learning methods showing promise in bridging this gap for more accurate dietary monitoring.
Abstract: Reliance on images for dietary assessment is an important strategy to accurately and conveniently monitor an individual’s health, making it a vital mechanism in the prevention and care of chronic diseases and obesity. However, image-based dietary assessment suffers from estimating the three-dimensional size of food from 2D image inputs. Many strategies have been devised to overcome this critical limitation, such as the use of auxiliary inputs like depth maps, multi-view inputs, or model-based approaches such as template matching. Deep learning also helps bridge the gap by using either monocular images or combinations of the image and auxiliary inputs to precisely predict the food portion from the image input. In this paper, we explore the different strategies employed for accurate portion estimation.
[94] UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking
Bishoy Galoaa, Xiangyu Bai, Utsav Nandi, Sai Siddhartha Vivek Dhir Rangoju, Somaieh Amraee, Sarah Ostadabbas
Main category: cs.CV
TL;DR: UniTrack is a plug-and-play graph-theoretic loss function that enhances multi-object tracking performance by optimizing tracking-specific objectives through differentiable learning, integrating detection, identity preservation, and spatiotemporal consistency into a single trainable loss.
Details
Motivation: Existing graph-based MOT methods require redesigning tracking architectures. There's a need for a universal training objective that can improve tracking performance across different models without architectural modifications, addressing the challenge of optimizing multiple tracking objectives simultaneously.Method: UniTrack uses differentiable graph representation learning to create a unified loss function that integrates detection accuracy, identity preservation, and spatiotemporal consistency. It provides a plug-and-play objective that can be seamlessly integrated with existing MOT systems without requiring architectural changes.
Result: UniTrack demonstrates consistent improvements across diverse tracking models (Trackformer, MOTR, FairMOT, ByteTrack, GTR, MOTE) and multiple benchmarks, with up to 53% reduction in identity switches, 12% IDF1 improvements, and GTR achieving 9.7% MOTA improvement on SportsMOT.
Conclusion: UniTrack provides an effective universal training objective for multi-object tracking that significantly enhances performance across different architectures and datasets through differentiable graph-theoretic optimization, offering a practical plug-and-play solution for MOT improvement.
Abstract: We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets including Trackformer, MOTR, FairMOT, ByteTrack, GTR, and MOTE. Extensive evaluations show up to 53% reduction in identity switches and 12% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7% MOTA on SportsMOT.
[95] VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
Yiye Chen, Yanan Jian, Xiaoyi Dong, Shuxin Cao, Jing Wu, Patricio Vela, Benjamin E. Lundell, Dongdong Chen
Main category: cs.CV
TL;DR: A training framework to improve visual conditioning in Vision-Language-Action models by aligning action prediction with visual input through preference optimization and latent-space distillation.
Details
Motivation: Vision-Language-Action models often suffer from vision-action misalignment where action predictions show weak dependence on current visual states, leading to unreliable outputs. The authors observed that successful rollouts consistently exhibit stronger visual dependence than failed ones.Method: Proposes a training framework that first aligns action prediction with visual input via preference optimization on a track-following surrogate task, then transfers enhanced alignment to instruction-following tasks through latent-space distillation during supervised finetuning.
Result: The method improves both visual conditioning and task performance for discrete OpenVLA, and yields consistent gains when extended to the continuous OpenVLA-OFT setting, without architectural modifications or additional data collection.
Conclusion: Explicitly strengthening visual conditioning in VLA models through preference optimization and latent-space distillation effectively addresses vision-action misalignment and improves model reliability and performance.
Abstract: Vision-Language-Action (VLA) models have demonstrated strong performance across a wide range of robotic manipulation tasks. Despite the success, extending large pretrained Vision-Language Models (VLMs) to the action space can induce vision-action misalignment, where action predictions exhibit weak dependence on the current visual state, leading to unreliable action outputs. In this work, we study VLA models through the lens of visual conditioning and empirically show that successful rollouts consistently exhibit stronger visual dependence than failed ones. Motivated by this observation, we propose a training framework that explicitly strengthens visual conditioning in VLA models. Our approach first aligns action prediction with visual input via preference optimization on a track-following surrogate task, and then transfers the enhanced alignment to the instruction-following task through latent-space distillation during supervised finetuning. Without introducing architectural modifications or additional data collection, our method improves both visual conditioning and task performance for discrete OpenVLA, and further yields consistent gains when extended to the continuous OpenVLA-OFT setting. Project website: https://vista-vla.github.io/ .
[96] Visual concept ranking uncovers medical shortcuts used by large multimodal models
Joseph D. Janizek, Sonnet Xu, Junayd Lateef, Roxana Daneshjou
Main category: cs.CV
TL;DR: A method called Visual Concept Ranking (VCR) is introduced to identify important visual concepts in large multimodal models and audit their performance on medical tasks, particularly skin lesion classification, revealing demographic performance gaps.
Details
Motivation: The paper addresses the need for auditing methods to ensure reliability of machine learning models in safety-critical healthcare domains, specifically focusing on uncovering shortcomings in large multimodal models when applied to medical tasks.Method: The authors introduce Visual Concept Ranking (VCR), a method for identifying important visual concepts within large multimodal models. They apply this to investigate model behaviors on medical tasks, primarily focusing on malignant skin lesion classification from dermatology images, with supplemental experiments on chest radiographs and natural images.
Result: The research shows that LMMs display unexpected performance gaps between different demographic subgroups when prompted with demonstrating examples. VCR generates hypotheses about visual feature dependencies, which are then validated through manual interventions.
Conclusion: The VCR method provides a valuable tool for auditing large multimodal models in medical applications, revealing important visual concept dependencies and demographic biases that need to be addressed for reliable healthcare deployment.
Abstract: Ensuring the reliability of machine learning models in safety-critical domains such as healthcare requires auditing methods that can uncover model shortcomings. We introduce a method for identifying important visual concepts within large multimodal models (LMMs) and use it to investigate the behaviors these models exhibit when prompted with medical tasks. We primarily focus on the task of classifying malignant skin lesions from clinical dermatology images, with supplemental experiments including both chest radiographs and natural images. After showing how LMMs display unexpected gaps in performance between different demographic subgroups when prompted with demonstrating examples, we apply our method, Visual Concept Ranking (VCR), to these models and prompts. VCR generates hypotheses related to different visual feature dependencies, which we are then able to validate with manual interventions.
[97] CLEAR-HPV: Interpretable Concept Discovery for HPV-Associated Morphology in Whole-Slide Histology
Weiyi Qin, Yingci Liu-Swetz, Shiwei Tan, Hao Wang
Main category: cs.CV
TL;DR: CLEAR-HPV is an interpretable framework for HPV status prediction from whole-slide histopathology that discovers morphological concepts without requiring concept labels during training.
Details
Motivation: Current attention-based multiple instance learning (MIL) methods for HPV-related histopathology provide limited morphological interpretability despite achieving strong slide-level predictions. There's a need for more interpretable models that can automatically discover and represent morphological concepts.Method: CLEAR-HPV restructures the MIL latent space using attention to enable concept discovery without concept labels. It operates in an attention-weighted latent space to automatically discover morphological concepts (keratinizing, basaloid, stromal), generates spatial concept maps, and represents each slide using compact concept-fraction vectors.
Result: The framework reduces high-dimensional feature spaces (e.g., 1536 dimensions) to only 10 interpretable concepts while preserving predictive information. It generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC datasets.
Conclusion: CLEAR-HPV provides compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology, addressing the interpretability limitations of current methods.
Abstract: Human papillomavirus (HPV) status is a critical determinant of prognosis and treatment response in head and neck and cervical cancers. Although attention-based multiple instance learning (MIL) achieves strong slide-level prediction for HPV-related whole-slide histopathology, it provides limited morphologic interpretability. To address this limitation, we introduce Concept-Level Explainable Attention-guided Representation for HPV (CLEAR-HPV), a framework that restructures the MIL latent space using attention to enable concept discovery without requiring concept labels during training. Operating in an attention-weighted latent space, CLEAR-HPV automatically discovers keratinizing, basaloid, and stromal morphologic concepts, generates spatial concept maps, and represents each slide using a compact concept-fraction vector. CLEAR-HPV’s concept-fraction vectors preserve the predictive information of the original MIL embeddings while reducing the high-dimensional feature space (e.g., 1536 dimensions) to only 10 interpretable concepts. CLEAR-HPV generalizes consistently across TCGA-HNSCC, TCGA-CESC, and CPTAC-HNSCC, providing compact, concept-level interpretability through a general, backbone-agnostic framework for attention-based MIL models of whole-slide histopathology.
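A hedged sketch of how a concept-fraction descriptor could be computed from attention-weighted patch embeddings; the clustering choice (k-means with 10 concepts) and the weighting are assumptions for illustration, not CLEAR-HPV's procedure.

```python
# Hedged sketch: cluster attention-weighted patch embeddings into a small number
# of "concepts" and describe each slide by the attention-weighted fraction of
# patches falling in each concept.
import numpy as np
from sklearn.cluster import KMeans

def concept_fractions(patch_embeddings, attention, n_concepts=10, seed=0):
    """patch_embeddings: (n_patches, d); attention: (n_patches,) MIL weights."""
    weighted = patch_embeddings * attention[:, None]
    labels = KMeans(n_clusters=n_concepts, n_init=10, random_state=seed).fit_predict(weighted)
    fractions = np.bincount(labels, weights=attention, minlength=n_concepts)
    return fractions / fractions.sum()             # compact slide-level descriptor

emb = np.random.randn(500, 1536)                   # e.g. 1536-d patch features
attn = np.random.dirichlet(np.ones(500))
print(concept_fractions(emb, attn).round(3))
```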
[98] ARGaze: Autoregressive Transformers for Online Egocentric Gaze Estimation
Jia Li, Wenjie Zhao, Shijian Deng, Bolin Lai, Yuheng Wu, Ruijia Chen, Jon E. Froehlich, Yuhang Zhao, Yapeng Tian
Main category: cs.CV
TL;DR: ARGaze: Autoregressive gaze estimation for online first-person video using transformer decoder with gaze context window
Details
Motivation: Online egocentric gaze estimation lacks explicit head/eye signals and requires inferring visual attention from indirect cues; gaze exhibits strong temporal continuity during goal-directed activities, making recent gaze history a powerful prior for predicting future gaze.Method: ARGaze reformulates gaze estimation as sequential prediction using a transformer decoder that conditions on current visual features and a fixed-length Gaze Context Window of recent gaze target estimates, enabling causal, bounded-resource streaming inference.
Result: Achieves state-of-the-art performance across multiple egocentric benchmarks under online evaluation; ablations validate that autoregressive modeling with bounded gaze history is critical for robust prediction.
Conclusion: Autoregressive modeling with gaze context window effectively leverages temporal continuity for online egocentric gaze estimation, outperforming previous methods while maintaining causality and bounded resource usage.
Abstract: Online egocentric gaze estimation predicts where a camera wearer is looking from first-person video using only past and current frames, a task essential for augmented reality and assistive technologies. Unlike third-person gaze estimation, this setting lacks explicit head or eye signals, requiring models to infer current visual attention from sparse, indirect cues such as hand-object interactions and salient scene content. We observe that gaze exhibits strong temporal continuity during goal-directed activities: knowing where a person looked recently provides a powerful prior for predicting where they look next. Inspired by vision-conditioned autoregressive decoding in vision-language models, we propose ARGaze, which reformulates gaze estimation as sequential prediction: at each timestep, a transformer decoder predicts current gaze by conditioning on (i) current visual features and (ii) a fixed-length Gaze Context Window of recent gaze target estimates. This design enforces causality and enables bounded-resource streaming inference. We achieve state-of-the-art performance across multiple egocentric benchmarks under online evaluation, with extensive ablations validating that autoregressive modeling with bounded gaze history is critical for robust prediction. We will release our source code and pre-trained models.
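The bounded, causal streaming pattern implied by a fixed-length Gaze Context Window can be sketched with a deque; `extract_features` and `decode_gaze` below are stand-ins for the visual encoder and transformer decoder, not ARGaze's components.

```python
# Hedged sketch of streaming inference with a bounded gaze history.
from collections import deque

def stream_gaze(frames, extract_features, decode_gaze, window=8):
    context = deque(maxlen=window)        # bounded memory -> streaming-friendly
    outputs = []
    for frame in frames:                  # causal: only past and current frames
        feats = extract_features(frame)
        gaze = decode_gaze(feats, list(context))
        context.append(gaze)
        outputs.append(gaze)
    return outputs
```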
[99] AirGlove: Exploring Egocentric 3D Hand Tracking and Appearance Generalization for Sensing Gloves
Wenhui Cui, Ziyi Kou, Chuan Qin, Ergys Ristani, Li Guan
Main category: cs.CV
TL;DR: AirGlove improves vision-based hand tracking for gloved hands by generalizing learned glove representations to new glove designs with limited data.
Details
Motivation: Existing vision-based hand tracking models perform well on bare hands but suffer significant degradation on gloved hands due to appearance differences. Sensor-based glove tracking has accuracy issues affected by calibration quality.Method: Proposes AirGlove which leverages existing gloves to generalize learned glove representations to new glove designs with limited data. Evaluates vision-based models on gloved hands under zero-shot and fine-tuning setups.
Result: AirGlove effectively generalizes hand pose models to new glove designs and achieves significant performance boost over compared schemes across multiple sensing gloves.
Conclusion: Vision-based approaches can be adapted for gloved hand tracking by addressing appearance gaps, with AirGlove providing an effective solution for generalizing to new glove designs.
Abstract: Sensing gloves have become important tools for teleoperation and robotic policy learning as they are able to provide rich signals like speed, acceleration and tactile feedback. A common approach to track gloved hands is to directly use the sensor signals (e.g., angular velocity, gravity orientation) to estimate 3D hand poses. However, sensor-based tracking can be restrictive in practice as the accuracy is often impacted by sensor signal and calibration quality. Recent advances in vision-based approaches have achieved strong performance on human hands via large-scale pre-training, but their performance on gloved hands with distinct visual appearances remains underexplored. In this work, we present the first systematic evaluation of vision-based hand tracking models on gloved hands under both zero-shot and fine-tuning setups. Our analysis shows that existing bare-hand models suffer from substantial performance degradation on sensing gloves due to large appearance gap between bare-hand and glove designs. We therefore propose AirGlove, which leverages existing gloves to generalize the learned glove representations towards new gloves with limited data. Experiments with multiple sensing gloves show that AirGlove effectively generalizes the hand pose models to new glove designs and achieves a significant performance boost over the compared schemes.
[100] SHaSaM: Submodular Hard Sample Mining for Fair Facial Attribute Recognition
Anay Majee, Rishabh Iyer
Main category: cs.CV
TL;DR: SHaSaM is a novel combinatorial approach for fairness-driven representation learning that uses submodular optimization to mine hard samples and minimize sensitive attribute influence while maintaining performance.
Details
Motivation: Deep neural networks often inherit social and demographic biases from training data, leading to unfair predictions based on sensitive attributes like race, age, and gender. Existing methods struggle with data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening both fairness and performance.Method: Two-stage approach: 1) SHaSaM-MINE uses submodular subset selection to mine hard positive and negative samples, mitigating data imbalance; 2) SHaSaM-LEARN employs combinatorial loss functions based on Submodular Conditional Mutual Information to maximize decision boundaries between target classes while minimizing influence of sensitive attributes.
Result: Experiments on CelebA and UTKFace datasets show SHaSaM achieves state-of-the-art results with up to 2.7 points improvement in model fairness (Equalized Odds) and 3.5% gain in Accuracy, within fewer epochs compared to existing methods.
Conclusion: SHaSaM provides a unified formulation that restricts models from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance, addressing both data imbalance and sensitive attribute influence problems.
Abstract: Deep neural networks often inherit social and demographic biases from annotated data during model training, leading to unfair predictions, especially in the presence of sensitive attributes like race, age, and gender. Existing methods fall prey to the inherent data imbalance between attribute groups and inadvertently emphasize sensitive attributes, worsening both unfairness and performance. To surmount these challenges, we propose SHaSaM (Submodular Hard Sample Mining), a novel combinatorial approach that models fairness-driven representation learning as a submodular hard-sample mining problem. Our two-stage approach comprises SHaSaM-MINE, which introduces a submodular subset selection strategy to mine hard positives and negatives, effectively mitigating data imbalance, and SHaSaM-LEARN, which introduces a family of combinatorial loss functions based on Submodular Conditional Mutual Information to maximize the decision boundary between target classes while minimizing the influence of sensitive attributes. This unified formulation restricts the model from learning features tied to sensitive attributes, significantly enhancing fairness without sacrificing performance. Experiments on CelebA and UTKFace demonstrate that SHaSaM achieves state-of-the-art results, with up to 2.7 points improvement in model fairness (Equalized Odds) and a 3.5% gain in Accuracy, in fewer epochs compared to existing methods.
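To make the mining stage concrete, the sketch below greedily selects a subset under a facility-location submodular objective, the standard pattern for submodular subset selection. The objective, similarity measure, and budget are illustrative assumptions; SHaSaM-MINE's exact functions differ.

```python
# Minimal sketch of greedy submodular subset selection for hard-sample mining,
# using a facility-location objective over pairwise cosine similarities.
import numpy as np

def facility_location_gain(sim, selected, candidate):
    # Marginal gain of adding `candidate`: improvement in best coverage per point.
    current = sim[:, selected].max(axis=1) if selected else np.zeros(sim.shape[0])
    return np.maximum(sim[:, candidate] - current, 0.0).sum()

def greedy_select(features, budget):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                                   # cosine similarity between samples
    selected = []
    for _ in range(budget):
        cands = [c for c in range(len(features)) if c not in selected]
        gains = [facility_location_gain(sim, selected, c) for c in cands]
        selected.append(cands[int(np.argmax(gains))])
    return selected

# Toy usage: pick 5 representative samples from 100 embeddings.
rng = np.random.default_rng(0)
print(greedy_select(rng.normal(size=(100, 32)), budget=5))
```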
[101] LOBSTgER-enhance: an underwater image enhancement pipeline
Andreas Mentzelopoulos, Keith Ellenbogen
Main category: cs.CV
TL;DR: Diffusion-based image-to-image pipeline that reverses underwater degradations using synthetic corruption and training on awareness photography dataset
Details
Motivation: Underwater photography suffers from reduced contrast, spatial blur, and wavelength-dependent color distortions that obscure marine life vibrancy, requiring heavy post-processingMethod: Develops an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation
Result: Achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using ~11M parameters trained from scratch on ~2.5k images
Conclusion: Proposed diffusion-based approach effectively addresses underwater image degradation challenges with relatively small model size and dataset
Abstract: Underwater photography presents significant inherent challenges including reduced contrast, spatial blur, and wavelength-dependent color distortions. These effects can obscure the vibrancy of marine life, and awareness photographers in particular often face heavy post-processing pipelines to correct for these distortions. We develop an image-to-image pipeline that learns to reverse underwater degradations by introducing a synthetic corruption pipeline and learning to reverse its effects with diffusion-based generation. Training and evaluation are performed on a small, high-quality dataset of awareness photography images by Keith Ellenbogen. The proposed methodology achieves high perceptual consistency and strong generalization in synthesizing 512x768 images using a ~11M-parameter model trained from scratch on ~2.5k images.
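The key trick is training on synthetically corrupted clean photographs so the model can learn to invert the degradation. A toy corruption sketch is below; the attenuation coefficients, veiling term, and Gaussian blur are illustrative assumptions rather than the paper's exact operators.

```python
# Minimal sketch of a synthetic underwater corruption pipeline: wavelength-dependent
# color attenuation, contrast reduction, and spatial blur.
import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt_underwater(img, depth=5.0):
    # img: float array in [0, 1], shape (H, W, 3), channels = (R, G, B).
    img = img.astype(np.float32)
    # Red light attenuates fastest with depth, blue slowest (Beer-Lambert style).
    attenuation = np.exp(-depth * np.array([0.35, 0.07, 0.04]))
    img = img * attenuation
    # Veiling light / backscatter lowers contrast and adds a blue-green cast.
    veil = np.array([0.05, 0.25, 0.30]) * (1.0 - np.exp(-depth * 0.1))
    img = img * 0.8 + veil
    # Forward scattering produces spatial blur (no blur across channels).
    img = gaussian_filter(img, sigma=(1.5, 1.5, 0))
    return np.clip(img, 0.0, 1.0)

# Toy usage on a random "clean" image; a model then learns to invert this mapping.
clean = np.random.rand(64, 64, 3)
degraded = corrupt_underwater(clean, depth=8.0)
```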
[102] ShapePuri: Shape Guided and Appearance Generalized Adversarial Purification
Zhe Li, Bernhard Kainz
Main category: cs.CV
TL;DR: ShapePuri: A novel adversarial defense framework that uses shape guidance and appearance debiasing to improve robustness without computational overhead
Details
Motivation: Deep neural networks are vulnerable to adversarial attacks, and existing defense methods like diffusion-based purification have high computational costs and information loss. There's a need for efficient, scalable defenses that maintain performance.Method: ShapePuri integrates two components: 1) Shape Encoding Module (SEM) that provides dense geometric guidance using Signed Distance Functions (SDF), and 2) Global Appearance Debiasing (GAD) module that mitigates appearance bias through stochastic transformations. The framework aligns model representations with stable structural invariants.
Result: Achieves 84.06% clean accuracy and 81.64% robust accuracy under AutoAttack protocol, becoming the first defense framework to surpass the 80% threshold on this benchmark. Preserves prediction stability without requiring auxiliary modules or additional computational cost.
Conclusion: ShapePuri provides a scalable and efficient adversarial defense that enhances robustness by leveraging shape guidance and appearance debiasing, offering a practical solution for real-world deployment without computational overhead.
Abstract: Deep neural networks demonstrate impressive performance in visual recognition, but they remain vulnerable to adversarial attacks that are imperceptible to humans. Although existing defense strategies such as adversarial training and purification have achieved progress, diffusion-based purification often involves high computational costs and information loss. To address these challenges, we introduce Shape Guided Purification (ShapePuri), a novel defense framework that enhances robustness by aligning model representations with stable structural invariants. ShapePuri integrates two components: a Shape Encoding Module (SEM) that provides dense geometric guidance through Signed Distance Functions (SDF), and a Global Appearance Debiasing (GAD) module that mitigates appearance bias via stochastic transformations. In our experiments, ShapePuri achieves 84.06% clean accuracy and 81.64% robust accuracy under the AutoAttack protocol, representing the first defense framework to surpass the 80% threshold on this benchmark. Our approach provides a scalable and efficient adversarial defense that preserves prediction stability during inference without requiring auxiliary modules or additional computational cost.
[103] PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction
Ju Shen, Chen Chen, Tam V. Nguyen, Vijayan K. Asari
Main category: cs.CV
TL;DR: PoseGaussian: A pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis using body pose as both structural prior and temporal cue.
Details
Motivation: To address challenges in dynamic human scenes (articulated motion, self-occlusion) by better leveraging human pose information for novel view synthesis, going beyond simple conditioning or warping approaches.Method: Uses pose as structural prior fused with color encoder for depth refinement, and as temporal cue processed by dedicated pose encoder for temporal consistency. Integrated into differentiable end-to-end trainable pipeline based on Gaussian Splatting.
Result: Achieves state-of-the-art performance on ZJU-MoCap, THuman2.0, and in-house datasets (PSNR 30.86, SSIM 0.979, LPIPS 0.028) with real-time rendering at 100 FPS.
Conclusion: PoseGaussian effectively integrates pose information into both geometric and temporal stages, improving robustness and generalization for human novel view synthesis while maintaining real-time efficiency.
Abstract: We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
[104] GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang
Main category: cs.CV
TL;DR: Video generative models repurposed as reward models for human preference alignment, using energy-based formulation and synthetic negative videos to capture temporal dynamics.
Details
Motivation: Current video generative model alignment relies on Vision-Language Models (VLMs) for reward modeling, but VLMs struggle to capture subtle temporal dynamics. There's a need for temporally-aware reward models that can better evaluate video quality.Method: Proposes Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ) that transforms video generation models into reward models using an energy-based model formulation. Uses contrastive training with synthetic negative videos created through latent-space perturbations: temporal slicing, feature swapping, and frame shuffling to force learning of meaningful spatiotemporal features.
Result: Achieves state-of-the-art performance on GenAI-Bench and MonteBench benchmarks using only 30K human annotations (6× to 65× fewer than existing VLM-based approaches). Demonstrates superior temporal awareness and video quality discrimination.
Conclusion: Video generative models can be effectively repurposed as temporally-aware reward models through energy-based formulation and synthetic negative training, offering a more efficient and accurate approach to video quality evaluation than VLM-based methods.
Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
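A compact way to see the idea is to perturb video latents into synthetic negatives and train an energy head so real clips score low and perturbed clips score high. The sketch below illustrates this; the perturbations operate on latents of shape (B, T, D), and the energy head and margin loss are assumptions, not GT-SVJ's exact design.

```python
# Minimal sketch of synthetic-negative construction plus a contrastive energy objective.
import torch
import torch.nn as nn

def frame_shuffle(latents):
    # Permute the temporal axis to break motion continuity.
    perm = torch.randperm(latents.size(1))
    return latents[:, perm]

def temporal_slice(latents, keep=0.5):
    # Keep a contiguous slice and tile it, simulating truncated dynamics.
    t = latents.size(1)
    k = max(1, int(t * keep))
    start = torch.randint(0, t - k + 1, (1,)).item()
    piece = latents[:, start:start + k]
    reps = -(-t // k)                  # ceil division
    return piece.repeat(1, reps, 1)[:, :t]

def feature_swap(latents):
    # Swap latent features between clips in the batch.
    return latents.roll(shifts=1, dims=0)

class EnergyHead(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, latents):        # (B, T, D) -> scalar energy per clip
        return self.mlp(latents.mean(dim=1)).squeeze(-1)

def contrastive_energy_loss(head, pos, margin=1.0):
    neg = frame_shuffle(pos)           # one of several possible perturbations
    e_pos, e_neg = head(pos), head(neg)
    # Low energy for real clips, high energy for perturbed ones.
    return torch.clamp(margin + e_pos - e_neg, min=0).mean()

head = EnergyHead()
loss = contrastive_energy_loss(head, torch.randn(4, 16, 512))
```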
[105] Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures
Chuqin Zhou, Xiaoyue Ling, Yunuo Chen, Jincheng Dai, Guo Lu, Wenjun Zhang
Main category: cs.CV
TL;DR: A unified framework for ultra-low bitrate image compression that integrates explicit semantic representations and implicit detail encoding using diffusion models and reverse-channel coding.
Details
Motivation: Existing neural codecs perform poorly at ultra-low bitrates, and generative compression methods face a tradeoff between semantic faithfulness (explicit methods) and perceptual realism (implicit methods).Method: Proposes a training-free framework that conditions a diffusion model on explicit high-level semantics while using reverse-channel coding to implicitly convey fine-grained details. Includes a plug-in encoder for flexible distortion-perception tradeoff control.
Result: Achieves state-of-the-art rate-perception performance, surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
Conclusion: The unified framework successfully bridges the gap between explicit and implicit representations for ultra-low bitrate image compression, achieving superior performance without requiring training.
Abstract: While recent neural codecs achieve strong performance at low bitrates when optimized for perceptual quality, their effectiveness deteriorates significantly under ultra-low bitrate conditions. To mitigate this, generative compression methods leveraging semantic priors from pretrained models have emerged as a promising paradigm. However, existing approaches are fundamentally constrained by a tradeoff between semantic faithfulness and perceptual realism. Methods based on explicit representations preserve content structure but often lack fine-grained textures, whereas implicit methods can synthesize visually plausible details at the cost of semantic drift. In this work, we propose a unified framework that bridges this gap by coherently integrating explicit and implicit representations in a training-free manner. Specifically, we condition a diffusion model on explicit high-level semantics while employing reverse-channel coding to implicitly convey fine-grained details. Moreover, we introduce a plug-in encoder that enables flexible control of the distortion-perception tradeoff by modulating the implicit information. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art rate-perception performance, outperforming existing methods and surpassing DiffC by 29.92%, 19.33%, and 20.89% in DISTS BD-Rate on the Kodak, DIV2K, and CLIC2020 datasets, respectively.
[106] E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching
Jiahao Nie, Wenbin An, Gongjie Zhang, Yicheng Xu, Yap-Peng Tan, Alex C. Kot, Shijian Lu
Main category: cs.CV
TL;DR: E.M.Ground: A novel Video LLM for Temporal Video Grounding that uses event tokens and smoothing to better localize video segments by capturing semantic continuity rather than just matching start/end frames.
Details
Motivation: Existing Video LLMs for temporal video grounding rely heavily on exact timestamp matching of start/end frames, which fails to capture semantic continuity and event integrity, leading to ambiguities in localization.Method: Introduces three innovations: 1) special
Result: Extensive experiments on benchmark datasets show E.M.Ground consistently outperforms state-of-the-art Video LLMs by significant margins.
Conclusion: E.M.Ground addresses limitations of current Video LLMs for temporal video grounding by focusing on holistic event perception and semantic continuity, achieving superior performance.
Abstract: Despite recent advances in Video Large Language Models (Vid-LLMs), Temporal Video Grounding (TVG), which aims to precisely localize time segments corresponding to query events, remains a significant challenge. Existing methods often match start and end frames by comparing frame features with two separate tokens, relying heavily on exact timestamps. However, this approach fails to capture the event’s semantic continuity and integrity, leading to ambiguities. To address this, we propose E.M.Ground, a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground introduces three key innovations: (i) a special
[107] Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation
Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu
Main category: cs.CV
TL;DR: MPA (Multi-view Progressive Adaptation) improves cross-domain few-shot segmentation by progressively adapting models to target domains through hybrid data augmentation and dual-chain multi-view prediction strategies.
Details
Motivation: Existing cross-domain few-shot segmentation methods struggle with limited target domain data diversity and substantial domain gaps, which hinder effective adaptation from source to target domains.Method: Proposes two key components: 1) Hybrid Progressive Augmentation that generates increasingly diverse and complex views through cumulative strong augmentations, and 2) Dual-chain Multi-view Prediction that leverages these views through sequential and parallel learning paths with extensive supervision.
Result: MPA outperforms state-of-the-art methods by a large margin (+7.0%) in cross-domain few-shot segmentation benchmarks.
Conclusion: The progressive adaptation approach from both data and strategy perspectives effectively addresses domain gaps and data scarcity in few-shot segmentation, enabling robust adaptation to target domains.
Abstract: Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model’s initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).
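The Hybrid Progressive Augmentation idea, generating views of increasing difficulty by stacking strong augmentations cumulatively, can be sketched as follows. The specific transforms and stage count are illustrative assumptions, not MPA's exact recipe.

```python
# Minimal sketch of cumulative strong augmentation: stage k applies the first k strong
# transforms on top of a base crop, so later views are progressively more complex.
import torch
from torchvision import transforms

strong_ops = [
    transforms.ColorJitter(brightness=0.6, contrast=0.6, saturation=0.6),
    transforms.RandomGrayscale(p=1.0),
    transforms.GaussianBlur(kernel_size=9, sigma=(1.0, 3.0)),
    transforms.RandomErasing(p=1.0, scale=(0.05, 0.2)),
]

def progressive_views(image, num_stages=4):
    base = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
        transforms.RandomHorizontalFlip(),
    ])
    views = []
    for k in range(1, num_stages + 1):
        pipeline = transforms.Compose([base] + strong_ops[:k])
        views.append(pipeline(image))
    return views

# Toy usage with a random tensor image (C, H, W); views[0] is mildest, views[-1] hardest.
views = progressive_views(torch.rand(3, 256, 256))
```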
[108] Boosting SAM for Cross-Domain Few-Shot Segmentation via Conditional Point Sparsification
Jiahao Nie, Yun Xing, Wenbin An, Qingsong Zhao, Jiawei Shao, Yap-Peng Tan, Alex C. Kot, Shijian Lu, Xuelong Li
Main category: cs.CV
TL;DR: Training-free approach using SAM for cross-domain few-shot segmentation via conditional point sparsification to handle domain shifts
Details
Motivation: SAM-based few-shot segmentation methods struggle with cross-domain scenarios (medical/satellite images) due to domain shifts disrupting point-image interactions learned by SAM. Dense point matching performs poorly under these conditions.Method: Proposes Conditional Point Sparsification (CPS) - a training-free approach that adaptively sparsifies dense matched points based on reference exemplars with ground-truth masks. Uses reference images to guide SAM interactions for cross-domain images.
Result: Extensive experiments show CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
Conclusion: Point density is crucial for cross-domain few-shot segmentation, and adaptive sparsification guided by reference exemplars enables more accurate segmentation results across domains.
Abstract: Motivated by the success of the Segment Anything Model (SAM) in promptable segmentation, recent studies leverage SAM to develop training-free solutions for few-shot segmentation, which aims to predict object masks in the target image based on a few reference exemplars. These SAM-based methods typically rely on point matching between reference and target images and use the matched dense points as prompts for mask prediction. However, we observe that dense points perform poorly in Cross-Domain Few-Shot Segmentation (CD-FSS), where target images are from medical or satellite domains. We attribute this issue to large domain shifts that disrupt the point-image interactions learned by SAM, and find that point density plays a crucial role under such conditions. To address this challenge, we propose Conditional Point Sparsification (CPS), a training-free approach that adaptively guides SAM interactions for cross-domain images based on reference exemplars. Leveraging ground-truth masks, the reference images provide reliable guidance for adaptively sparsifying dense matched points, enabling more accurate segmentation results. Extensive experiments demonstrate that CPS outperforms existing training-free SAM-based methods across diverse CD-FSS datasets.
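As a rough illustration of point sparsification before prompting SAM, the sketch below keeps only a fraction of the most confident matched points, with that fraction set by the reference mask's foreground ratio. This heuristic is an assumption for illustration and is not the exact CPS rule.

```python
# Minimal sketch of sparsifying dense matched points before prompting SAM.
import numpy as np

def sparsify_points(points, scores, ref_mask, min_points=3):
    # points: (N, 2) dense matched prompt coordinates; scores: (N,) match confidence.
    # ref_mask: binary reference mask used to estimate how many prompts to keep.
    fg_ratio = ref_mask.mean()                      # small object -> fewer prompts
    keep = max(min_points, int(len(points) * fg_ratio))
    order = np.argsort(scores)[::-1]                # most confident matches first
    return points[order[:keep]]

# Toy usage: 200 dense matches reduced according to a 10%-foreground reference mask.
pts = np.random.rand(200, 2) * 512
conf = np.random.rand(200)
mask = (np.random.rand(64, 64) < 0.1).astype(np.float32)
sparse_pts = sparsify_points(pts, conf, mask)
```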
[109] Image inpainting for corrupted images by using the semi-super resolution GAN
Mehrshad Momen-Tayefeh, Mehrdad Momen-Tayefeh, Amir Ali Ghafourian Ghahramani
Main category: cs.CV
TL;DR: Proposes a GAN-based image inpainting approach with a novel Semi-SRGAN variant to handle varying levels of pixel corruption, tested on three datasets.
Details
Motivation: Address the challenge of restoring images with varying degrees of corruption, particularly focusing on how much corruption deep learning models can effectively handle.Method: Developed a Generative Adversarial Network (GAN) for learning missing pixels and created a distinct variant called Semi-SRGAN (SSRGAN). Trained with varying levels of pixel corruption to optimize accuracy.
Result: Tested on three diverse datasets to assess robustness and accuracy, achieving optimal performance through training with different corruption levels.
Conclusion: The proposed GAN-based approach with SSRGAN effectively handles image inpainting for corrupted images with varying degrees of pixel corruption.
Abstract: Image inpainting is a valuable technique for enhancing images that have been corrupted. The primary challenge in this research revolves around the extent of corruption in the input image that the deep learning model must restore. To address this challenge, we introduce a Generative Adversarial Network (GAN) for learning and replicating the missing pixels. Additionally, we have developed a distinct variant of the Super-Resolution GAN (SRGAN), which we refer to as the Semi-SRGAN (SSRGAN). Furthermore, we leveraged three diverse datasets to assess the robustness and accuracy of our proposed model. Our training process involves varying levels of pixel corruption to attain optimal accuracy and generate high-quality images.
[110] PatchFlow: Leveraging a Flow-Based Model with Patch Features
Boxiang Zhang, Baijian Yang, Xiaoming Wang, Corey Vian
Main category: cs.CV
TL;DR: A computer vision approach combining local patch features with normalizing flow and adapter modules for anomaly detection in die casting surface defects, achieving state-of-the-art performance on industrial datasets.
Details
Motivation: Surface defects in die casting impede quality control; computer vision techniques can automate defect detection but need adaptation to bridge the gap between generic pretrained features and industrial product images.Method: Combines local neighbor-aware patch features with normalizing flow model, introduces adapter module to bridge pretrained feature extractor and industrial images, enabling anomaly detection without requiring anomalous samples for training.
Result: Achieves 99.28% AUROC on MVTec AD (20% error reduction), 96.48% on VisA (28.2% error reduction), and 95.77% accuracy on proprietary die casting dataset without anomalous training samples.
Conclusion: Demonstrates potential of computer vision and deep learning for advancing inspection capabilities in die casting industry through effective anomaly detection without requiring defective samples.
Abstract: Die casting plays a crucial role across various industries due to its ability to craft intricate shapes with high precision and smooth surfaces. However, surface defects remain a major issue that impedes die casting quality control. Recently, computer vision techniques have been explored to automate and improve defect detection. In this work, we combine local neighbor-aware patch features with a normalizing flow model and bridge the gap between the generic pretrained feature extractor and industrial product images by introducing an adapter module to increase the efficiency and accuracy of automated anomaly detection. Compared to state-of-the-art methods, our approach reduces the error rate by 20% on the MVTec AD dataset, achieving an image-level AUROC of 99.28%. Our approach also enhances performance on the VisA dataset, achieving an image-level AUROC of 96.48%. Compared to the state-of-the-art models, this represents a 28.2% reduction in error. Additionally, experiments on a proprietary die casting dataset yield an accuracy of 95.77% for anomaly detection, without requiring any anomalous samples for training. Our method illustrates the potential of leveraging computer vision and deep learning techniques to advance inspection capabilities for the die casting industry.
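Flow-based anomaly detection of this kind scores patches by their likelihood under a normalizing flow fit to normal data. The sketch below uses a single affine-coupling layer as a stand-in for the full flow and omits the adapter module and neighbor-aware pooling; it illustrates the scoring pattern, not the paper's implementation.

```python
# Minimal sketch of flow-based anomaly scoring over patch features: low-likelihood
# patches get high anomaly scores; the image score is the max over patches.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 128), nn.ReLU(),
                                 nn.Linear(128, dim))  # outputs scale and shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                               # stabilize the log-scale
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)
        return torch.cat([x1, y2], dim=-1), log_det

def patch_nll(flow, patch_feats):
    # patch_feats: (num_patches, dim) features from a pretrained backbone (+ adapter).
    z, log_det = flow(patch_feats)
    log_pz = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * z.size(-1) * math.log(2 * math.pi)
    return -(log_pz + log_det)                          # higher = more anomalous

flow = AffineCoupling(dim=64)
scores = patch_nll(flow, torch.randn(196, 64))          # one score per patch
image_score = scores.max()                              # image-level anomaly score
```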
[111] Active Label Cleaning for Reliable Detection of Electron Dense Deposits in Transmission Electron Microscopy Images
Jieyun Tan, Shuo Liu, Guibin Zhang, Ziqi Li, Jian Geng, Lei Zhang, Lei Cao
Main category: cs.CV
TL;DR: Active label cleaning method for denoising crowdsourced medical image datasets using active learning to select valuable noisy samples for expert re-annotation
Details
Motivation: Automated detection of electron dense deposits in glomerular disease is hindered by scarce high-quality labeled data; crowdsourcing reduces annotation cost but introduces label noise that needs to be addressedMethod: Proposes active label cleaning using active learning to select most valuable noisy samples for expert re-annotation; includes Label Selection Module that leverages discrepancies between crowdsourced labels and model predictions for sample selection and instance-level noise grading
Result: Achieves 67.18% AP50 on private dataset (18.83% improvement over training on noisy labels), reaching 95.79% of performance with full expert annotation while reducing annotation cost by 73.30%
Conclusion: Provides practical, cost-effective solution for developing reliable medical AI with limited expert resources through efficient denoising of crowdsourced datasets
Abstract: Automated detection of electron dense deposits (EDD) in glomerular disease is hindered by the scarcity of high-quality labeled data. While crowdsourcing reduces annotation cost, it introduces label noise. We propose an active label cleaning method to efficiently denoise crowdsourced datasets. Our approach uses active learning to select the most valuable noisy samples for expert re-annotation, building high-accuracy cleaning models. A Label Selection Module leverages discrepancies between crowdsourced labels and model predictions for both sample selection and instance-level noise grading. Experiments show our method achieves 67.18% AP50 on a private dataset, an 18.83% improvement over training on noisy labels. This performance reaches 95.79% of that with full expert annotation while reducing annotation cost by 73.30%. The method provides a practical, cost-effective solution for developing reliable medical AI with limited expert resources.
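The selection step, ranking images by how strongly the crowdsourced boxes disagree with the current model's predictions, can be sketched with a simple IoU-based disagreement score. The scoring rule and budget are illustrative assumptions, not the paper's Label Selection Module.

```python
# Minimal sketch of discrepancy-based sample selection for expert re-annotation.
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def disagreement(crowd_boxes, pred_boxes):
    # Mean over crowd boxes of (1 - best-matching predicted IoU).
    if len(crowd_boxes) == 0 or len(pred_boxes) == 0:
        return 1.0
    return float(np.mean([1 - max(iou(c, p) for p in pred_boxes) for c in crowd_boxes]))

def select_for_relabeling(crowd, preds, budget):
    # crowd / preds: lists (one entry per image) of box lists.
    scores = [disagreement(c, p) for c, p in zip(crowd, preds)]
    return [int(i) for i in np.argsort(scores)[::-1][:budget]]  # most discrepant first

# Toy usage: two images, relabeling budget of one; image 1 is most discrepant.
crowd = [[(10, 10, 50, 50)], [(0, 0, 20, 20)]]
preds = [[(12, 11, 49, 52)], [(60, 60, 90, 90)]]
print(select_for_relabeling(crowd, preds, budget=1))
```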
[112] RFM-Pose:Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation
Diya He, Qingchen Liu, Cong Zhang, Jiahu Qin
Main category: cs.CV
TL;DR: RFM-Pose accelerates category-level 6D object pose estimation using flow-matching generative models and reinforcement learning for efficient hypothesis sampling and refinement.
Details
Motivation: Current score-based generative models for object pose estimation suffer from high sampling costs and inefficiency in handling rotational symmetry ambiguity, limiting their practical application in real-time scenarios like virtual reality and embodied AI.Method: Proposes RFM-Pose framework that: 1) Uses flow-matching generative models instead of diffusion for more efficient pose generation along optimal transport paths, 2) Casts sampling as a Markov decision process and applies proximal policy optimization to fine-tune sampling policy, 3) Interprets flow field as learnable policy and maps estimator to value network for joint optimization of pose generation and hypothesis scoring.
Result: Achieves favorable performance on REAL275 benchmark while significantly reducing computational cost. Also demonstrates adaptability to object pose tracking with competitive results.
Conclusion: RFM-Pose provides an efficient framework for category-level 6D object pose estimation that balances accuracy and computational efficiency through flow-matching and reinforcement learning techniques.
Abstract: Object pose estimation is a fundamental problem in computer vision and plays a critical role in virtual reality and embodied intelligence, where agents must understand and interact with objects in 3D space. Recently, score-based generative models have, to some extent, solved the rotational symmetry ambiguity problem in category-level pose estimation, but their efficiency remains limited by the high sampling cost of score-based diffusion. In this work, we propose a new framework, RFM-Pose, that accelerates category-level 6D object pose generation while actively evaluating sampled hypotheses. To improve sampling efficiency, we adopt a flow-matching generative model and generate pose candidates along an optimal transport path from a simple prior to the pose distribution. To further refine these candidates, we cast the flow-matching sampling process as a Markov decision process and apply proximal policy optimization to fine-tune the sampling policy. In particular, we interpret the flow field as a learnable policy and map an estimator to a value network, enabling joint optimization of pose generation and hypothesis scoring within a reinforcement learning framework. Experiments on the REAL275 benchmark demonstrate that RFM-Pose achieves favorable performance while significantly reducing computational cost. Moreover, similar to prior work, our approach can be readily adapted to object pose tracking and attains competitive results in this setting.
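Flow-matching sampling itself is just a few integration steps of a learned velocity field from a simple prior. The sketch below shows that sampling loop; the velocity network, the 9-D pose parameterization, and the Euler integrator are illustrative assumptions, and the PPO fine-tuning stage is omitted.

```python
# Minimal sketch of flow-matching sampling for pose hypotheses.
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, pose_dim=9, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim + cond_dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, pose_dim))

    def forward(self, x, t, cond):
        # x: (B, pose_dim) current sample; t: (B, 1) time in [0, 1]; cond: (B, cond_dim).
        return self.net(torch.cat([x, t, cond], dim=-1))

@torch.no_grad()
def sample_poses(field, cond, num_hypotheses=16, steps=10):
    x = torch.randn(num_hypotheses, 9)                   # simple Gaussian prior
    cond = cond.expand(num_hypotheses, -1)
    for i in range(steps):
        t = torch.full((num_hypotheses, 1), i / steps)
        x = x + field(x, t, cond) / steps                # Euler step along the flow
    return x                                             # candidate pose hypotheses

field = VelocityField()
candidates = sample_poses(field, cond=torch.randn(1, 256))
```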
[113] ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network
Junzhou Li, Manqi Zhao, Yilin Gao, Zhiheng Yu, Yin Li, Dongsheng Jiang, Li Xiao
Main category: cs.CV
TL;DR: ReGLA introduces lightweight hybrid networks combining efficient convolutions with ReLU-based gated linear attention for balancing accuracy and latency on high-resolution images, achieving state-of-the-art performance.
Details
Motivation: Transformer-based architectures often suffer from excessive latency on high-resolution images, creating a need for lightweight models that can balance accuracy and computational efficiency for real-world applications.Method: Three key innovations: 1) Efficient Large Receptive Field (ELRF) module for convolutional efficiency with large receptive field, 2) ReLU Gated Modulated Attention (RGMA) for linear complexity with enhanced local features, and 3) multi-teacher distillation strategy for downstream tasks.
Result: ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224px with only 4.98 ms latency at 512px. Outperforms similarly scaled iFormer models by 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation.
Conclusion: ReGLA establishes a state-of-the-art solution for high-resolution visual applications by effectively balancing accuracy and latency through its hybrid architecture combining efficient convolutions with attention mechanisms.
Abstract: Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce \textbf{ReGLA}, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; particularly the ReGLA-M achieves \textbf{80.85%} Top-1 accuracy on ImageNet-1K at $224px$, with only \textbf{4.98 ms} latency at $512px$. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of \textbf{3.1%} AP on COCO object detection and \textbf{3.6%} mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
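The linear-complexity attention at the heart of this design replaces the softmax kernel with a ReLU feature map and gates the output. A single-head sketch is below; head splitting, the ELRF convolutions, and normalization details are omitted, and the shapes and sigmoid gate are illustrative assumptions.

```python
# Minimal sketch of ReLU-kernel gated linear attention in O(N * dim^2) time.
import torch
import torch.nn as nn

class ReLULinearAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.gate = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                 # x: (B, N, dim) tokens
        q = torch.relu(self.q(x))
        k = torch.relu(self.k(x))
        v = self.v(x)
        # Linear attention: compute (K^T V) once instead of the N x N score matrix.
        kv = torch.einsum('bnd,bne->bde', k, v)
        z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + 1e-6)
        attn = torch.einsum('bnd,bde,bn->bne', q, kv, z)
        return self.out(attn * torch.sigmoid(self.gate(x)))  # gated output

tokens = torch.randn(2, 1024, 256)                        # e.g. a 32x32 feature map
out = ReLULinearAttention()(tokens)
```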
[114] Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental Learning
Shengqin Jiang, Xiaoran Feng, Yuankai Qi, Haokui Zhang, Renlong Hang, Qingshan Liu, Lina Yao, Quan Z. Sheng, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: A prototype fine-tuning framework for few-shot class-incremental learning that freezes feature extractors and optimizes decision regions by evolving static prototypes into dynamic, learnable components with dual-calibration offsets.
Details
Motivation: Traditional FSCIL methods use frozen pre-trained feature extractors with static class prototypes, suffering from representation bias. Recent prompt-based tuning methods have limited capacity to assimilate novel information under extreme data scarcity. The paper argues the main challenge is not feature acquisition but optimizing decision regions within a static, high-quality feature space.Method: Proposes freezing the feature extractor while fine-tuning prototypes. Introduces an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components using dual-calibration method with class-specific and task-aware offsets that work synergistically to improve discriminative capacity for incremental classes.
Result: Extensive results demonstrate superior performance across multiple benchmarks while requiring minimal learnable parameters.
Conclusion: Shifting perspective from adapting feature extractors to fine-tuning prototypes within a static feature space is an effective approach for few-shot class-incremental learning, achieving better performance with fewer parameters.
Abstract: Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model’s capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
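The dual-calibration idea, keeping the backbone frozen while learning per-class and per-task offsets on top of static centroids, can be sketched as below. Offset shapes and the cosine classifier are assumptions about the general pattern, not the paper's exact parameterization.

```python
# Minimal sketch of prototype fine-tuning with a frozen backbone: static class
# centroids become learnable via class-specific and task-aware offsets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CalibratedPrototypes(nn.Module):
    def __init__(self, base_prototypes, num_tasks):
        super().__init__()
        self.register_buffer('base', base_prototypes)               # (C, D) frozen centroids
        c, d = base_prototypes.shape
        self.class_offset = nn.Parameter(torch.zeros(c, d))         # class-specific calibration
        self.task_offset = nn.Parameter(torch.zeros(num_tasks, d))  # task-aware calibration

    def forward(self, feats, task_ids):
        # feats: (B, D) from the frozen extractor; task_ids: (C,) task index per class.
        protos = self.base + self.class_offset + self.task_offset[task_ids]
        return F.normalize(feats, dim=-1) @ F.normalize(protos, dim=-1).t()  # (B, C) logits

# Toy usage: 10 base classes plus 5 novel classes across 2 tasks.
base = torch.randn(15, 128)
task_ids = torch.tensor([0] * 10 + [1] * 5)
model = CalibratedPrototypes(base, num_tasks=2)
logits = model(torch.randn(4, 128), task_ids)
```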
[115] Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs
Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu
Main category: cs.CV
TL;DR: Magic-MM-Embedding: Efficient multimodal embedding models using visual token compression and progressive training for universal multimodal retrieval
Details
Motivation: MLLMs show promise for universal multimodal retrieval but face computational bottlenecks from processing many visual tokens, hindering practical deploymentMethod: Two synergistic pillars: (1) Efficient MLLM architecture with visual token compression to reduce latency/memory, (2) Multi-stage progressive training (continue pretraining → contrastive pretraining with hard negatives → task-aware fine-tuning using MLLM-as-a-Judge)
Result: Outperforms existing methods by large margin while being more inference-efficient
Conclusion: Proposed approach achieves both high efficiency and state-of-the-art performance in universal multimodal embedding
Abstract: Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the substantial computational cost incurred from processing a large number of tokens from visual inputs. In this paper, we propose Magic-MM-Embedding, a series of novel models that achieve both high efficiency and state-of-the-art performance in universal multimodal embedding. Our approach is built on two synergistic pillars: (1) a highly efficient MLLM architecture incorporating visual token compression to drastically reduce inference latency and memory footprint, and (2) a multi-stage progressive training strategy designed to not only recover but significantly boost performance. This coarse-to-fine training paradigm begins with extensive continue pretraining to restore multimodal understanding and generation capabilities, progresses to large-scale contrastive pretraining and hard negative mining to enhance discriminative power, and culminates in a task-aware fine-tuning stage guided by an MLLM-as-a-Judge for precise data curation. Comprehensive experiments show that our model outperforms existing methods by a large margin while being more inference-efficient.
[116] Fast-SAM3D: 3Dfy Anything in Images but Faster
Weilun Feng, Mingqiang Wu, Zhiliang Chen, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiaokun Liu, Guoxin Fan, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Main category: cs.CV
TL;DR: Fast-SAM3D accelerates SAM3D’s 3D reconstruction by addressing pipeline heterogeneity through dynamic computation alignment, achieving 2.67× speedup with minimal quality loss.
Details
Motivation: SAM3D enables scalable open-world 3D reconstruction but suffers from prohibitive inference latency. Generic acceleration strategies fail due to the pipeline's inherent multi-level heterogeneity, including kinematic differences between shape/layout, texture refinement sparsity, and spectral variance across geometries.Method: Fast-SAM3D introduces three heterogeneity-aware mechanisms: 1) Modality-Aware Step Caching to decouple structural evolution from layout updates, 2) Joint Spatiotemporal Token Carving to focus refinement on high-entropy regions, and 3) Spectral-Aware Token Aggregation to adapt decoding resolution based on spectral variance.
Result: Extensive experiments show Fast-SAM3D delivers up to 2.67× end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation.
Conclusion: Fast-SAM3D successfully addresses SAM3D’s inference bottlenecks through heterogeneity-aware acceleration, enabling efficient deployment while maintaining reconstruction quality, with code released for community use.
Abstract: SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the \textbf{first systematic investigation} into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline’s inherent multi-level \textbf{heterogeneity}: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present \textbf{Fast-SAM3D}, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) \textit{Modality-Aware Step Caching} to decouple structural evolution from sensitive layout updates; (2) \textit{Joint Spatiotemporal Token Carving} to concentrate refinement on high-entropy regions; and (3) \textit{Spectral-Aware Token Aggregation} to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to \textbf{2.67$\times$} end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released in https://github.com/wlfeng0509/Fast-SAM3D.
[117] FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion
Zhuokun Chen, Jianfei Cai, Bohan Zhuang
Main category: cs.CV
TL;DR: FlashBlock improves block diffusion efficiency by reusing stable attention outputs from tokens outside the current block, reducing computation and KV cache access while maintaining generation quality.
Details
Motivation: Block diffusion improves inference efficiency for long-form content generation, but still suffers from substantial overhead from repeatedly computing attention over growing KV caches. The authors identified cross-step redundancy in attention within blocks as an underexplored optimization opportunity.Method: FlashBlock proposes a cached block-external attention mechanism that reuses stable attention outputs from tokens outside the current block. It identifies that attention outputs from external tokens remain largely stable across diffusion steps, while block-internal attention varies significantly. The method reduces attention computation and KV cache access without modifying the diffusion process and can be combined with sparse attention techniques.
Result: Experiments on diffusion language models and video generation show up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality. The method also improves model accuracy when combined with aggressive sparsification.
Conclusion: FlashBlock effectively addresses efficiency bottlenecks in block diffusion for long-form content generation by exploiting cross-step attention redundancy, offering significant speedups without compromising quality. The approach is orthogonal to existing optimization techniques and can be combined with them for further improvements.
Abstract: Generating long-form content, such as minute-long videos and extended texts, is increasingly important for modern generative models. Block diffusion improves inference efficiency via KV caching and block-wise causal inference and has been widely adopted in diffusion language models and video generation. However, in long-context settings, block diffusion still incurs substantial overhead from repeatedly computing attention over a growing KV cache. We identify an underexplored property of block diffusion: cross-step redundancy of attention within a block. Our analysis shows that attention outputs from tokens outside the current block remain largely stable across diffusion steps, while block-internal attention varies significantly. Based on this observation, we propose FlashBlock, a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Moreover, FlashBlock is orthogonal to sparse attention and can be combined as a complementary residual reuse strategy, substantially improving model accuracy under aggressive sparsification. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to 1.6$\times$ reduction in attention time, with negligible impact on generation quality. Project page: https://caesarhhh.github.io/FlashBlock/.
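The reuse trick can be seen as splitting attention into a block-external part (computed once per block and cached, together with its log-sum-exp normalizer) and a block-internal part recomputed each diffusion step, then merging the two. The single-head sketch below illustrates this; shapes are illustrative assumptions, and treating the cached external part as constant across steps is exactly the approximation the paper exploits.

```python
# Minimal sketch of reusing block-external attention across diffusion steps.
import torch

def partial_attention(q, k, v):
    # q: (B, Nq, D); k, v: (B, Nk, D). Returns output and log-sum-exp per query.
    scores = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5      # (B, Nq, Nk)
    lse = torch.logsumexp(scores, dim=-1)                      # (B, Nq)
    out = torch.softmax(scores, dim=-1) @ v                    # (B, Nq, D)
    return out, lse

def merge(out_a, lse_a, out_b, lse_b):
    # Combine two partial softmax attentions over disjoint key sets.
    w_a = torch.exp(lse_a - torch.logaddexp(lse_a, lse_b)).unsqueeze(-1)
    return w_a * out_a + (1 - w_a) * out_b

B, D, n_ext, n_blk, steps = 1, 64, 512, 32, 4
k_ext, v_ext = torch.randn(B, n_ext, D), torch.randn(B, n_ext, D)
cache = None
for step in range(steps):
    q = torch.randn(B, n_blk, D)                               # current block tokens
    k_blk, v_blk = torch.randn(B, n_blk, D), torch.randn(B, n_blk, D)
    if cache is None:                                          # external part computed once
        cache = partial_attention(q, k_ext, v_ext)
    out_int, lse_int = partial_attention(q, k_blk, v_blk)      # recomputed every step
    out = merge(cache[0], cache[1], out_int, lse_int)          # (B, n_blk, D)
```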
[118] Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning
Dongki Jung, Jaehoon Choi, Adil Qureshi, Somi Jeong, Dinesh Manocha, Suyong Yeon
Main category: cs.CV
TL;DR: Wid3R is a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models, enabling distortion-aware 3D reconstruction from fisheye, panoramic, and 360° imagery without requiring rectification or pinhole assumptions.
Details
Motivation: Prior 3D reconstruction methods are limited to perspective images and pinhole cameras, requiring careful calibration and undistortion. This restricts their applicability in real-world scenarios using fisheye or panoramic cameras with wide field-of-view.Method: Uses a ray representation with spherical harmonics and a novel camera model token within the network architecture to enable distortion-aware 3D reconstruction. Supports feed-forward reconstruction directly from 360° imagery.
Result: Demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D benchmark. First multi-view foundation model supporting feed-forward 3D reconstruction from 360° imagery.
Conclusion: Wid3R provides a generalizable multi-view 3D estimation method that can model wide field-of-view camera types, overcoming limitations of previous approaches that were restricted to perspective images and pinhole camera assumptions.
Abstract: We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360 imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
[119] MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors
Jingdong Zhang, Xiaohang Zhan, Lingzhi Zhang, Yizhou Wang, Zhengming Yu, Jionghao Wang, Wenping Wang, Xin Li
Main category: cs.CV
TL;DR: MTPano is a multi-task panoramic foundation model that addresses panoramic scene understanding challenges using label-free training with perspective priors and a dual-branch architecture for rotation-invariant/variant tasks.
Details
Motivation: Panoramic scene understanding is challenging due to scarcity of high-resolution multi-task annotations, geometric distortions in equirectangular projections, and coordinate system discrepancies when adapting perspective models to panoramic domains.Method: 1) Label-free training using perspective patches from panoramic images with off-the-shelf foundation models for pseudo-labels; 2) Panoramic Dual BridgeNet with geometry-aware modulation layers separating rotation-invariant (depth, segmentation) and rotation-variant (surface normals) tasks; 3) ERP token mixers with dual-branch BridgeNet and gradient truncation; 4) Auxiliary tasks (image gradient, point map) for cross-task learning.
Result: MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
Conclusion: The proposed MTPano framework effectively addresses panoramic scene understanding challenges through label-free training with perspective priors and a carefully designed architecture that handles geometric distortions and task interference.
Abstract: Comprehensive panoramic scene understanding is critical for immersive applications, yet it remains challenging due to the scarcity of high-resolution, multi-task annotations. While perspective foundation models have achieved success through data scaling, directly adapting them to the panoramic domain often fails due to severe geometric distortions and coordinate system discrepancies. Furthermore, the underlying relations between diverse dense prediction tasks in spherical spaces are underexplored. To address these challenges, we propose MTPano, a robust multi-task panoramic foundation model established by a label-free training pipeline. First, to circumvent data scarcity, we leverage powerful perspective dense priors. We project panoramic images into perspective patches to generate accurate, domain-gap-free pseudo-labels using off-the-shelf foundation models, which are then re-projected to serve as patch-wise supervision. Second, to tackle the interference between task types, we categorize tasks into rotation-invariant (e.g., depth, segmentation) and rotation-variant (e.g., surface normals) groups. We introduce the Panoramic Dual BridgeNet, which disentangles these feature streams via geometry-aware modulation layers that inject absolute position and ray direction priors. To handle the distortion from equirectangular projections (ERP), we incorporate ERP token mixers followed by a dual-branch BridgeNet for interactions with gradient truncation, facilitating beneficial cross-task information sharing while blocking conflicting gradients from incompatible task attributes. Additionally, we introduce auxiliary tasks (image gradient, point map, etc.) to fertilize the cross-task learning process. Extensive experiments demonstrate that MTPano achieves state-of-the-art performance on multiple benchmarks and delivers competitive results against task-specific panoramic specialist foundation models.
[120] Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
Yongwoo Kim, Sungmin Cha, Hyunsoo Kim, Jaewon Lee, Donghyun Kim
Main category: cs.CV
TL;DR: PAIR framework for concept erasure in text-to-image diffusion models that uses unsafe-safe pairs to preserve structural and semantic consistency while removing harmful content.
Details
Motivation: Existing concept erasure approaches focus only on removing unsafe concepts without providing guidance toward safe alternatives, leading to failure in preserving structural and semantic consistency between original and erased generations.Method: PAIRed Erasing (PAIR) framework with two key components: (1) Paired Semantic Realignment using unsafe-safe pairs to map target concepts to semantically aligned safe anchors, and (2) Fisher-weighted Initialization for DoRA that initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs to encourage safe alternatives while suppressing unsafe concepts.
Result: Extensive experiments show the approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
Conclusion: PAIR framework successfully reframes concept erasure from simple removal to consistency-preserving semantic realignment, enabling fine-grained erasure that removes only targeted concepts while maintaining overall semantic consistency.
Abstract: With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
[121] Learning with Adaptive Prototype Manifolds for Out-of-Distribution Detection
Ningkang Peng, JiuTao Zhou, Yuhao Zhang, Xiaoqian Peng, Qianfeng Yu, Linjing Qian, Tingyu Lu, Yi Chen, Yanhui Gu
Main category: cs.CV
TL;DR: APEX introduces adaptive prototype learning with MDL-based complexity determination and posterior-aware scoring to address fundamental flaws in existing OOD detection methods.
Details
Motivation: Existing prototype-based OOD detection methods suffer from two fundamental flaws: 1) Static Homogeneity Assumption (fixed representational resources for all classes), and 2) Learning-Inference Disconnect (discarding prototype quality knowledge at inference). These limitations constrain model capacity and performance.Method: APEX uses a Two-Stage Repair process: 1) Adaptive Prototype Manifold (APM) that applies Minimum Description Length principle to automatically determine optimal prototype complexity K_c* per class, resolving prototype collision; 2) Posterior-Aware OOD Scoring (PAOS) that quantifies prototype quality (cohesion and separation) to bridge learning-inference disconnect.
Result: Comprehensive experiments on benchmarks like CIFAR-100 show APEX achieves new state-of-the-art performance in OOD detection.
Conclusion: APEX successfully addresses fundamental limitations in prototype-based OOD detection through adaptive prototype complexity determination and quality-aware scoring, demonstrating superior performance.
Abstract: Out-of-distribution (OOD) detection is a critical task for the safe deployment of machine learning models in the real world. Existing prototype-based representation learning methods have demonstrated exceptional performance. Specifically, we identify two fundamental flaws that universally constrain these methods: the Static Homogeneity Assumption (fixed representational resources for all classes) and the Learning-Inference Disconnect (discarding rich prototype quality knowledge at inference). These flaws fundamentally limit the model’s capacity and performance. To address these issues, we propose APEX (Adaptive Prototype for eXtensive OOD Detection), a novel OOD detection framework designed via a Two-Stage Repair process to optimize the learned feature manifold. APEX introduces two key innovations to address these respective flaws: (1) an Adaptive Prototype Manifold (APM), which leverages the Minimum Description Length (MDL) principle to automatically determine the optimal prototype complexity $K_c^*$ for each class, thereby fundamentally resolving prototype collision; and (2) a Posterior-Aware OOD Scoring (PAOS) mechanism, which quantifies prototype quality (cohesion and separation) to bridge the learning-inference disconnect. Comprehensive experiments on benchmarks such as CIFAR-100 validate the superiority of our method, where APEX achieves new state-of-the-art performance.
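Choosing a per-class prototype count by a description-length criterion can be illustrated by fitting Gaussian mixtures of increasing size and keeping the one with the lowest BIC, used here as a stand-in for the paper's MDL objective. Feature dimensionality and the K range are illustrative assumptions.

```python
# Minimal sketch of adaptive prototype-count selection via a BIC/MDL-style criterion.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_prototype_count(class_feats, max_k=6):
    # class_feats: (N, D) embeddings of one in-distribution class.
    best_k, best_bic, best_means = 1, np.inf, None
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='diag', random_state=0)
        gmm.fit(class_feats)
        bic = gmm.bic(class_feats)                 # description-length style penalty
        if bic < best_bic:
            best_k, best_bic, best_means = k, bic, gmm.means_
    return best_k, best_means                      # adaptive prototypes for this class

# Toy usage: a class whose features form two clusters should get K around 2.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (200, 16)), rng.normal(5, 1, (200, 16))])
k_star, prototypes = select_prototype_count(feats)
print(k_star)
```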
[122] Multimodal Latent Reasoning via Hierarchical Visual Cues Injection
Yiming Zhang, Qiangyu Yan, Borui Jiang, Kai Han
Main category: cs.CV
TL;DR: HIVE introduces a multimodal latent reasoning framework that enables “slow thinking” through hierarchical visual cues injection, allowing models to perform grounded reasoning in aligned latent space without verbose textual rationales.
Details
Motivation: Current multimodal LLMs use "fast thinking" paradigms with end-to-end generation or language-centric chains of thought, which are inefficient, verbose, and prone to hallucinations. There's a need for more robust reasoning that evolves within latent space with seamless multimodal integration.
Method: Proposes HIVE framework that recursively extends transformer blocks to create internal loops for iterative reasoning refinement. It injectively grounds the process with hierarchical visual cues from global scene context to fine-grained regional details directly into latent representations, enabling multi-step inference in aligned latent space.
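A hypothetical sketch of the looping idea in PyTorch; the additive cue injection, loop count, and module names are assumptions rather than the paper's actual design.

# Hypothetical sketch: refine latent tokens by looping one block while
# injecting visual cues ordered from coarse (global) to fine (regional).
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops

    def forward(self, tokens, visual_cues):
        # tokens: (B, T, D); visual_cues: list of (B, T, D) cue tensors
        h = tokens
        for i in range(self.n_loops):
            cue = visual_cues[min(i, len(visual_cues) - 1)]
            h = self.block(h + cue)      # inject a cue, then refine in latent space
        return h

# h = LatentReasoner()(torch.randn(2, 16, 256), [torch.randn(2, 16, 256)] * 2)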
Result: Extensive evaluations show test-time scaling is effective when incorporating vision knowledge, and integrating hierarchical information significantly enhances model understanding of complex scenes.
Conclusion: HIVE enables “slow thinking” in multimodal models through latent reasoning with hierarchical visual grounding, improving efficiency and reducing hallucinations compared to language-centric reasoning approaches.
Abstract: The advancement of multimodal large language models (MLLMs) has enabled impressive perception capabilities. However, their reasoning process often remains a “fast thinking” paradigm, reliant on end-to-end generation or explicit, language-centric chains of thought (CoT), which can be inefficient, verbose, and prone to hallucination. This work posits that robust reasoning should evolve within a latent space, integrating multimodal signals seamlessly. We propose multimodal latent reasoning via HIerarchical Visual cuEs injection (\emph{HIVE}), a novel framework that instills deliberate, “slow thinking” without depending on superficial textual rationales. Our method recursively extends transformer blocks, creating an internal loop for iterative reasoning refinement. Crucially, it injectively grounds this process with hierarchical visual cues from global scene context to fine-grained regional details directly into the model’s latent representations. This enables the model to perform grounded, multi-step inference entirely in the aligned latent space. Extensive evaluations demonstrate that test-time scaling is effective when incorporating vision knowledge, and that integrating hierarchical information significantly enhances the model’s understanding of complex scenes.
[123] Breaking Semantic Hegemony: Decoupling Principal and Residual Subspaces for Generalized OOD Detection
Ningkang Peng, Xiaoqian Peng, Yuhao Zhang, Qianfeng Yu, Feng Xing, Peirong Ma, Xichen Yang, Yi Chen, Tingyu Lu, Yanhui Gu
Main category: cs.CV
TL;DR: D-KNN addresses the “Simplicity Paradox” in OOD detection where SOTA models detect subtle semantic OOD samples but fail on structurally distinct simple samples or sensor noise, using orthogonal decomposition to separate semantic and structural features.
Details
Motivation: The paper identifies a counter-intuitive "Simplicity Paradox" in current OOD detection models: they excel at detecting semantically subtle OOD samples but suffer from "Geometric Blindness" when facing structurally distinct yet semantically simple samples or high-frequency sensor noise. This is attributed to "Semantic Hegemony" in deep feature spaces where semantic information dominates structural signals.
Method: Proposes D-KNN, a training-free, plug-and-play geometric decoupling framework. It uses orthogonal decomposition to explicitly separate semantic components from structural residuals in feature space. A dual-space calibration mechanism reactivates sensitivity to weak residual signals by analyzing both principal and residual subspaces.
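The decoupling step can be pictured as splitting each feature into its projection onto a PCA principal subspace (a proxy for the semantic part) and the residual (the structural part), then scoring with kNN distances in both; the fusion weight below is an illustrative placeholder, not the paper's calibration.

# Minimal sketch of principal/residual decoupling with dual kNN scoring.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def fit_dual_space(train_feats, n_components=64):
    pca = PCA(n_components=n_components).fit(train_feats)
    P = pca.components_                      # (k, D) orthonormal rows
    proj = train_feats @ P.T                 # principal-subspace coordinates
    resid = train_feats - proj @ P           # residual-subspace component
    knn_p = NearestNeighbors(n_neighbors=10).fit(proj)
    knn_r = NearestNeighbors(n_neighbors=10).fit(resid)
    return P, knn_p, knn_r

def ood_score(x, P, knn_p, knn_r, alpha=0.5):
    proj = x @ P.T
    resid = x - proj @ P
    d_p, _ = knn_p.kneighbors(proj)
    d_r, _ = knn_r.kneighbors(resid)
    return alpha * d_p[:, -1] + (1 - alpha) * d_r[:, -1]   # larger = more OOD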
Result: D-KNN establishes new SOTA performance on CIFAR and ImageNet benchmarks. It reduces FPR95 from 31.3% to 2.3% for the Simplicity Paradox case, and boosts AUROC from 79.7% to 94.9% for sensor noise detection. The method effectively breaks Semantic Hegemony and improves geometric sensitivity.
Conclusion: The paper successfully addresses the Simplicity Paradox in OOD detection by revealing and mitigating Semantic Hegemony through geometric decoupling. D-KNN provides a training-free solution that enhances model sensitivity to structural distribution shifts while maintaining semantic discrimination capabilities.
Abstract: While feature-based post-hoc methods have made significant strides in Out-of-Distribution (OOD) detection, we uncover a counter-intuitive Simplicity Paradox in existing state-of-the-art (SOTA) models: these models exhibit keen sensitivity in distinguishing semantically subtle OOD samples but suffer from severe Geometric Blindness when confronting structurally distinct yet semantically simple samples or high-frequency sensor noise. We attribute this phenomenon to Semantic Hegemony within the deep feature space and reveal its mathematical essence through the lens of Neural Collapse. Theoretical analysis demonstrates that the spectral concentration bias, induced by the high variance of the principal subspace, numerically masks the structural distribution shift signals that should be significant in the residual subspace. To address this issue, we propose D-KNN, a training-free, plug-and-play geometric decoupling framework. This method utilizes orthogonal decomposition to explicitly separate semantic components from structural residuals and introduces a dual-space calibration mechanism to reactivate the model’s sensitivity to weak residual signals. Extensive experiments demonstrate that D-KNN effectively breaks Semantic Hegemony, establishing new SOTA performance on both CIFAR and ImageNet benchmarks. Notably, in resolving the Simplicity Paradox, it reduces the FPR95 from 31.3% to 2.3%; when addressing sensor failures such as Gaussian noise, it boosts the detection performance (AUROC) from a baseline of 79.7% to 94.9%.
[124] Imagine a City: CityGenAgent for Procedural 3D City Generation
Zishan Liu, Zecong Tang, RuoCheng Wu, Xinzhe Zheng, Jingyu Hu, Ka-Hei Hui, Haoran Xie, Bo Dai, Zhengzhe Liu
Main category: cs.CV
TL;DR: CityGenAgent: A natural language-driven framework for hierarchical procedural generation of high-quality 3D cities using interpretable block and building programs with two-stage learning (SFT + RL).
Details
Motivation: Automated generation of interactive 3D cities is critical for applications like autonomous driving, VR, and embodied intelligence, but existing methods struggle with high-fidelity asset creation, controllability, and manipulation.
Method: Decomposes city generation into Block Program and Building Program components. Uses two-stage learning: 1) Supervised Fine-Tuning to generate valid programs adhering to schema constraints, 2) Reinforcement Learning with Spatial Alignment Reward and Visual Consistency Reward to enhance spatial reasoning and bridge text-visual gaps.
Result: Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
Conclusion: CityGenAgent provides a natural language-driven framework for high-quality 3D city generation with improved controllability and manipulation capabilities, supporting natural language editing.
Abstract: The automated generation of interactive 3D cities is a critical challenge with broad applications in autonomous driving, virtual reality, and embodied intelligence. While recent advances in generative models and procedural techniques have improved the realism of city generation, existing methods often struggle with high-fidelity asset creation, controllability, and manipulation. In this work, we introduce CityGenAgent, a natural language-driven framework for hierarchical procedural generation of high-quality 3D cities. Our approach decomposes city generation into two interpretable components, Block Program and Building Program. To ensure structural correctness and semantic alignment, we adopt a two-stage learning strategy: (1) Supervised Fine-Tuning (SFT). We train BlockGen and BuildingGen to generate valid programs that adhere to schema constraints, including non-self-intersecting polygons and complete fields; (2) Reinforcement Learning (RL). We design Spatial Alignment Reward to enhance spatial reasoning ability and Visual Consistency Reward to bridge the gap between textual descriptions and the visual modality. Benefiting from the programs and the models’ generalization, CityGenAgent supports natural language editing and manipulation. Comprehensive evaluations demonstrate superior semantic alignment, visual quality, and controllability compared to existing methods, establishing a robust foundation for scalable 3D city generation.
[125] SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
Xiaoxuan He, Siming Fu, Wanli Li, Zhiyuan Li, Dacheng Yin, Kang Rong, Fengyun Rao, Bo Zhang
Main category: cs.CV
TL;DR: SAIL enables diffusion models to self-improve with minimal human feedback through iterative self-annotation and refinement, reducing preference data needs by 94%.
Details
Motivation: Aligning diffusion models with human preferences is challenging due to the need for large-scale preference datasets or reward models, which are expensive to obtain. The paper explores whether minimal human feedback can unlock latent self-improvement capabilities within diffusion models themselves.
Method: SAIL (Self-Amplified Iterative Learning) is a closed-loop framework where diffusion models act as their own teachers. Starting from a small seed set of human-annotated preference pairs, the model progressively generates diverse samples, self-annotates preferences based on evolving understanding, and refines itself using this self-augmented dataset. A ranked preference mixup strategy balances exploration with adherence to initial human priors to prevent catastrophic forgetting.
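A schematic skeleton of the closed loop described above; every callable on the model object (generate, self_annotate, finetune) is a hypothetical placeholder, not SAIL's API.

# Schematic only: iterative self-annotation with a mixup of human-seeded pairs
# so the model does not drift away from the initial human priors.
import random

def sail_rounds(model, seed_pairs, n_rounds=3, mix=0.3):
    pairs = list(seed_pairs)                      # small human-annotated seed set
    for _ in range(n_rounds):
        samples = model.generate()                # explore diverse generations
        new_pairs = model.self_annotate(samples)  # model ranks its own samples
        # Ranked preference mixup: keep a share of the human-seeded pairs.
        train_set = new_pairs + random.sample(pairs, int(mix * len(pairs)))
        model = model.finetune(train_set)
        pairs += new_pairs
    return model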
Result: SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using only 6% of the preference data required by existing approaches, demonstrating that diffusion models possess remarkable self-improvement capabilities.
Conclusion: Diffusion models have inherent self-improvement capabilities that, when properly harnessed through frameworks like SAIL, can effectively replace both large-scale human annotation and external reward models for alignment with human preferences.
Abstract: Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
[126] VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis
Main category: cs.CV
TL;DR: VRIQ benchmark reveals current Vision Language Models struggle with visual reasoning, achieving only 28% accuracy on abstract puzzles and 45% on natural images, with perception failures accounting for 56% of errors.
Details
Motivation: To assess whether Vision Language Models (VLMs) can reliably perform nonverbal reasoning, as recent progress raises questions about their true visual reasoning capabilities beyond language understanding.
Method: Introduces VRIQ benchmark with abstract puzzle-style and natural-image reasoning tasks, evaluates models, uses tool-augmented reasoning, and introduces diagnostic probes targeting perception and reasoning failures.
Result: Models perform near random (28%) on abstract puzzles and weakly (45%) on natural tasks. Diagnostic analysis shows 56% failures from perception alone, 43% from both perception and reasoning, and only 1% from reasoning alone.
Conclusion: Current VLMs remain unreliable abstract reasoners primarily due to perception limitations, not reasoning deficits, providing a principled basis for improving visual reasoning in multimodal systems.
Abstract: Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning alone. This motivates us to design fine-grained diagnostic probe questions targeting specific perception categories (e.g., shape, count, position, 3D/depth), revealing that certain categories cause more failures than others. Our benchmark and analysis establish that current VLMs, even with visual reasoning tools, remain unreliable abstract reasoners, mostly due to perception limitations, and offer a principled basis for improving visual reasoning in multimodal systems.
[127] Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting
Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, Can Huang
Main category: cs.CV
TL;DR: Dolphin-v2 is an enhanced two-stage document image parsing model that improves handling of both digital-born and photographed documents through joint document type classification, finer-grained element detection, and hybrid parsing strategies.
Details
Motivation: The document parsing field is fragmented across specialized models requiring complex selection, and existing approaches fail to handle distorted/photographed documents effectively due to reliance on axis-aligned bounding boxes.
Method: Two-stage approach: 1) Joint document type classification (digital-born vs photographed) with layout analysis; for digital-born docs, finer-grained element detection with reading order prediction. 2) Hybrid parsing: photographed docs parsed holistically as complete pages, digital-born docs undergo element-wise parallel parsing guided by layout anchors.
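The hybrid routing can be summarized in a few lines of control flow; classify_page, parse_whole_page, and parse_element are hypothetical stand-ins for the two model stages, not Dolphin-v2's interface.

# Illustrative routing only: stage 1 classifies the page and detects layout,
# stage 2 parses holistically or element-wise depending on the document type.
def parse_document(page_image):
    doc_type, layout = classify_page(page_image)   # stage 1: type + layout anchors
    if doc_type == "photographed":
        return parse_whole_page(page_image)        # holistic parse handles distortion
    # digital-born: parse detected elements in parallel, then restore reading order
    results = [parse_element(page_image, box) for box in layout.elements]
    return sorted(results, key=lambda r: r.reading_order)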
Result: Substantial improvements: +14.78 points overall on OmniDocBench, 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing. Also introduces 21 element categories with semantic attribute extraction and code block recognition with indentation preservation.
Conclusion: Dolphin-v2 significantly advances document parsing capabilities by addressing key limitations of existing approaches, particularly for photographed documents, while providing finer-grained analysis and maintaining efficiency.
Abstract: Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
[128] Parallel Swin Transformer-Enhanced 3D MRI-to-CT Synthesis for MRI-Only Radiotherapy Planning
Zolnamar Dorjsembe, Hung-Yi Chen, Furen Xiao, Hsing-Kuo Pao
Main category: cs.CV
TL;DR: A 3D transformer-based architecture for generating synthetic CT from MRI to enable MRI-only radiotherapy planning, improving anatomical fidelity and dosimetric accuracy.
Details
Motivation: Current radiotherapy workflows require both MRI and CT scans due to MRI's lack of electron density information needed for dose calculation, leading to registration errors and procedural complexity. Synthetic CT generation from MRI alone could streamline workflows but faces challenges from nonlinear MRI-CT relationships and anatomical variability.
Method: Proposes Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture combining convolutional encoding with dual Swin Transformer branches to capture both local anatomical details and long-range contextual dependencies. Uses multi-scale shifted window attention with hierarchical feature aggregation for improved anatomical fidelity.
Result: Experiments on public and clinical datasets show higher image similarity and improved geometric accuracy compared to baseline methods. Dosimetric evaluation demonstrates clinically acceptable performance with mean target dose error of 1.69%.
Conclusion: The proposed transformer-based architecture effectively generates synthetic CT from MRI, enabling MRI-only radiotherapy planning with improved anatomical fidelity and clinically acceptable dosimetric accuracy.
Abstract: MRI provides superior soft tissue contrast without ionizing radiation; however, the absence of electron density information limits its direct use for dose calculation. As a result, current radiotherapy workflows rely on combined MRI and CT acquisitions, increasing registration uncertainty and procedural complexity. Synthetic CT generation enables MRI only planning but remains challenging due to nonlinear MRI-CT relationships and anatomical variability. We propose Parallel Swin Transformer-Enhanced Med2Transformer, a 3D architecture that integrates convolutional encoding with dual Swin Transformer branches to model both local anatomical detail and long-range contextual dependencies. Multi-scale shifted window attention with hierarchical feature aggregation improves anatomical fidelity. Experiments on public and clinical datasets demonstrate higher image similarity and improved geometric accuracy compared with baseline methods. Dosimetric evaluation shows clinically acceptable performance, with a mean target dose error of 1.69%. Code is available at: https://github.com/mobaidoctor/med2transformer.
[129] Dataset Distillation via Relative Distribution Matching and Cognitive Heritage
Qianxin Xia, Jiawei Du, Yuhan Zhang, Jielei Wang, Guoming Lu
Main category: cs.CV
TL;DR: Dataset distillation method using statistical flow matching for efficient supervised learning with pre-trained models, achieving state-of-the-art performance with 10x lower memory and 4x faster runtime.
Details
Motivation: Previous dataset distillation methods for classification with pre-trained models suffer from high computational and memory overhead due to batch-level gradient matching requiring thousands of real images and multiple augmentation rounds per step.
Method: Proposes statistical flow matching that aligns constant statistical flows from target class centers to non-target class centers in original data, loading raw statistics only once and performing single augmentation pass. Also introduces classifier inheritance strategy reusing original classifier with lightweight linear projector.
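One plausible reading of the statistical-flow idea, shown only for orientation: precompute real class centers once, then penalize synthetic features whose flows toward non-target centers deviate from the constant real center-to-center flows. The feature space and weighting here are assumptions, not the paper's exact objective.

# Sketch: constant real flows (center-to-center) versus flows from synthetic features.
import numpy as np

def center_flows(real_feats, real_labels, n_classes):
    return np.stack([real_feats[real_labels == c].mean(0) for c in range(n_classes)])

def flow_matching_loss(syn_feats, syn_labels, centers):
    loss = 0.0
    for f, y in zip(syn_feats, syn_labels):
        for c in range(len(centers)):
            if c == y:
                continue
            real_flow = centers[c] - centers[y]   # constant, precomputed once
            syn_flow = centers[c] - f             # flow from the synthetic sample
            loss += np.sum((syn_flow - real_flow) ** 2)
    return loss / len(syn_feats)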
Result: Achieves comparable or better performance than state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Classifier inheritance provides substantial performance gains with minimal storage.
Conclusion: Statistical flow matching offers stable and efficient supervised learning framework for dataset distillation with pre-trained models, significantly reducing computational overhead while maintaining or improving performance.
Abstract: Dataset distillation seeks to synthesize a highly compact dataset that achieves performance comparable to the original dataset on downstream tasks. For classification tasks that use pre-trained self-supervised models as backbones, previous linear gradient matching optimizes synthetic images by encouraging them to mimic the gradient updates induced by real images on the linear classifier. However, this batch-level formulation requires loading thousands of real images and applying multiple rounds of differentiable augmentations to synthetic images at each distillation step, leading to substantial computational and memory overhead. In this paper, we introduce statistical flow matching, a stable and efficient supervised learning framework that optimizes synthetic images by aligning constant statistical flows from target class centers to non-target class centers in the original data. Our approach loads raw statistics only once and performs a single augmentation pass on the synthetic data, achieving performance comparable to or better than the state-of-the-art methods with 10x lower GPU memory usage and 4x shorter runtime. Furthermore, we propose a classifier inheritance strategy that reuses the classifier trained on the original dataset for inference, requiring only an extremely lightweight linear projector and marginal storage while achieving substantial performance gains.
[130] Explainable Pathomics Feature Visualization via Correlation-aware Conditional Feature Editing
Yuechen Yang, Junlin Guo, Ruining Deng, Junchao Zhu, Zhengyi Lu, Chongyu Qu, Yanfan Zhu, Xingyi Guo, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo
Main category: cs.CV
TL;DR: MAD framework enables biologically plausible cell nuclei editing by regularizing feature trajectories within a learned manifold to handle correlated pathomics features, using VAE for disentangled latent space and conditional diffusion for synthesis.
Details
Motivation: Pathomics provides quantitative features for explainable biomarkers in digital pathology, but many features are difficult to interpret across clinical contexts. Existing conditional diffusion models assume feature independence, which is violated by correlated pathomics features, leading to unrealistic artifacts when editing features.
Method: Proposes Manifold-Aware Diffusion (MAD) framework with variational auto-encoder (VAE) to learn disentangled latent space, regularizing feature trajectories within the learned biological manifold. Optimized features then guide a conditional diffusion model to synthesize high-fidelity cell nuclei images while maintaining biological plausibility.
Result: Experiments show MAD successfully navigates the manifold of pathomics features during editing, outperforming baseline methods in conditional feature editing while preserving structural coherence and avoiding unrealistic artifacts.
Conclusion: MAD enables controllable and biologically plausible cell nuclei editing by addressing feature correlations through manifold-aware regularization, advancing explainability in pathomics and digital pathology applications.
Abstract: Pathomics is a recent approach that offers rich quantitative features beyond what black-box deep learning can provide, supporting more reproducible and explainable biomarkers in digital pathology. However, many derived features (e.g., “second-order moment”) remain difficult to interpret, especially across different clinical contexts, which limits their practical adoption. Conditional diffusion models show promise for explainability through feature editing, but they typically assume feature independence, an assumption violated by intrinsically correlated pathomics features. Consequently, editing one feature while fixing others can push the model off the biological manifold and produce unrealistic artifacts. To address this, we propose a Manifold-Aware Diffusion (MAD) framework for controllable and biologically plausible cell nuclei editing. Unlike existing approaches, our method regularizes feature trajectories within a disentangled latent space learned by a variational auto-encoder (VAE). This ensures that manipulating a target feature automatically adjusts correlated attributes to remain within the learned distribution of real cells. These optimized features then guide a conditional diffusion model to synthesize high-fidelity images. Experiments demonstrate that our approach is able to navigate the manifold of pathomics features when editing those features. The proposed method outperforms baseline methods in conditional feature editing while preserving structural coherence.
[131] TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions
Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Son Hong Phan, Quoc Pham-Nam Ho, Chi Dai Tran, Trinh Le Ba Khanh, Jae Wook Jeon
Main category: cs.CV
TL;DR: TSBOW is a comprehensive traffic surveillance dataset for occluded vehicle detection under diverse weather conditions, addressing limitations of existing datasets that lack extreme weather scenarios.
Details
Motivation: Global warming increases extreme weather events that degrade CCTV quality and disrupt traffic flow, raising accident rates. Existing datasets are limited to light weather conditions and fail to capture extreme scenarios needed for robust traffic monitoring systems.
Method: Created TSBOW dataset with over 32 hours of real-world traffic data from urban areas, including 48,000 manually annotated and 3.2 million semi-labeled frames with bounding boxes for eight traffic participant classes. Established object detection benchmark to evaluate performance under occlusions and adverse weather.
Result: Dataset includes diverse weather conditions, road types, scales, and viewpoints. Benchmark highlights challenges of occlusions and adverse weather for object detection. Dataset is publicly available for research and application development.
Conclusion: TSBOW serves as critical resource for advancing Intelligent Transportation Systems, demonstrating potential of CCTV-based traffic monitoring and enabling new research in occluded vehicle detection under extreme weather conditions.
Abstract: Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames; bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring, pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.
[132] VMF-GOS: Geometry-guided virtual Outlier Synthesis for Long-Tailed OOD Detection
Ningkang Peng, Qianfeng Yu, Yuhao Zhang, Yafei Liu, Xiaoqian Peng, Peirong Ma, Yi Chen, Peiheng Li, Yanhui Gu
Main category: cs.CV
TL;DR: A data-free OOD detection method using geometry-guided virtual outlier synthesis on hypersphere feature space, eliminating need for external datasets while achieving SOTA performance on long-tailed distributions.
Details
Motivation: Current OOD detection methods for long-tailed distributions rely on large external datasets (like 80M Tiny Images) which are impractical due to cost and privacy concerns. Need a data-free approach that maintains performance without external data.
Method: Proposes Geometry-guided virtual Outlier Synthesis (GOS) using von Mises-Fisher distribution on hypersphere to model statistical properties. Locates low-likelihood annulus in feature space and performs directional sampling of virtual outliers. Uses Dual-Granularity Semantic Loss (DGS) with contrastive learning to maximize distinction between ID features and synthesized boundary outliers.
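A simplified stand-in for the geometry-guided sampling: construct unit vectors at a prescribed angular band (the "annulus") around a class mean direction. Proper vMF sampling and the paper's likelihood thresholds are replaced here by an explicit angle range, purely for illustration.

# Sketch: sample virtual outlier directions at a fixed angular band around
# a class mean direction on the unit hypersphere.
import numpy as np

def sample_virtual_outliers(class_feats, n_samples=32, min_deg=40, max_deg=70, seed=0):
    rng = np.random.default_rng(seed)
    mu = class_feats.mean(0)
    mu /= np.linalg.norm(mu)                          # mean direction on the hypersphere
    thetas = np.radians(rng.uniform(min_deg, max_deg, n_samples))
    outs = []
    for th in thetas:
        t = rng.normal(size=mu.shape)
        t -= (t @ mu) * mu                            # tangent vector orthogonal to mu
        t /= np.linalg.norm(t)
        outs.append(np.cos(th) * mu + np.sin(th) * t) # unit vector at angle th from mu
    return np.stack(outs)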
Result: Extensive experiments on CIFAR-LT benchmarks show the method outperforms state-of-the-art approaches that use external real images, demonstrating superior OOD detection performance without external data.
Conclusion: The proposed data-free framework successfully eliminates reliance on external datasets while maintaining superior OOD detection performance on long-tailed distributions through geometry-guided virtual outlier synthesis and dual-granularity semantic regularization.
Abstract: Out-of-Distribution (OOD) detection under long-tailed distributions is a highly challenging task because the scarcity of samples in tail classes leads to blurred decision boundaries in the feature space. Current state-of-the-art (sota) methods typically employ Outlier Exposure (OE) strategies, relying on large-scale real external datasets (such as 80 Million Tiny Images) to regularize the feature space. However, this dependence on external data often becomes infeasible in practical deployment due to high data acquisition costs and privacy sensitivity. To this end, we propose a novel data-free framework aimed at completely eliminating reliance on external datasets while maintaining superior detection performance. We introduce a Geometry-guided virtual Outlier Synthesis (GOS) strategy that models statistical properties using the von Mises-Fisher (vMF) distribution on a hypersphere. Specifically, we locate a low-likelihood annulus in the feature space and perform directional sampling of virtual outliers in this region. Simultaneously, we introduce a new Dual-Granularity Semantic Loss (DGS) that utilizes contrastive learning to maximize the distinction between in-distribution (ID) features and these synthesized boundary outliers. Extensive experiments on benchmarks such as CIFAR-LT demonstrate that our method outperforms sota approaches that utilize external real images.
[133] Disco: Densely-overlapping Cell Instance Segmentation via Adjacency-aware Collaborative Coloring
Rui Sun, Yiwen Yang, Kaiyu Guo, Chen Jiang, Dongli Xu, Zhaonan Liu, Tan Pan, Limei Han, Xue Jiang, Wu Wei, Yuan Cheng
Main category: cs.CV
TL;DR: A new framework for cell instance segmentation using graph coloring with explicit marking and implicit disambiguation to handle dense overlapping cells in complex tissues.
Details
Motivation: Existing cell instance segmentation methods struggle with complex, dense cellular regions. Graph coloring approaches show promise but haven't been validated in real-world scenarios with dense overlaps and complex topologies.
Method: Disco framework combines data-driven topological labeling with constrained deep learning: 1) “Explicit Marking” recursively decomposes cell graphs to isolate conflict sets, 2) “Implicit Disambiguation” enforces feature dissimilarity between instances in conflict regions.
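In the same spirit as the paper's chromatic analysis, a small utility can 2-color a cell adjacency graph and collect the cells on edges that break bipartiteness (odd cycles), giving a rough "conflict set"; purely illustrative, not the paper's decomposition.

# Sketch: BFS 2-coloring of a cell adjacency graph; edges that receive the same
# color on both ends witness an odd cycle and their cells go into a conflict set.
from collections import deque

def two_color_conflicts(adjacency):
    """adjacency: dict node -> iterable of neighbouring nodes."""
    color, conflicts = {}, set()
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:       # odd cycle detected on edge (u, v)
                    conflicts.update((u, v))
    return color, conflicts

# Example: a triangle is non-bipartite, so all three cells land in the conflict set.
# two_color_conflicts({0: [1, 2], 1: [0, 2], 2: [0, 1]})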
Result: Analysis reveals most real-world cell graphs are non-bipartite with high prevalence of odd-length cycles (triangles), making simple 2-coloring insufficient. The GBC-FS 2025 dataset is released for complex nuclear arrangements.
Conclusion: The proposed Disco framework effectively handles complex cell adjacency conflicts through collaborative coloring, addressing limitations of existing methods in dense overlapping scenarios.
Abstract: Accurate cell instance segmentation is foundational for digital pathology analysis. Existing methods based on contour detection and distance mapping still face significant challenges in processing complex and dense cellular regions. Graph coloring-based methods provide a new paradigm for this task, yet the effectiveness of this paradigm in real-world scenarios with dense overlaps and complex topologies has not been verified. Addressing this issue, we release a large-scale dataset GBC-FS 2025, which contains highly complex and dense sub-cellular nuclear arrangements. We conduct the first systematic analysis of the chromatic properties of cell adjacency graphs across four diverse datasets and reveal an important discovery: most real-world cell graphs are non-bipartite, with a high prevalence of odd-length cycles (predominantly triangles). This makes simple 2-coloring theory insufficient for handling complex tissues, while higher-chromaticity models would cause representational redundancy and optimization difficulties. Building on this observation of complex real-world contexts, we propose Disco (Densely-overlapping Cell Instance Segmentation via Adjacency-aware COllaborative Coloring), an adjacency-aware framework based on the “divide and conquer” principle. It uniquely combines a data-driven topological labeling strategy with a constrained deep learning system to resolve complex adjacency conflicts. First, “Explicit Marking” strategy transforms the topological challenge into a learnable classification task by recursively decomposing the cell graph and isolating a “conflict set.” Second, “Implicit Disambiguation” mechanism resolves ambiguities in conflict regions by enforcing feature dissimilarity between different instances, enabling the model to learn separable feature representations.
[134] NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks
Pengcheng Chen, Yue Hu, Wenhao Li, Nicole M Gunderson, Andrew Feng, Zhenglong Sun, Peter Beerel, Eric J Seibel
Main category: cs.CV
TL;DR: NeVStereo: A unified framework combining NeRF-based novel view synthesis with stereo depth estimation for joint camera pose estimation, multi-view depth, rendering, and surface reconstruction from casual RGB-only inputs.
Details
Motivation: Current dense 3D reconstruction systems either focus on end-to-end matching/geometry prediction without explicit novel view synthesis, or rely on neural rendering approaches that assume fixed camera poses and are sensitive to pose errors. There's a need for a single framework that can provide accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views.
Method: NeVStereo combines NeRF-based novel view synthesis for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and iterative refinement that updates both depth and radiance field to improve geometric consistency. This design mitigates common NeRF issues like surface stacking, artifacts, and pose-depth coupling.
Result: Across indoor, outdoor, tabletop, and aerial benchmarks, NeVStereo achieves consistently strong zero-shot performance with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing methods.
Conclusion: NeVStereo successfully integrates neural rendering with stereo depth estimation to create a unified framework that delivers accurate camera poses, multi-view depth, high-quality novel view synthesis, and surface reconstruction from multi-view RGB-only inputs, addressing limitations of existing approaches.
Abstract: In modern dense 3D reconstruction, feed-forward systems (e.g., VGGT, pi3) focus on end-to-end matching and geometry prediction but do not explicitly output novel view synthesis (NVS). Neural rendering-based approaches offer high-fidelity NVS and detailed geometry from posed images, yet they typically assume fixed camera poses and can be sensitive to pose errors. As a result, it remains non-trivial to obtain a single framework that can offer accurate poses, reliable depth, high-quality rendering, and accurate 3D surfaces from casually captured views. We present NeVStereo, a NeRF-driven NVS-stereo architecture that aims to jointly deliver camera poses, multi-view depth, novel view synthesis, and surface reconstruction from multi-view RGB-only inputs. NeVStereo combines NeRF-based NVS for stereo-friendly renderings, confidence-guided multi-view depth estimation, NeRF-coupled bundle adjustment for pose refinement, and an iterative refinement stage that updates both depth and the radiance field to improve geometric consistency. This design mitigates common NeRF-based issues such as surface stacking, artifacts, and pose-depth coupling. Across indoor, outdoor, tabletop, and aerial benchmarks, our experiments indicate that NeVStereo achieves consistently strong zero-shot performance, with up to 36% lower depth error, 10.4% improved pose accuracy, 4.5% higher NVS fidelity, and state-of-the-art mesh quality (F1 91.93%, Chamfer 4.35 mm) compared to existing prestigious methods.
[135] Multi-AD: Cross-Domain Unsupervised Anomaly Detection for Medical and Industrial Applications
Wahyu Rahmaniar, Kenji Suzuki
Main category: cs.CV
TL;DR: Multi-AD: A CNN-based unsupervised anomaly detection model using squeeze-and-excitation blocks, knowledge distillation, and teacher-student architecture for cross-domain applications in medical and industrial imaging.
Details
Motivation: Addresses the lack of annotated data in cross-domain applications like anomaly detection for medical diagnosis and industrial defect detection, requiring robust unsupervised methods that generalize across domains.
Method: Uses CNN with squeeze-and-excitation blocks for channel-wise attention, knowledge distillation to transfer features from teacher to student model, discriminator network to enhance anomaly discrimination, and multi-scale feature integration for detecting varying anomaly sizes.
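For reference, a standard squeeze-and-excitation block of the kind the summary mentions; the reduction ratio and layout follow common practice, not necessarily the exact Multi-AD configuration.

# Standard SE block: squeeze spatial context, then reweight channels.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: global context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # excite: channel-wise attention

# y = SEBlock(64)(torch.randn(2, 64, 32, 32))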
Result: Achieved best average AUROC scores: image-level (81.4% medical, 99.6% industrial) and pixel-level (97.0% medical, 98.4% industrial), outperforming state-of-the-art models on brain MRI, liver CT, retina OCT, and MVTec AD datasets.
Conclusion: Multi-AD demonstrates strong cross-domain generalization for unsupervised anomaly detection, making it effective for real-world medical and industrial applications where annotated data is scarce.
Abstract: Traditional deep learning models often lack annotated data, especially in cross-domain applications such as anomaly detection, which is critical for early disease diagnosis in medicine and defect detection in industry. To address this challenge, we propose Multi-AD, a convolutional neural network (CNN) model for robust unsupervised anomaly detection across medical and industrial images. Our approach employs the squeeze-and-excitation (SE) block to enhance feature extraction via channel-wise attention, enabling the model to focus on the most relevant features and detect subtle anomalies. Knowledge distillation (KD) transfers informative features from the teacher to the student model, enabling effective learning of the differences between normal and anomalous data. Then, the discriminator network further enhances the model’s capacity to distinguish between normal and anomalous data. At the inference stage, by integrating multi-scale features, the student model can detect anomalies of varying sizes. The teacher-student (T-S) architecture ensures consistent representation of high-dimensional features while adapting them to enhance anomaly detection. Multi-AD was evaluated on several medical datasets, including brain MRI, liver CT, and retina OCT, as well as industrial datasets, such as MVTec AD, demonstrating strong generalization across multiple domains. Experimental results demonstrated that our approach consistently outperformed state-of-the-art models, achieving the best average AUROC for both image-level (81.4% for medical and 99.6% for industrial) and pixel-level (97.0% for medical and 98.4% for industrial) tasks, making it effective for real-world applications.
[136] LD-SLRO: Latent Diffusion Structured Light for 3-D Reconstruction of Highly Reflective Objects
Sanghoon Jeon, Gihyun Jung, Suhyeon Ka, Jae-Sang Hyun
Main category: cs.CV
TL;DR: A latent diffusion-based method (LD-SLRO) improves 3D reconstruction of reflective objects by restoring distorted fringe patterns using conditional diffusion models.
Details
Motivation: 3D reconstruction of glossy surfaces using fringe projection profilometry is challenging due to specular reflection and indirect illumination causing severe distortion or loss of fringe patterns.
Method: Phase-shifted fringe images from reflective surfaces are encoded to extract latent representations of reflectance characteristics. These features condition a latent diffusion model that probabilistically suppresses reflection artifacts and recovers lost fringe information. Key components include specular reflection encoder, time-variant channel affine layer, and attention modules.
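For background on what restored phase-shifted fringes feed into, the textbook N-step formula for the wrapped phase is sketched below; this is standard fringe projection profilometry, not code from the paper.

# N-step phase shifting: recover the wrapped phase from fringes shifted by 2*pi*n/N.
import numpy as np

def wrapped_phase(fringes):
    """fringes: (N, H, W) images with phase shifts 2*pi*n/N, n = 0..N-1 (N >= 3)."""
    n = fringes.shape[0]
    shifts = 2 * np.pi * np.arange(n) / n
    num = np.tensordot(np.sin(shifts), fringes, axes=1)   # sum_n I_n * sin(shift_n)
    den = np.tensordot(np.cos(shifts), fringes, axes=1)   # sum_n I_n * cos(shift_n)
    return np.arctan2(-num, den)                          # wrapped phase in (-pi, pi]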
Result: The method improves both fringe quality and 3D reconstruction accuracy, reducing average root-mean-squared error from 1.8176 mm to 0.9619 mm compared to state-of-the-art methods.
Conclusion: LD-SLRO effectively addresses reflection-induced artifacts in fringe projection profilometry for reflective objects and provides flexibility in configuring input/output fringe sets.
Abstract: Fringe projection profilometry-based 3-D reconstruction of objects with high reflectivity and low surface roughness remains a significant challenge. When measuring such glossy surfaces, specular reflection and indirect illumination often lead to severe distortion or loss of the projected fringe patterns. To address these issues, we propose a latent diffusion-based structured light for reflective objects (LD-SLRO). Phase-shifted fringe images captured from highly reflective surfaces are first encoded to extract latent representations that capture surface reflectance characteristics. These latent features are then used as conditional inputs to a latent diffusion model, which probabilistically suppresses reflection-induced artifacts and recovers lost fringe information, yielding high-quality fringe images. The proposed components, including the specular reflection encoder, time-variant channel affine layer, and attention modules, further improve fringe restoration quality. In addition, LD-SLRO provides high flexibility in configuring the input and output fringe sets. Experimental results demonstrate that the proposed method improves both fringe quality and 3-D reconstruction accuracy over state-of-the-art methods, reducing the average root-mean-squared error from 1.8176 mm to 0.9619 mm.
[137] Stable Velocity: A Variance Perspective on Flow Matching
Donglin Yang, Yongxing Zhang, Xin Yu, Liang Hou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Renjie Liao
Main category: cs.CV
TL;DR: The Stable Velocity framework reduces variance in flow matching training and accelerates sampling by exploiting low-variance regimes near the data distribution.
Details
Motivation: Flow matching's reliance on single-sample conditional velocities creates high-variance training targets that destabilize optimization and slow convergence, especially near the prior distribution.
Method: Proposes Stable Velocity framework with: 1) Stable Velocity Matching (StableVM) for unbiased variance reduction in training, 2) Variance-Aware Representation Alignment (VA-REPA) for adaptive auxiliary supervision, and 3) Stable Velocity Sampling (StableVS) for finetuning-free acceleration using closed-form simplifications in low-variance regimes.
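For orientation, the vanilla conditional flow-matching step the abstract builds on: with a linear path x_t = (1 - t) x0 + t x1, the single-sample velocity target is x1 - x0, and its spread across (x0, x1) pairs is the variance the paper characterizes. The model call signature below is a placeholder.

# Standard conditional flow matching step (not the paper's StableVM variant).
import torch

def flow_matching_step(model, x1):
    x0 = torch.randn_like(x1)                      # prior sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1                    # point on the linear path
    target = x1 - x0                               # single-sample conditional velocity
    pred = model(x_t, t.flatten())                 # model: any net taking (x_t, t)
    return ((pred - target) ** 2).mean()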
Result: Extensive experiments on ImageNet 256×256 and large pretrained models (SD3.5, Flux, Qwen-Image, Wan2.2) show consistent training efficiency improvements and more than 2× faster sampling without quality degradation
Conclusion: The Stable Velocity framework successfully addresses variance issues in flow matching, improving both training stability and sampling efficiency across various diffusion models and scales
Abstract: While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthen auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
[138] Synthetic Defect Geometries of Cast Metal Objects Modeled via 2d Voronoi Tessellations
Natascha Jeziorski, Petra Gospodnetić, Claudia Redenbach
Main category: cs.CV
TL;DR: Parametric 3D mesh modeling of manufacturing defects for synthetic data generation in automated defect detection systems
Details
Motivation: Automated defect detection requires large amounts of training data, but real defective samples are scarce. Synthetic data generation using digital twins provides controllable, scalable training data with perfect annotations.
Method: Develop parametric methods to model various defect types as 3D mesh objects that can be added to object geometry. Use physically-based Monte Carlo simulations of inspection methods to generate synthetic data resembling real inspection data.
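A minimal illustration of the 2d Voronoi idea: tessellate random seed points and take one bounded cell as a defect cross-section polygon. The parameters and the choice of cell are illustrative; the paper's generators add far more shape control.

# Sketch: extract one bounded Voronoi cell as a 2d defect cross-section.
import numpy as np
from scipy.spatial import Voronoi

def random_defect_polygon(n_seeds=40, seed=0):
    rng = np.random.default_rng(seed)
    vor = Voronoi(rng.random((n_seeds, 2)))
    for region_idx in vor.point_region:
        region = vor.regions[region_idx]
        if region and -1 not in region:            # bounded cell only (no vertex at infinity)
            return vor.vertices[region]            # polygon vertices, shape (M, 2)
    return None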
Result: Created variable and arbitrarily large synthetic datasets with pixel-perfect annotations, including rare defects in sufficient quantities. Demonstrated with visual surface inspection but applicable to other NDT methods.
Conclusion: Parametric defect modeling enables scalable synthetic data generation for training automated defect detection systems, overcoming data scarcity issues while maintaining controllability and annotation quality.
Abstract: In industry, defect detection is crucial for quality control. Non-destructive testing (NDT) methods are preferred as they do not influence the functionality of the object while inspecting. Automated data evaluation for automated defect detection is a growing field of research. In particular, machine learning approaches show promising results. To provide training data in sufficient amount and quality, synthetic data can be used. Rule-based approaches enable synthetic data generation in a controllable environment. Therefore, a digital twin of the inspected object including synthetic defects is needed. We present parametric methods to model 3d mesh objects of various defect types that can then be added to the object geometry to obtain synthetic defective objects. The models are motivated by common defects in metal casting but can be transferred to other machining procedures that produce similar defect shapes. Synthetic data resembling the real inspection data can then be created by using a physically based Monte Carlo simulation of the respective testing method. Using our defect models, a variable and arbitrarily large synthetic data set can be generated with the possibility to include rarely occurring defects in sufficient quantity. Pixel-perfect annotation can be created in parallel. As an example, we will use visual surface inspection, but the procedure can be applied in combination with simulations for any other NDT method.
[139] DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
Main category: cs.CV
TL;DR: A novel distillation-compatible learnable feature caching mechanism for accelerating video diffusion models, achieving 11.8× speedup while preserving quality.
Details
Motivation: Existing acceleration methods for video diffusion models have limitations: feature caching causes semantic/detail loss with compression, while step-distillation suffers quality degradation in video generation with few steps. Combining these methods worsens quality due to sparser sampling steps.
Method: 1) Introduces learnable neural predictor instead of training-free heuristics for feature caching, better capturing high-dimensional feature evolution. 2) Proposes conservative Restricted MeanFlow approach for stable, lossless distillation on large-scale video models.
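A hypothetical sketch of a lightweight cache predictor: map a cached feature plus a timestep to the feature expected at the next step so the expensive block can be skipped on cached steps. Architecture and training target are assumptions, not DisCa's actual design.

# Hypothetical: residual MLP that predicts next-step features from cached ones.
import torch
import torch.nn as nn

class CachePredictor(nn.Module):
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, cached_feat, t):
        # cached_feat: (B, T, D); t: (B,) diffusion timestep in [0, 1]
        t_emb = t.view(-1, 1, 1).expand(-1, cached_feat.shape[1], 1)
        return cached_feat + self.net(torch.cat([cached_feat, t_emb], dim=-1))

# Trained with MSE against the block's true output; at inference the predictor
# replaces the block on steps where the cache is reused.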
Result: Achieves 11.8× acceleration while preserving generation quality, outperforming existing methods that suffer quality degradation with compression.
Conclusion: The proposed distillation-compatible learnable feature caching mechanism successfully pushes acceleration boundaries for video diffusion models while maintaining quality, addressing limitations of existing training-free and distillation-based approaches.
Abstract: While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code is in the supplementary materials and will be publicly available.
[140] Attention Retention for Continual Learning with Vision Transformers
Yue Lu, Xiangyu Zhou, Shizhou Zhang, Yinghui Xing, Guoqiang Liang, Wencong Zhang
Main category: cs.CV
TL;DR: A novel attention-retaining framework for continual learning in Vision Transformers that mitigates catastrophic forgetting by constraining attention drift through gradient masking.
Details
Motivation: Catastrophic forgetting remains a critical challenge in continual learning, where models lose previously learned knowledge when acquiring new tasks. The paper identifies attention drift in Vision Transformers as a primary source of this forgetting, where attention to previously learned visual concepts shifts significantly after learning new tasks.
Method: Proposes an attention-retaining framework with a two-step process: 1) Extract attention maps of previous tasks using layer-wise rollout mechanism and generate instance-adaptive binary masks, 2) When learning new tasks, apply these masks to zero out gradients associated with previous attention regions, preventing disruption of learned visual concepts. Enhanced with gradient scaling for compatibility with modern optimizers.
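The rollout-and-mask step can be sketched as follows: fold per-layer attention maps into one rollout map, then threshold it into a binary mask of regions to protect. The quantile threshold, the CLS-token convention, and how the mask later gates gradients are all simplified assumptions here.

# Sketch: attention rollout across layers, then a binary "protect" mask.
import torch

def attention_rollout(attn_maps):
    """attn_maps: list of (B, heads, T, T) attention tensors, one per layer."""
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)                                  # average heads: (B, T, T)
        a = a + torch.eye(a.shape[-1], device=a.device)       # add residual connection
        a = a / a.sum(dim=-1, keepdim=True)                   # re-normalize rows
        rollout = a if rollout is None else a @ rollout       # compose layer by layer
    return rollout

def binary_mask(rollout, quantile=0.7):
    cls_attn = rollout[:, 0, 1:]                              # assumes CLS token at index 0
    thr = torch.quantile(cls_attn, quantile, dim=-1, keepdim=True)
    return (cls_attn >= thr).float()                          # 1 = region to protect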
Result: The method achieves state-of-the-art performance and exhibits robust generalizability across diverse continual learning scenarios. Experiments and visualizations demonstrate effectiveness in mitigating catastrophic forgetting and preserving visual concepts.
Conclusion: The attention-retaining framework successfully addresses catastrophic forgetting in Vision Transformers by constraining attention drift through gradient masking, inspired by neuroscientific insights into selective attention in the human visual system.
Abstract: Continual learning (CL) empowers AI systems to progressively acquire knowledge from non-stationary data streams. However, catastrophic forgetting remains a critical challenge. In this work, we identify attention drift in Vision Transformers as a primary source of catastrophic forgetting, where the attention to previously learned visual concepts shifts significantly after learning new tasks. Inspired by neuroscientific insights into the selective attention in the human visual system, we propose a novel attention-retaining framework to mitigate forgetting in CL. Our method constrains attention drift by explicitly modifying gradients during backpropagation through a two-step process: 1) extracting attention maps of the previous task using a layer-wise rollout mechanism and generating instance-adaptive binary masks, and 2) when learning a new task, applying these masks to zero out gradients associated with previous attention regions, thereby preventing disruption of learned visual concepts. For compatibility with modern optimizers, the gradient masking process is further enhanced by scaling parameter updates proportionally to maintain their relative magnitudes. Experiments and visualizations demonstrate the effectiveness of our method in mitigating catastrophic forgetting and preserving visual concepts. It achieves state-of-the-art performance and exhibits robust generalizability across diverse CL scenarios.
[141] MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
Dekang Qi, Shuang Zeng, Xinyuan Chang, Feng Xiong, Shichao Xie, Xiaolong Wu, Mu Xu
Main category: cs.CV
TL;DR: A Memory-Execute-Review framework for Visual Language Navigation that combines hierarchical memory, routine decision-making, and behavior correction to achieve both high success rates and strong generalization.
Details
Motivation: Existing VLN methods struggle to achieve both high success rates and good generalization - supervised fine-tuning approaches get better success rates but training-free approaches generalize better. There's a need for a framework that can achieve both simultaneously.
Method: Proposes a Memory-Execute-Review framework with three components: 1) hierarchical memory module for information support, 2) execute module for routine decision-making and actions, and 3) review module for handling abnormal situations and behavior correction.
Result: Achieved absolute improvements of 7% and 5% average success rate compared to all baselines under training-free and zero-shot settings across 4 datasets. On HM3D_v0.1 and HM3D_OVON, improved by 8% and 6% under zero-shot. Outperformed both training-free and supervised fine-tuning methods on MP3D and HM3D_OVON datasets.
Conclusion: The Memory-Execute-Review framework successfully addresses the trade-off between success rate and generalization in VLN, achieving comprehensive leadership in both metrics across multiple challenging datasets.
Abstract: Visual Language Navigation (VLN) is one of the fundamental capabilities for embodied intelligence and a critical challenge that urgently needs to be addressed. However, existing methods are still unsatisfactory in terms of both success rate (SR) and generalization: Supervised Fine-Tuning (SFT) approaches typically achieve higher SR, while Training-Free (TF) approaches often generalize better, but it is difficult to obtain both simultaneously. To this end, we propose a Memory-Execute-Review framework. It consists of three parts: a hierarchical memory module for providing information support, an execute module for routine decision-making and actions, and a review module for handling abnormal situations and correcting behavior. We validated the effectiveness of this framework on the Object Goal Navigation task. Across 4 datasets, our average SR achieved absolute improvements of 7% and 5% compared to all baseline methods under TF and Zero-Shot (ZS) settings, respectively. On the most commonly used HM3D_v0.1 and the more challenging open vocabulary dataset HM3D_OVON, the SR improved by 8% and 6%, under ZS settings. Furthermore, on the MP3D and HM3D_OVON datasets, our method not only outperformed all TF methods but also surpassed all SFT methods, achieving comprehensive leadership in both SR (5% and 2%) and generalization.
[142] SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing
Peihao Wu, Yongxiang Yao, Yi Wan, Wenfei Zhang, Ruipeng Zhao, Jiayuan Li, Yongjun Zhang
Main category: cs.CV
TL;DR: SOMA-1M is a large-scale, pixel-level aligned SAR-optical dataset with 1.3M image pairs across 12 land cover categories and multiple spatial resolutions (0.5m-10m), enabling training and evaluation of multimodal remote sensing foundation models.
Details
Motivation: Existing benchmark datasets for SAR and optical imagery have limitations including single spatial resolution, insufficient scale, and poor alignment accuracy, which hinder the development of robust multimodal foundation models for remote sensing applications.
Method: Created SOMA-1M dataset by integrating imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, using a coarse-to-fine image matching framework to ensure pixel-level alignment across 12 land cover categories at multiple resolutions.
Result: Dataset contains over 1.3 million precisely aligned image pairs with global multi-scale coverage. Established evaluation benchmarks for four vision tasks (matching, fusion, cloud removal, translation) showing SOTA performance in multimodal image matching and significant improvements across all tasks when trained on SOMA-1M.
Conclusion: SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models, addressing critical data limitations in the field and enabling better cross-modal collaborative processing.
Abstract: Synthetic Aperture Radar (SAR) and optical imagery provide complementary strengths that constitute the critical foundation for transcending single-modality constraints and facilitating cross-modal collaborative processing and intelligent interpretation. However, existing benchmark datasets often suffer from limitations such as single spatial resolution, insufficient data scale, and low alignment accuracy, making them inadequate for supporting the training and generalization of multi-scale foundation models. To address these challenges, we introduce SOMA-1M (SAR-Optical Multi-resolution Alignment), a pixel-level precisely aligned dataset containing over 1.3 million pairs of georeferenced images with a specification of 512 x 512 pixels. This dataset integrates imagery from Sentinel-1, PIESAT-1, Capella Space, and Google Earth, achieving global multi-scale coverage from 0.5 m to 10 m. It encompasses 12 typical land cover categories, effectively ensuring scene diversity and complexity. To address multimodal projection deformation and massive data registration, we designed a rigorous coarse-to-fine image matching framework ensuring pixel-level alignment. Based on this dataset, we established comprehensive evaluation benchmarks for four hierarchical vision tasks, including image matching, image fusion, SAR-assisted cloud removal, and cross-modal translation, involving over 30 mainstream algorithms. Experimental results demonstrate that supervised training on SOMA-1M significantly enhances performance across all tasks. Notably, multimodal remote sensing image (MRSI) matching performance achieves current state-of-the-art (SOTA) levels. SOMA-1M serves as a foundational resource for robust multimodal algorithms and remote sensing foundation models. The dataset will be released publicly at: https://github.com/PeihaoWu/SOMA-1M.
[143] Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence – A report on experiments done in 2014
Julien Moreau, S. Ambellouis, Yassine Ruichek
Main category: cs.CV
TL;DR: A 2014 technical report presenting PFSeq dataset for fisheye image analysis, comparing feature detectors/descriptors for fisheye visual odometry and stereo vision in urban scenes.
Details
Motivation: Address the chicken-and-egg problem in fisheye camera calibration: good features are needed for calibration, but an accurate projection model is needed for optimal feature detection. The goal is to find the best feature detectors/descriptors for fisheye images in a self-calibration context for urban scene applications.
Method: Created PFSeq dataset of photorealistic fisheye sequences. Conducted comprehensive experiments comparing various feature detection and description algorithms on fisheye images, focusing on cameras mounted on car roofs aiming at the zenith for visual odometry and stereovision.
Result: Provides detailed experimental results comparing feature detectors/descriptors for fisheye images, though results are from 2014 and haven’t been updated with state-of-the-art evolution. The PFSeq dataset is made publicly available.
Conclusion: This 2014 technical report contributes a fisheye dataset and comparative analysis of feature detection methods for fisheye camera calibration, though it’s limited by being outdated and not peer-reviewed.
Abstract: What this report is: a scientific report contributing a detailed bibliography, a dataset which we now call PFSeq ("Photorealistic Fisheye Sequence") and make available at https://doi.org/10.57745/DYIVVU, and comprehensive experiments. This work should be considered a draft; it was done during my PhD thesis "Construction of 3D models from fisheye video data - Application to the localisation in urban area" in 2014 [Mor16], and these results have never been published. The aim was to find the best feature detector and descriptor for fisheye images in a self-calibration context, with cameras mounted on the roof of a car and aiming at the zenith (to then perform fisheye visual odometry and stereovision in urban scenes). We face a chicken-and-egg problem: we cannot take advantage of an accurate projection model for optimal feature detection and description, yet we need good features to perform the calibration (i.e., to compute the accurate projection model of the camera). What this report is not: it does not contribute new feature algorithms, it does not compare standard feature algorithms to algorithms designed for omnidirectional images (unfortunately), and it has not been peer-reviewed. The discussions have been translated and enhanced, but the experiments have not been re-run and the report has not been updated to follow the evolution of the state of the art (read this as a 2014 report).
[144] VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency
Zhuang Xiong, Chen Zhang, Qingshan Xu, Wenbing Tao
Main category: cs.CV
TL;DR: VGGT-Motion is a calibration-free monocular SLAM system that addresses scale drift in long sequences through motion-aware submap construction and efficient global optimization.
Details
Motivation: Existing calibration-free monocular SLAM systems suffer from severe scale drift on long sequences due to motion-agnostic partitioning that breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive.
Method: 1) Motion-aware submap construction using optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. 2) Anchor-driven direct Sim(3) registration using context-balanced anchors for search-free, pixel-wise dense alignment and efficient loop closure. 3) Lightweight submap-level pose graph optimization with linear complexity for global consistency.
Result: VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM on kilometer-scale trajectories.
Conclusion: The proposed system effectively addresses scale drift in long sequences through motion-aware partitioning and efficient global optimization, enabling robust calibration-free monocular SLAM over kilometer-scale trajectories.
Abstract: Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
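A minimal sketch of flow-driven submap partitioning of the kind described, assuming per-frame mean optical-flow magnitudes have already been computed; the static threshold and motion budget are illustrative values, not the paper's.

```python
import numpy as np

def partition_submaps(flow_mags, static_thresh=0.2, budget=40.0):
    """Group frames into submaps by accumulated optical-flow magnitude:
    near-static frames are pruned and a new submap is closed once the
    accumulated motion exceeds `budget`. Thresholds are illustrative."""
    submaps, current, accum = [], [], 0.0
    for idx, mag in enumerate(flow_mags):
        if mag < static_thresh:
            continue                       # prune static redundancy
        current.append(idx)
        accum += mag
        if accum >= budget:                # enough motion: close the submap
            submaps.append(current)
            current, accum = [], 0.0
    if current:
        submaps.append(current)
    return submaps

# Synthetic per-frame mean flow magnitudes (pixels/frame): static, slow, fast segments.
mags = np.concatenate([np.full(30, 0.05), np.full(60, 1.5), np.full(30, 3.0)])
print([len(s) for s in partition_submaps(mags)])   # submap sizes adapt to motion
```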
[145] Mapper-GIN: Lightweight Structural Graph Abstraction for Corrupted 3D Point Cloud Classification
Jeongbin You, Donggun Kim, Sejun Park, Seungsang Oh
Main category: cs.CV
TL;DR: Mapper-GIN: A lightweight 3D point cloud classification method using topological decomposition via Mapper algorithm and Graph Isomorphism Network for robust recognition.
Details
Motivation: To explore whether structural abstraction alone can improve robustness in 3D point cloud classification, avoiding the need for scaling up backbones or specialized data augmentation.
Method: Partitions point clouds into overlapping regions using Mapper algorithm (PCA lens, cubical cover, density-based clustering), constructs region graph from overlaps, and performs graph classification with Graph Isomorphism Network (GIN).
Result: Achieves competitive and stable accuracy on ModelNet40-C corruption benchmark under Noise and Transformation corruptions with only 0.5M parameters, outperforming heavier architectures.
Conclusion: Region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition through simple structural abstraction.
Abstract: Robust 3D point cloud classification is often pursued by scaling up backbones or relying on specialized data augmentation. We instead ask whether structural abstraction alone can improve robustness, and study a simple topology-inspired decomposition based on the Mapper algorithm. We propose Mapper-GIN, a lightweight pipeline that partitions a point cloud into overlapping regions using Mapper (PCA lens, cubical cover, and followed by density-based clustering), constructs a region graph from their overlaps, and performs graph classification with a Graph Isomorphism Network. On the corruption benchmark ModelNet40-C, Mapper-GIN achieves competitive and stable accuracy under Noise and Transformation corruptions with only 0.5M parameters. In contrast to prior approaches that require heavier architectures or additional mechanisms to gain robustness, Mapper-GIN attains strong corruption robustness through simple region-level graph abstraction and GIN message passing. Overall, our results suggest that region-graph structure offers an efficient and interpretable source of robustness for 3D visual recognition.
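A compact Mapper-style sketch of the region-graph construction (PCA lens, overlapping interval cover, DBSCAN per cover element, edges from shared points); all parameters are illustrative, and the GIN classifier that follows in the paper is omitted.

```python
import numpy as np
import networkx as nx
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def mapper_region_graph(points, n_intervals=8, overlap=0.3, eps=0.15):
    """Build a Mapper-style region graph from a point cloud: 1D PCA lens,
    overlapping interval cover, DBSCAN inside each cover element; nodes are
    clusters and edges connect clusters sharing points. Illustrative parameters."""
    lens = PCA(n_components=1).fit_transform(points).ravel()
    lo, hi = lens.min(), lens.max()
    width = (hi - lo) / n_intervals
    graph, node_members = nx.Graph(), []

    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((lens >= a) & (lens <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(points[idx])
        for lab in set(labels) - {-1}:               # skip DBSCAN noise
            members = set(idx[labels == lab])
            graph.add_node(len(node_members), size=len(members))
            node_members.append(members)

    for u in range(len(node_members)):               # edges from shared points
        for v in range(u + 1, len(node_members)):
            if node_members[u] & node_members[v]:
                graph.add_edge(u, v)
    return graph

# Toy point cloud: two small blobs separated along one axis.
pts = np.vstack([np.random.randn(200, 3) * 0.1,
                 np.random.randn(200, 3) * 0.1 + [1.5, 0, 0]])
g = mapper_region_graph(pts)
print(g.number_of_nodes(), g.number_of_edges())
```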
[146] Generalization of Self-Supervised Vision Transformers for Protein Localization Across Microscopy Domains
Ben Isselmann, Dilara Göksu, Andreas Weinmann
Main category: cs.CV
TL;DR: DINO-pretrained Vision Transformers show strong cross-domain transferability for protein localization in microscopy images, with domain-relevant self-supervised pretraining (HPA dataset) outperforming ImageNet and even in-domain OpenCell pretraining.
Details
Motivation: Microscopy datasets are often too small for robust deep learning, and while SSL can help with large unlabeled datasets, it's unclear how well SSL representations transfer across different microscopy domains with varying staining protocols and channel configurations.
Method: Investigated cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on OpenCell dataset. Generated image embeddings using three DINO backbones pretrained on: 1) ImageNet-1k, 2) Human Protein Atlas (HPA), and 3) OpenCell. Evaluated by training supervised classification head on OpenCell labels.
Result: All pretrained models transferred well. HPA-pretrained model (microscopy-specific) achieved best performance with mean macro F1-score = 0.8221 ± 0.0062, slightly outperforming DINO model trained directly on OpenCell (0.8057 ± 0.0090). ImageNet-pretrained model also performed well but slightly lower.
Conclusion: Large-scale pretraining is valuable for microscopy analysis. Domain-relevant SSL representations (like HPA-pretrained) can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even with limited task-specific labeled data.
Abstract: Task-specific microscopy datasets are often too small to train deep learning models that learn robust feature representations. Self-supervised learning (SSL) can mitigate this by pretraining on large unlabeled datasets, but it remains unclear how well such representations transfer across microscopy domains with different staining protocols and channel configurations. We investigate the cross-domain transferability of DINO-pretrained Vision Transformers for protein localization on the OpenCell dataset. We generate image embeddings using three DINO backbones pretrained on ImageNet-1k, the Human Protein Atlas (HPA), and OpenCell, and evaluate them by training a supervised classification head on OpenCell labels. All pretrained models transfer well, with the microscopy-specific HPA-pretrained model achieving the best performance (mean macro $F_1$-score $= 0.8221 \pm 0.0062$), slightly outperforming a DINO model trained directly on OpenCell ($0.8057 \pm 0.0090$). These results highlight the value of large-scale pretraining and indicate that domain-relevant SSL representations can generalize effectively to related but distinct microscopy datasets, enabling strong downstream performance even when task-specific labeled data are limited.
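The evaluation protocol boils down to a supervised head over frozen embeddings; below is a minimal linear-probe sketch assuming the DINO features have already been extracted (random arrays stand in for them), with the macro F1 computed via scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder embeddings: in practice these would come from a frozen DINO ViT
# backbone applied to OpenCell crops (e.g. 768-d CLS tokens).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 768)).astype(np.float32)
y = rng.integers(0, 10, size=2000)            # 10 hypothetical localization classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# "Supervised classification head" reduced to its simplest form: a linear probe.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
print(f"macro F1 = {macro_f1:.4f}")           # near chance on random features
```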
[147] SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation
Youngwoo Shin, Jiwan Hur, Junmo Kim
Main category: cs.CV
TL;DR: SSG is a training-free inference-time guidance method for visual autoregressive models that ensures proper coarse-to-fine hierarchy by emphasizing high-frequency semantic residuals through frequency-domain processing.
Details
Motivation: Visual autoregressive models can drift from their intended coarse-to-fine hierarchy during inference due to limited capacity and accumulated errors, leading to train-inference discrepancy and degraded image quality.
Method: Proposes Scaled Spatial Guidance (SSG) with Discrete Spatial Enhancement (DSE) - a frequency-domain procedure to isolate semantic residuals (high-frequency content not explained by coarser scales) and guide generation toward proper hierarchy without retraining.
Result: SSG consistently improves fidelity and diversity across VAR models while preserving low latency, demonstrating untapped efficiency in coarse-to-fine image generation.
Conclusion: SSG effectively mitigates train-inference discrepancy in VAR models through principled frequency-domain guidance, enabling better hierarchical generation without additional training.
Abstract: Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.
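This is not the paper's DSE, but a toy illustration of the underlying idea: isolate the high-frequency residual of a fine scale that an upsampled coarser scale does not explain, here with a simple FFT high-pass; the cutoff and upsampling scheme are placeholders.

```python
import numpy as np

def semantic_residual(fine, coarse, cutoff=0.25):
    """Illustrative high-frequency residual: upsample the coarse scale, subtract
    it from the fine scale, then keep only frequencies above `cutoff` (fraction
    of Nyquist). Not the paper's DSE procedure."""
    h, w = fine.shape
    # Nearest-neighbour upsampling of the coarse image to the fine grid.
    up = np.kron(coarse, np.ones((h // coarse.shape[0], w // coarse.shape[1])))
    residual = fine - up

    # High-pass mask in the FFT domain.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fx**2 + fy**2) > cutoff * 0.5
    return np.real(np.fft.ifft2(np.fft.fft2(residual) * mask))

fine = np.random.rand(64, 64)
coarse = fine[::4, ::4]                        # stand-in for a coarser scale
hf = semantic_residual(fine, coarse)
print(hf.shape, float(np.abs(hf).mean()))
```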
[148] A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments
Malaz Tamim, Andrea Matic-Flierl, Karsten Roscher
Main category: cs.CV
TL;DR: Systematic evaluation of 3D person detection comparing camera-only, LiDAR-only, and camera-LiDAR fusion approaches on JRDB dataset, showing fusion outperforms single-modality models but remains vulnerable to sensor misalignments and corruptions.
Details
Motivation: Accurate 3D person detection is critical for safety applications like robotics and surveillance, but most existing research focuses on autonomous driving. The paper aims to systematically evaluate detection performance and robustness in diverse indoor/outdoor scenes using different sensor modalities.
Method: Evaluated three representative models: BEVDepth (camera-only), PointPillars (LiDAR-only), and DAL (camera-LiDAR fusion) on JRDB dataset. Analyzed performance under varying occlusion and distance levels, and investigated robustness against sensor corruptions and misalignments.
Result: Fusion-based approach (DAL) consistently outperformed single-modality models, especially in challenging scenarios. However, DAL remained sensitive to sensor misalignment and certain LiDAR-based corruptions. Camera-based BEVDepth showed lowest performance and was most affected by occlusion, distance, and noise.
Conclusion: Sensor fusion provides enhanced 3D person detection but has vulnerabilities to sensor misalignments and corruptions. Ongoing research is needed to address these system weaknesses while leveraging the benefits of multimodal approaches.
Abstract: Accurate 3D person detection is critical for safety in applications such as robotics, industrial monitoring, and surveillance. This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. While most existing research focuses on autonomous driving, we explore detection performance and robustness in diverse indoor and outdoor scenes using the JRDB dataset. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion) - and analyze their behavior under varying occlusion and distance levels. Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios. We further investigate robustness against sensor corruptions and misalignments, revealing that while DAL offers improved resilience, it remains sensitive to sensor misalignment and certain LiDAR-based corruptions. In contrast, the camera-based BEVDepth model showed the lowest performance and was most affected by occlusion, distance, and noise. Our findings highlight the importance of utilizing sensor fusion for enhanced 3D person detection, while also underscoring the need for ongoing research to address the vulnerabilities inherent in these systems.
[149] FastVMT: Eliminating Redundancy in Video Motion Transfer
Yue Ma, Zhikai Wang, Tianhao Ren, Mingzhe Zheng, Hongyu Liu, Jiayi Guo, Mark Fong, Yuxuan Xue, Zixiang Zhao, Konrad Schindler, Qifeng Chen, Linfeng Zhang
Main category: cs.CV
TL;DR: FastVMT: A method that accelerates video motion transfer by addressing two types of computational redundancy in Diffusion Transformers - motion redundancy through local attention masking and gradient redundancy through gradient reuse optimization.
Details
Motivation: Current video motion transfer methods using Diffusion Transformers (DiT) suffer from computational inefficiency. While some methods attempt acceleration, they fail to address structural sources of redundancy. The authors identify two specific inefficiencies: motion redundancy (unnecessary computation for distant regions due to small frame-to-frame motion) and gradient redundancy (unnecessary gradient computations due to slow gradient changes along diffusion trajectory).
Method: Two main techniques: 1) To address motion redundancy, mask attention layers to local neighborhoods so interaction weights aren't computed for unnecessarily distant image regions. 2) To exploit gradient redundancy, design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations.
Result: FastVMT achieves an average 3.43x speedup compared to previous methods without degrading visual fidelity or temporal consistency of generated videos.
Conclusion: The paper demonstrates that addressing structural computational redundancies in DiT architectures can significantly accelerate video motion transfer while maintaining output quality, offering a practical solution for efficient video generation.
Abstract: Video motion transfer aims to synthesize videos by generating visual content according to a text prompt while transferring the motion pattern observed in a reference video. Recent methods predominantly use the Diffusion Transformer (DiT) architecture. To achieve satisfactory runtime, several methods attempt to accelerate the computations in the DiT, but fail to address structural sources of inefficiency. In this work, we identify and remove two types of computational redundancy in earlier work: motion redundancy arises because the generic DiT architecture does not reflect the fact that frame-to-frame motion is small and smooth; gradient redundancy occurs if one ignores that gradients change slowly along the diffusion trajectory. To mitigate motion redundancy, we mask the corresponding attention layers to a local neighborhood such that interaction weights are not computed for unnecessarily distant image regions. To exploit gradient redundancy, we design an optimization scheme that reuses gradients from previous diffusion steps and skips unwarranted gradient computations. On average, FastVMT achieves a 3.43x speedup without degrading the visual fidelity or the temporal consistency of the generated videos.
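A minimal sketch of locality-restricted attention of the sort described for motion redundancy: a boolean neighborhood mask handed to PyTorch's scaled_dot_product_attention so weights for distant tokens are never used; the window radius and token layout are illustrative.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, positions, radius=2.0):
    """Scaled dot-product attention restricted to tokens whose (spatial or
    temporal) positions lie within `radius` of the query token, so weights for
    distant regions are never computed. Illustrative, not FastVMT's kernel."""
    # positions: (N, d) coordinates for the N tokens (e.g. frame index, x, y)
    dist = torch.cdist(positions, positions)            # (N, N)
    mask = dist <= radius                                # True where attention is allowed
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

N, D = 64, 32
q = k = v = torch.randn(1, 1, N, D)                      # (batch, heads, tokens, dim)
pos = torch.stack([torch.arange(N).float(), torch.zeros(N)], dim=-1)
out = local_attention(q, k, v, pos, radius=4.0)
print(out.shape)                                         # torch.Size([1, 1, 64, 32])
```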
[150] IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools
Panagiotis Sapoutzoglou, Orestis Vaggelis, Athina Zacharia, Evangelos Sartinas, Maria Pateraki
Main category: cs.CV
TL;DR: IndustryShapes is a new RGB-D benchmark dataset of industrial tools and components for 6D pose estimation, designed to bridge the gap between lab research and real-world manufacturing applications.
Details
Motivation: Existing datasets focus on household/consumer products or use synthetic/clean tabletop environments, lacking realistic industrial settings. There's a need for datasets that better represent real-world manufacturing scenarios to advance industrial robotics applications.
Method: Created a comprehensive RGB-D dataset with 5 new industrial object types captured in realistic assembly settings. Includes both classic set (4.6k images, 6k annotated poses) and extended set with additional modalities including RGB-D static onboarding sequences.
Result: The dataset provides diverse complexity from simple to challenging scenes, with single/multiple objects including multiple instances of same objects. Evaluation on state-of-the-art methods shows room for improvement in industrial 6D pose estimation.
Conclusion: IndustryShapes fills an important gap by providing the first dataset with realistic industrial settings and RGB-D static onboarding sequences, enabling better benchmarking of pose estimation methods for real-world manufacturing applications.
Abstract: We introduce IndustryShapes, a new RGB-D benchmark dataset of industrial tools and components, designed for both instance-level and novel object 6D pose estimation approaches. The dataset provides a realistic and application-relevant testbed for benchmarking these methods in the context of industrial robotics, bridging the gap between lab-based research and deployment in real-world manufacturing scenarios. Unlike many previous datasets that focus on household or consumer products, use synthetic, clean tabletop scenes, or capture objects solely in controlled lab environments, IndustryShapes introduces five new object types with challenging properties, also captured in realistic industrial assembly settings. The dataset has diverse complexity, from simple to more challenging scenes, with single and multiple objects, including scenes with multiple instances of the same object, and it is organized in two parts: the classic set and the extended set. The classic set includes a total of 4.6k images and 6k annotated poses. The extended set introduces additional data modalities to support the evaluation of model-free and sequence-based approaches. To the best of our knowledge, IndustryShapes is the first dataset to offer RGB-D static onboarding sequences. We further evaluate the dataset on a representative set of state-of-the-art methods for instance-based and novel object 6D pose estimation, also including object detection and segmentation, showing that there is room for improvement in this domain. The dataset page can be found at https://pose-lab.github.io/IndustryShapes.
[151] PIRATR: Parametric Object Inference for Robotic Applications with Transformers in 3D Point Clouds
Michael Schwingshackl, Fabio F. Oberweger, Mario Niedermeyer, Huemer Johannes, Markus Murschitz
Main category: cs.CV
TL;DR: PIRATR is an end-to-end 3D object detection framework for robotics that jointly estimates 6-DoF poses and class-specific parametric attributes from point clouds, enabling task-relevant property estimation for robotic manipulation.
Details
Motivation: The paper aims to bridge the gap between low-level geometric reasoning and actionable world models for robotics, enabling robots to not only detect objects but also understand task-relevant properties like gripper openings for manipulation.
Method: Extends PI3DETR with modular, class-specific heads for joint estimation of multi-class 6-DoF poses and parametric attributes from occlusion-affected point clouds. Uses synthetic training and generalizes to real LiDAR data without fine-tuning.
Result: Achieves detection mAP of 0.919 on real outdoor LiDAR scans without additional fine-tuning, validated on automated forklift platform with three diverse object categories (crane grippers, loading platforms, pallets).
Conclusion: PIRATR establishes a new paradigm of pose-aware, parameterized perception that bridges geometric reasoning with actionable world models, enabling scalable simulation-trained perception for dynamic robotic environments.
Abstract: We present PIRATR, an end-to-end 3D object detection framework for robotic use cases in point clouds. Extending PI3DETR, our method streamlines parametric 3D object detection by jointly estimating multi-class 6-DoF poses and class-specific parametric attributes directly from occlusion-affected point cloud data. This formulation enables not only geometric localization but also the estimation of task-relevant properties for parametric objects, such as a gripper’s opening, where the 3D model is adjusted according to simple, predefined rules. The architecture employs modular, class-specific heads, making it straightforward to extend to novel object types without re-designing the pipeline. We validate PIRATR on an automated forklift platform, focusing on three structurally and functionally diverse categories: crane grippers, loading platforms, and pallets. Trained entirely in a synthetic environment, PIRATR generalizes effectively to real outdoor LiDAR scans, achieving a detection mAP of 0.919 without additional fine-tuning. PIRATR establishes a new paradigm of pose-aware, parameterized perception. This bridges the gap between low-level geometric reasoning and actionable world models, paving the way for scalable, simulation-trained perception systems that can be deployed in dynamic robotic environments. Code available at https://github.com/swingaxe/piratr.
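A sketch of the modular, class-specific head idea: one shared pose head plus a per-class parametric head registered in a ModuleDict; the class names and parameter counts (e.g. a single gripper opening width) are placeholders, not the paper's.

```python
import torch
import torch.nn as nn

class ParametricHeads(nn.Module):
    """Shared detection features pass through one pose head plus a per-class
    parametric head (e.g. a gripper's opening width). Adding a new object type
    only requires registering a new entry. Illustrative sketch only."""
    def __init__(self, feat_dim: int, class_params: dict):
        super().__init__()
        self.pose_head = nn.Linear(feat_dim, 7)            # translation + quaternion
        self.param_heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, n) for name, n in class_params.items()
        })

    def forward(self, feats: torch.Tensor, class_name: str):
        return self.pose_head(feats), self.param_heads[class_name](feats)

heads = ParametricHeads(256, {"gripper": 1, "pallet": 2, "platform": 3})
feats = torch.randn(5, 256)                                # 5 detected object queries
pose, params = heads(feats, "gripper")
print(pose.shape, params.shape)                            # (5, 7) (5, 1)
```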
[152] ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors
Zhenxiao Liang, Ning Zhang, Youbao Tang, Ruei-Sung Lin, Qixing Huang, Peng Chang, Jing Xiao
Main category: cs.CV
TL;DR: ShapeGaussian is a template-free 4D human reconstruction method from monocular videos that integrates vision priors to achieve high-fidelity results without relying on SMPL templates or multi-view cues.
Details
Motivation: Existing methods have limitations: template-free approaches (like 4DGS) struggle with high-deformation human motion without multi-view cues, while template-based methods (like HUGS using SMPL) are susceptible to pose estimation errors leading to unrealistic artifacts. There's a need for a method that combines high-fidelity reconstruction with robustness to pose estimation errors.
Method: Two-step pipeline: 1) Learn coarse deformable geometry using pretrained models for data-driven priors, 2) Refine geometry with neural deformation model for fine-grained dynamic details. Leverages 2D vision priors to mitigate pose estimation artifacts and uses multiple reference frames to address 2D keypoint invisibility issues in template-free manner.
Result: Extensive experiments show ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
Conclusion: ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust 4D human scene reconstructions from monocular videos, addressing limitations of both template-based and template-free approaches.
Abstract: We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
[153] Visual Implicit Geometry Transformer for Autonomous Driving
Arsenii Shirokov, Mikhail Kuznetsov, Danila Stepochkin, Egor Evdokimov, Daniil Glazkov, Nikolay Patakin, Anton Konushin, Dmitry Senushkin
Main category: cs.CV
TL;DR: ViGT is a visual implicit geometry transformer for autonomous driving that estimates continuous 3D occupancy fields from surround-view cameras using calibration-free architecture and self-supervised training with image-LiDAR pairs.
Details
Motivation: To create a foundational geometric model for autonomous driving that prioritizes scalability, architectural simplicity, and generalization across diverse sensor configurations without requiring manual annotations.
Method: Uses a calibration-free transformer architecture to estimate continuous 3D occupancy fields in BEV from multiple camera views, trained self-supervised using synchronized image-LiDAR pairs from five large-scale autonomous driving datasets.
Result: Achieves state-of-the-art performance on pointmap estimation with best average rank across baselines, and comparable performance with supervised methods on Occ3D-nuScenes benchmark.
Conclusion: ViGT demonstrates effective continuous 3D occupancy field estimation for autonomous driving with strong generalization across datasets and sensor setups through self-supervised learning.
Abstract: We introduce the Visual Implicit Geometry Transformer (ViGT), an autonomous driving geometric model that estimates continuous 3D occupancy fields from surround-view camera rigs. ViGT represents a step towards foundational geometric models for autonomous driving, prioritizing scalability, architectural simplicity, and generalization across diverse sensor configurations. Our approach achieves this through a calibration-free architecture, enabling a single model to adapt to different sensor setups. Unlike general-purpose geometric foundational models that focus on pixel-aligned predictions, ViGT estimates a continuous 3D occupancy field in a bird's-eye view (BEV), addressing domain-specific requirements. ViGT naturally infers geometry from multiple camera views into a single metric coordinate frame, providing a common representation for multiple geometric tasks. Unlike most existing occupancy models, we adopt a self-supervised training procedure that leverages synchronized image-LiDAR pairs, eliminating the need for costly manual annotations. We validate the scalability and generalizability of our approach by training our model on a mixture of five large-scale autonomous driving datasets (NuScenes, Waymo, NuPlan, ONCE, and Argoverse) and achieving state-of-the-art performance on the pointmap estimation task, with the best average rank across all evaluated baselines. We further evaluate ViGT on the Occ3D-nuScenes benchmark, where ViGT achieves comparable performance with supervised methods. The source code is publicly available at https://github.com/whesense/ViGT.
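A minimal sketch of querying a continuous BEV occupancy field: bilinearly sample a BEV feature plane at (x, y), append the height z, and decode with a small MLP; the architecture and dimensions here are illustrative, not ViGT's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyField(nn.Module):
    """Continuous occupancy query head: sample a BEV feature plane at (x, y)
    and decode occupancy at height z with a small MLP. Illustrative only."""
    def __init__(self, bev_channels: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(bev_channels + 1, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, bev: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # bev: (B, C, H, W) feature plane; xyz: (B, N, 3) with x, y in [-1, 1]
        grid = xyz[:, :, None, :2]                              # (B, N, 1, 2)
        feats = F.grid_sample(bev, grid, align_corners=False)   # (B, C, N, 1)
        feats = feats.squeeze(-1).permute(0, 2, 1)              # (B, N, C)
        z = xyz[..., 2:3]
        return torch.sigmoid(self.mlp(torch.cat([feats, z], dim=-1)))

field = OccupancyField()
bev = torch.randn(1, 64, 100, 100)
queries = torch.rand(1, 4096, 3) * 2 - 1                        # normalized coordinates
print(field(bev, queries).shape)                                # torch.Size([1, 4096, 1])
```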
[154] A Hybrid CNN and ML Framework for Multi-modal Classification of Movement Disorders Using MRI and Brain Structural Features
Mengyu Li, Ingibjörg Kristjánsdóttir, Thilo van Eimeren, Kathrin Giehl, Lotta M. Ellingsen, the ASAP Neuroimaging Initiative
Main category: cs.CV
TL;DR: Hybrid CNN-ML framework using multi-modal MRI data (T1-weighted images, segmentation masks, volumetric measurements) to classify Atypical Parkinsonian Disorders subtypes vs. Parkinson’s Disease with high AUC scores.
Details
Motivation: Early differential diagnosis of Atypical Parkinsonian Disorders (APD) like PSP and MSA is challenging due to overlapping symptoms with Parkinson's Disease, creating a need for reliable imaging biomarkers.
Method: Combines convolutional neural networks with machine learning using multi-modal input: T1-weighted MRI, segmentation masks of 12 deep brain structures, and corresponding volumetric measurements for classification tasks.
Result: Achieved promising classification performance with AUC scores: 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA.
Conclusion: Fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes, potentially enabling more reliable early-stage diagnosis and targeted interventions.
Abstract: Atypical Parkinsonian Disorders (APD), also known as Parkinson-plus syndrome, are a group of neurodegenerative diseases that include progressive supranuclear palsy (PSP) and multiple system atrophy (MSA). In the early stages, overlapping clinical features often lead to misdiagnosis as Parkinson’s disease (PD). Identifying reliable imaging biomarkers for early differential diagnosis remains a critical challenge. In this study, we propose a hybrid framework combining convolutional neural networks (CNNs) with machine learning (ML) techniques to classify APD subtypes versus PD and distinguish between the subtypes themselves: PSP vs. PD, MSA vs. PD, and PSP vs. MSA. The model leverages multi-modal input data, including T1-weighted magnetic resonance imaging (MRI), segmentation masks of 12 deep brain structures associated with APD, and their corresponding volumetric measurements. By integrating these complementary modalities, including image data, structural segmentation masks, and quantitative volume features, the hybrid approach achieved promising classification performance with area under the curve (AUC) scores of 0.95 for PSP vs. PD, 0.86 for MSA vs. PD, and 0.92 for PSP vs. MSA. These results highlight the potential of combining spatial and structural information for robust subtype differentiation. In conclusion, this study demonstrates that fusing CNN-based image features with volume-based ML inputs improves classification accuracy for APD subtypes. The proposed approach may contribute to more reliable early-stage diagnosis, facilitating timely and targeted interventions in clinical practice.
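A toy sketch of the fusion idea: concatenate CNN-derived image embeddings with the 12 structural volume features and score a binary classifier by AUC; all data below are synthetic placeholders, not the study's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300
cnn_feats = rng.normal(size=(n, 128))         # placeholder CNN embeddings of T1 MRI
volumes = rng.normal(size=(n, 12))            # volumes of 12 deep brain structures
y = rng.integers(0, 2, size=n)                # e.g. PSP (1) vs. PD (0), synthetic labels

# Fuse modalities by simple concatenation before the classifier.
X = np.hstack([cnn_feats, volumes])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUC = {auc:.2f}")                     # near 0.5 on random data
```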
[155] LocateEdit-Bench: A Benchmark for Instruction-Based Editing Localization
Shiyu Wu, Shuyan Li, Jing Li, Jing Liu, Yequan Wang
Main category: cs.CV
TL;DR: LocateEdit-Bench: A large-scale dataset of 231K edited images for benchmarking localization methods against instruction-driven image editing, addressing the gap in existing forgery localization approaches.
Details
Motivation: Recent advancements in image editing enable highly controllable and semantically-aware alterations, posing challenges for manipulation localization. Existing AI-generated forgery localization methods focus on inpainting-based manipulations and are ineffective against the latest instruction-based editing paradigms.
Method: Proposes LocateEdit-Bench dataset with 231K edited images using four cutting-edge editing models covering three common edit types. Develops two multi-metric evaluation protocols to assess existing localization methods.
Result: Establishes a foundation to keep pace with evolving image editing landscape and facilitates development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.
Conclusion: The work bridges a critical gap in manipulation localization by providing a benchmark specifically designed for instruction-driven image editing, enabling better evaluation and development of localization methods against modern editing techniques.
Abstract: Recent advancements in image editing have enabled highly controllable and semantically-aware alteration of visual content, posing unprecedented challenges to manipulation localization. However, existing AI-generated forgery localization methods primarily focus on inpainting-based manipulations, making them ineffective against the latest instruction-based editing paradigms. To bridge this critical gap, we propose LocateEdit-Bench, a large-scale dataset comprising 231K edited images, designed specifically to benchmark localization methods against instruction-driven image editing. Our dataset incorporates four cutting-edge editing models and covers three common edit types. We conduct a detailed analysis of the dataset and develop two multi-metric evaluation protocols to assess existing localization methods. Our work establishes a foundation to keep pace with the evolving landscape of image editing, thereby facilitating the development of effective methods for future forgery localization. Dataset will be open-sourced upon acceptance.
[156] LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation
Junyang Chen, Xiangbo Lv, Zhiqiang Kou, Xingdong Sheng, Ning Xu, Yiguo Qiao
Main category: cs.CV
TL;DR: LoGoSeg is an efficient single-stage framework for open-vocabulary semantic segmentation that integrates object existence priors, region-aware alignment, and dual-stream fusion to improve spatial alignment and reduce hallucinations without needing external mask proposals or extra datasets.
Details
Motivation: Existing open-vocabulary semantic segmentation methods relying on vision-language models like CLIP suffer from imprecise spatial alignment due to image-level pretraining, leading to mismatched segmentations in ambiguous/cluttered scenes. They also lack strong object priors and region-level constraints, causing object hallucination or missed detections.
Method: LoGoSeg introduces three key innovations: 1) Object existence prior that dynamically weights relevant categories through global image-text similarity to reduce hallucinations; 2) Region-aware alignment module for precise region-level visual-textual correspondences; 3) Dual-stream fusion mechanism combining local structural information with global semantic context. The framework is single-stage and eliminates the need for external mask proposals, additional backbones, or extra datasets.
Result: Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate competitive performance and strong generalization in open-vocabulary settings.
Conclusion: LoGoSeg effectively addresses spatial alignment issues in open-vocabulary semantic segmentation by integrating object priors and region-level constraints, achieving efficient and accurate segmentation without requiring additional resources.
Abstract: Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. However, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
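A small sketch of an object-existence prior of the kind described: a global image-text similarity is turned into per-class weights that rescale per-pixel class logits; the softmax weighting and the CLIP-style embeddings are assumptions, not LoGoSeg's exact formulation.

```python
import torch

def apply_existence_prior(pixel_logits, img_emb, text_embs, temperature=0.07):
    """Weight per-pixel class logits by a global image-text similarity so that
    classes unlikely to be present in the image are suppressed. Illustrative
    sketch of the general idea only."""
    img = img_emb / img_emb.norm(dim=-1, keepdim=True)          # (B, D)
    txt = text_embs / text_embs.norm(dim=-1, keepdim=True)      # (K, D)
    sim = img @ txt.t() / temperature                            # (B, K)
    prior = sim.softmax(dim=-1)                                  # per-class existence weights
    return pixel_logits * prior[:, :, None, None]                # (B, K, H, W)

B, K, H, W, D = 2, 5, 32, 32, 512
logits = torch.randn(B, K, H, W)
weighted = apply_existence_prior(logits, torch.randn(B, D), torch.randn(K, D))
print(weighted.shape)                                            # torch.Size([2, 5, 32, 32])
```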
[157] Geometric Observability Index: An Operator-Theoretic Framework for Per-Feature Sensitivity, Weak Observability, and Dynamic Effects in SE(3) Pose Estimation
Joe-Mei Feng, Sheng-Wei Yu
Main category: cs.CV
TL;DR: A unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on SE(3), introducing Geometric Observability Index (GOI) to quantify individual measurement contributions through curvature operators and Lie algebraic structure.
Details
Motivation: Classical sensitivity tools fail to explain how individual image features influence pose estimates or why dynamic/inconsistent observations disproportionately distort modern SLAM and structure-from-motion systems. There's a need for a geometrically consistent description of measurement influence that unifies existing theories.
Method: Extends influence function theory to matrix Lie groups and derives an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). Develops Geometric Observability Index (GOI) that quantifies single measurement contributions through curvature operators and Lie algebraic structure, with spectral decomposition along principal directions of observable curvature.
Result: GOI reveals correspondence between weak observability and amplified sensitivity, coincides with Fisher information geometry on SE(3) in population regime (single-measurement analogue of Cramer-Rao bound), explains classical degeneracies, and provides lightweight diagnostic signals for identifying dynamic features and weak observability configurations.
Conclusion: GOI provides a geometrically consistent framework that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through spectral geometry of curvature operators, offering practical diagnostic tools for existing SLAM architectures without modification.
Abstract: We present a unified operator-theoretic framework for analyzing per-feature sensitivity in camera pose estimation on the Lie group SE(3). Classical sensitivity tools - conditioning analyses, Euclidean perturbation arguments, and Fisher information bounds - do not explain how individual image features influence the pose estimate, nor why dynamic or inconsistent observations can disproportionately distort modern SLAM and structure-from-motion systems. To address this gap, we extend influence function theory to matrix Lie groups and derive an intrinsic perturbation operator for left-trivialized M-estimators on SE(3). The resulting Geometric Observability Index (GOI) quantifies the contribution of a single measurement through the curvature operator and the Lie algebraic structure of the observable subspace. GOI admits a spectral decomposition along the principal directions of the observable curvature, revealing a direct correspondence between weak observability and amplified sensitivity. In the population regime, GOI coincides with the Fisher information geometry on SE(3), yielding a single-measurement analogue of the Cramer-Rao bound. The same spectral mechanism explains classical degeneracies such as pure rotation and vanishing parallax, as well as dynamic feature amplification along weak curvature directions. Overall, GOI provides a geometrically consistent description of measurement influence that unifies conditioning analysis, Fisher information geometry, influence function theory, and dynamic scene detectability through the spectral geometry of the curvature operator. Because these quantities arise directly within Gauss-Newton pipelines, the curvature spectrum and GOI also yield lightweight, training-free diagnostic signals for identifying dynamic features and detecting weak observability configurations without modifying existing SLAM architectures.
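To make the influence-function flavor of this analysis concrete, here is a rough sketch in standard M-estimation notation; the symbols ($H$, $g_i$, $\lambda_k$, $v_k$) and the exact form of the index are an illustrative reconstruction, not the paper's definitions. For a left-trivialized M-estimator minimizing $\sum_i \rho(r_i(\xi))$ over $\xi \in \mathfrak{se}(3)$, the first-order influence of a single measurement $i$ behaves like
\[
\Delta\hat{\xi}_i \;\approx\; -H^{-1} g_i,
\qquad g_i = J_i^\top \psi(r_i),
\qquad H = \sum_j J_j^\top W_j J_j = \sum_k \lambda_k v_k v_k^\top,
\]
so a per-feature sensitivity index can be read off the curvature spectrum as
\[
\mathrm{GOI}_i \;\propto\; \sum_k \frac{\langle g_i, v_k\rangle^2}{\lambda_k^2},
\]
making directions with small curvature eigenvalues $\lambda_k$ (weak observability) amplify the influence of that measurement.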
[158] A Mixed Reality System for Robust Manikin Localization in Childbirth Training
Haojie Cheng, Chang Liu, Abhiram Kanneganti, Mahesh Arjandas Choolani, Arundhati Tushar Gosavi, Eng Tat Khoo
Main category: cs.CV
TL;DR: A mixed reality system for childbirth training that combines virtual guidance with tactile manikin interaction, enabling independent practice with haptic feedback and showing superior performance compared to VR training.
Details
Motivation: Medical students face limited practical experience in vaginal births due to shortened clinical rotations, patient reluctance, and unpredictable labor. There's a need to reduce clinicians' instructional burden while enhancing trainees' learning efficiency through technology-enabled independent practice.
Method: Developed a mixed reality system using commercial HMDs with passthrough capability extended by an external RGB-D camera for real-time visual integration of physical training objects. Implemented a coarse-to-fine localization pipeline: first aligns maternal manikin with fiducial markers to define delivery region, then registers pre-scanned neonatal head within this area for spatially accurate overlay of virtual guiding hands.
Result: System achieved accurate and stable manikin localization on standalone headset. In a study with 83 fourth-year medical students, MR training showed significantly higher scores in delivery, post-delivery, and overall task performance compared to VR training, and was consistently preferred by trainees.
Conclusion: The MR system effectively combines virtual guidance with tactile interaction, enabling independent childbirth training with authentic haptic feedback while reducing need for continuous expert supervision, demonstrating practical value for medical education.
Abstract: Opportunities for medical students to gain practical experience in vaginal births are increasingly constrained by shortened clinical rotations, patient reluctance, and the unpredictable nature of labour. To alleviate clinicians’ instructional burden and enhance trainees’ learning efficiency, we introduce a mixed reality (MR) system for childbirth training that combines virtual guidance with tactile manikin interaction, thereby preserving authentic haptic feedback while enabling independent practice without continuous on-site expert supervision. The system extends the passthrough capability of commercial head-mounted displays (HMDs) by spatially calibrating an external RGB-D camera, allowing real-time visual integration of physical training objects. Building on this capability, we implement a coarse-to-fine localization pipeline that first aligns the maternal manikin with fiducial markers to define a delivery region and then registers the pre-scanned neonatal head within this area. This process enables spatially accurate overlay of virtual guiding hands near the manikin, allowing trainees to follow expert trajectories reinforced by haptic interaction. Experimental evaluations demonstrate that the system achieves accurate and stable manikin localization on a standalone headset, ensuring practical deployment without external computing resources. A large-scale user study involving 83 fourth-year medical students was subsequently conducted to compare MR-based and virtual reality (VR)-based childbirth training. Four senior obstetricians independently assessed performance using standardized criteria. Results showed that MR training achieved significantly higher scores in delivery, post-delivery, and overall task performance, and was consistently preferred by trainees over VR training.
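A rough coarse-to-fine sketch with Open3D: a fiducial-marker pose initializes point-to-point ICP between the pre-scanned model and the depth point cloud; the correspondence distance and the toy data are illustrative, and the marker detector itself is assumed to be external.

```python
import numpy as np
import open3d as o3d

def coarse_to_fine_register(scene_points, model_points, marker_pose):
    """Coarse-to-fine localization sketch: start from a fiducial-marker pose
    estimate (4x4 matrix from an external detector), then refine with
    point-to-point ICP. Illustrative only, not the paper's pipeline."""
    src = o3d.geometry.PointCloud()
    src.points = o3d.utility.Vector3dVector(model_points)    # pre-scanned model
    tgt = o3d.geometry.PointCloud()
    tgt.points = o3d.utility.Vector3dVector(scene_points)    # RGB-D depth points

    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_correspondence_distance=0.02, init=marker_pose,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation

# Toy example: the "scene" is the model shifted by 1 cm; the marker gives identity.
model = np.random.rand(500, 3)
scene = model + np.array([0.01, 0.0, 0.0])
T = coarse_to_fine_register(scene, model, np.eye(4))
print(np.round(T[:3, 3], 3))                                  # approximately [0.01, 0, 0]
```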
[159] EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality
Haojie Cheng, Shaun Jing Heng Ong, Shaoyu Cai, Aiden Tat Yang Koh, Fuxi Ouyang, Eng Tat Khoo
Main category: cs.CV
TL;DR: EgoPoseVR: An end-to-end framework for accurate egocentric full-body pose estimation in VR using headset motion cues fused with RGB-D observations through cross-attention and kinematic optimization.
Details
Motivation: Immersive VR applications need accurate, temporally coherent full-body pose tracking. Current head-mounted camera approaches face challenges in VR HMDs including temporal instability, inaccurate lower-body estimation, and lack of real-time performance.
Method: Dual-modality fusion pipeline integrating headset motion cues with egocentric RGB-D observations. Uses spatiotemporal encoder for frame- and joint-level representations, fused via cross-attention. Includes kinematic optimization module imposing constraints from HMD signals. Trained on large-scale synthetic dataset of 1.8M temporally aligned HMD and RGB-D frames.
Result: Outperforms state-of-the-art egocentric pose estimation models. User study shows significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods.
Conclusion: EgoPoseVR enables robust full-body pose tracking, offering practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
Abstract: Immersive virtual reality (VR) applications demand accurate, temporally coherent full-body pose tracking. Recent head-mounted camera-based approaches show promise in egocentric pose estimation, but encounter challenges when applied to VR head-mounted displays (HMDs), including temporal instability, inaccurate lower-body estimation, and the lack of real-time performance. To address these limitations, we present EgoPoseVR, an end-to-end framework for accurate egocentric full-body pose estimation in VR that integrates headset motion cues with egocentric RGB-D observations through a dual-modality fusion pipeline. A spatiotemporal encoder extracts frame- and joint-level representations, which are fused via cross-attention to fully exploit complementary motion cues across modalities. A kinematic optimization module then imposes constraints from HMD signals, enhancing the accuracy and stability of pose estimation. To facilitate training and evaluation, we introduce a large-scale synthetic dataset of over 1.8 million temporally aligned HMD and RGB-D frames across diverse VR scenarios. Experimental results show that EgoPoseVR outperforms state-of-the-art egocentric pose estimation models. A user study in real-world scenes further shows that EgoPoseVR achieved significantly higher subjective ratings in accuracy, stability, embodiment, and intention for future use compared to baseline methods. These results show that EgoPoseVR enables robust full-body pose tracking, offering a practical solution for accurate VR embodiment without requiring additional body-worn sensors or room-scale tracking systems.
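A minimal sketch of the dual-modality fusion step: visual tokens attend to headset-motion tokens through cross-attention with a residual connection; dimensions, token counts, and the single-layer design are placeholders, not the paper's encoder.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse headset-motion tokens with egocentric RGB-D tokens via cross-attention:
    visual tokens query the motion tokens. Sizes are illustrative only."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, motion_tokens: torch.Tensor):
        fused, _ = self.attn(query=visual_tokens, key=motion_tokens,
                             value=motion_tokens)
        return self.norm(visual_tokens + fused)        # residual + norm

fusion = CrossModalFusion()
rgbd = torch.randn(2, 64, 256)                          # per-frame RGB-D tokens
hmd = torch.randn(2, 16, 256)                           # HMD motion tokens
print(fusion(rgbd, hmd).shape)                          # torch.Size([2, 64, 256])
```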
[160] CAViT – Channel-Aware Vision Transformer for Dynamic Feature Fusion
Aon Safdar, Mohamed Saadeldin
Main category: cs.CV
TL;DR: CAViT introduces a dual-attention Vision Transformer that replaces static MLPs with dynamic channel-wise self-attention, enabling content-aware feature interaction and improved performance with fewer parameters.
Details
Motivation: Standard Vision Transformers use static MLPs for channel mixing that lack adaptability to input content, limiting their ability to dynamically recalibrate feature representations based on global image context.Method: CAViT uses a dual-attention architecture where each Transformer block performs spatial self-attention followed by channel-wise self-attention, replacing the static MLP with a dynamic, attention-based mechanism for feature interaction.
Result: CAViT outperforms standard ViT baseline by up to +3.6% accuracy across five benchmark datasets (natural and medical domains), reduces parameters and FLOPs by over 30%, and produces sharper, semantically meaningful attention maps.
Conclusion: The attention-driven token mixing strategy enhances representational expressiveness without increasing depth or complexity, providing a more unified and content-aware approach to vision transformer architecture design.
Abstract: Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static, relying on fixed multilayer perceptrons (MLPs) that lack adaptability to input content. We introduce ‘CAViT’, a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified and content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT across five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to +3.6% in accuracy, while reducing parameter count and FLOPs by over 30%. Qualitative attention maps reveal sharper and semantically meaningful activation patterns, validating the effectiveness of our attention-driven token mixing.
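The architectural change is easy to picture in code: spatial self-attention over patch tokens, then channel-wise self-attention over the transposed token matrix in place of the usual MLP. The block below is a simplified guess (single-head channel attention, no learned channel projections), not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Spatial self-attention followed by channel-wise self-attention (illustrative sizes)."""
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, N, D) patch tokens
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h)[0]              # mix across spatial positions
        h = self.norm2(x).transpose(1, 2)                  # (B, D, N): channels become tokens
        attn = torch.softmax(h @ h.transpose(1, 2) / h.shape[-1] ** 0.5, dim=-1)  # (B, D, D)
        x = x + (attn @ h).transpose(1, 2)                 # mix across channels, back to (B, N, D)
        return x

block = DualAttentionBlock()
tokens = torch.randn(2, 196, 192)   # 14x14 patch tokens
print(block(tokens).shape)          # torch.Size([2, 196, 192])
```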
[161] Multi-instance robust fitting for non-classical geometric models
Zongliang Zhang, Shuxiang Li, Xingwang Huang, Zongyue Wang
Main category: cs.CV
TL;DR: A method for robust multi-instance fitting of non-classical models from noisy data using a model-to-data error estimator and meta-heuristic optimization.
Details
Motivation: Existing robust fitting methods focus on classical models (lines, circles, planes) and single instances, but lack methods for multi-instance fitting of non-classical models (spiral curves, procedural character models, free-form surfaces) from noisy data.Method: Formulates multi-instance fitting as optimization with novel estimator based on model-to-data error (handles outliers without predefined threshold) and meta-heuristic optimizer for global optimum (since the estimator is non-differentiable).
Result: Demonstrated effectiveness through experiments on various non-classical models; code available on GitHub
Conclusion: Proposed method successfully addresses multi-instance fitting of non-classical models from noisy data using robust estimator and optimization approach
Abstract: Most existing robust fitting methods are designed for classical models, such as lines, circles, and planes. In contrast, fewer methods have been developed to robustly handle non-classical models, such as spiral curves, procedural character models, and free-form surfaces. Furthermore, existing methods primarily focus on reconstructing a single instance of a non-classical model. This paper aims to reconstruct multiple instances of non-classical models from noisy data. We formulate this multi-instance fitting task as an optimization problem, which comprises an estimator and an optimizer. Specifically, we propose a novel estimator based on the model-to-data error, capable of handling outliers without a predefined error threshold. Since the proposed estimator is non-differentiable with respect to the model parameters, we employ a meta-heuristic algorithm as the optimizer to seek the global optimum. The effectiveness of our method is demonstrated through experimental results on various non-classical models. The code is available at https://github.com/zhangzongliang/fitting.
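A toy sketch of the model-to-data error idea: points sampled from a parametric spiral (standing in for a non-classical model) are scored by their distance to the nearest data point, so outliers in the data are ignored without any threshold, and a derivative-free global optimizer searches the parameter space. The single-instance setting, the spiral model, and the choice of differential evolution are assumptions; extending to multiple instances (e.g. fitting sequentially and removing explained points) is not shown.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.spatial import cKDTree

def spiral_points(params, n=200):
    a, b = params
    t = np.linspace(0.0, 4 * np.pi, n)
    r = a + b * t
    return np.stack([r * np.cos(t), r * np.sin(t)], axis=1)

def model_to_data_error(params, data_tree):
    pts = spiral_points(params)
    dists, _ = data_tree.query(pts)   # distance from each model point to its nearest data point
    return dists.mean()

rng = np.random.default_rng(0)
inliers = spiral_points([0.5, 0.3]) + rng.normal(scale=0.05, size=(200, 2))
outliers = rng.uniform(-5, 5, size=(100, 2))      # clutter that a data-to-model error would punish
tree = cKDTree(np.vstack([inliers, outliers]))

# Meta-heuristic (global, derivative-free) optimizer, since the estimator is non-differentiable.
result = differential_evolution(model_to_data_error, bounds=[(0.1, 1.0), (0.1, 1.0)],
                                args=(tree,), seed=0)
print("estimated (a, b):", result.x)   # should be close to (0.5, 0.3)
```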
[162] Unified Sensor Simulation for Autonomous Driving
Nikolay Patakin, Arsenii Shirokov, Anton Konushin, Dmitry Senushkin
Main category: cs.CV
TL;DR: XSIM is a sensor simulation framework for autonomous driving that extends 3D Gaussian splatting with rolling-shutter modeling and specialized solutions for spherical cameras like LiDARs, achieving state-of-the-art performance on driving datasets.
Details
Motivation: Existing 3D Gaussian splatting methods struggle with sensor simulation for autonomous driving, particularly with spherical cameras (LiDARs) due to cyclic projection and time discontinuities at azimuth boundaries, leading to incorrect particle projection.Method: Extends 3D Gaussian splatting with generalized rolling-shutter modeling, introduces phase modeling to handle temporal/shape discontinuities at azimuth borders, and uses extended 3D Gaussian representation with dual opacity parameters to resolve geometry-color mismatches.
Result: Achieves state-of-the-art performance on Waymo Open Dataset, Argoverse 2, and PandaSet, with enhanced scene representations showing improved geometric consistency and photorealistic appearance.
Conclusion: XSIM provides a unified sensor simulation framework for autonomous driving that effectively handles complex sensor distortions in dynamic environments, particularly addressing challenges with spherical cameras like LiDARs.
Abstract: In this work, we introduce \textbf{XSIM}, a sensor simulation framework for autonomous driving. XSIM extends 3DGUT splatting with a generalized rolling-shutter modeling tailored for autonomous driving applications. Our framework provides a unified and flexible formulation for appearance and geometric sensor modeling, enabling rendering of complex sensor distortions in dynamic environments. We identify spherical cameras, such as LiDARs, as a critical edge case for existing 3DGUT splatting due to cyclic projection and time discontinuities at azimuth boundaries, leading to incorrect particle projection. To address this issue, we propose a phase modeling mechanism that explicitly accounts for temporal and shape discontinuities of Gaussians projected by the Unscented Transform at azimuth borders. In addition, we introduce an extended 3D Gaussian representation that incorporates two distinct opacity parameters to resolve mismatches between geometry and color distributions. As a result, our framework provides enhanced scene representations with improved geometric consistency and photorealistic appearance. We evaluate our framework extensively on multiple autonomous driving datasets, including Waymo Open Dataset, Argoverse 2, and PandaSet. Our framework consistently outperforms strong recent baselines and achieves state-of-the-art performance across all datasets. The source code is publicly available at \href{https://github.com/whesense/XSIM}{https://github.com/whesense/XSIM}.
[163] ROMAN: Reward-Orchestrated Multi-Head Attention Network for Autonomous Driving System Testing
Jianlei Chi, Yuzhen Wu, Jiaxuan Hou, Xiaodong Zhang, Ming Fan, Suhui Sun, Weijun Dai, Bo Li, Jianguo Sun, Jun Sun
Main category: cs.CV
TL;DR: ROMAN: A novel scenario generation approach for ADS testing that combines multi-head attention with traffic law weighting to generate high-risk violation scenarios for autonomous vehicle safety evaluation.
Details
Motivation: Current ADS testing approaches face challenges in generating complex, high-risk law-breaking scenarios and fail to account for complex multi-vehicle interactions. There's a need for more thorough testing of autonomous vehicles against traffic law violations.Method: ROMAN combines a multi-head attention network to model interactions among vehicles, traffic signals, and other factors, with a traffic law weighting mechanism that uses an LLM-based risk weighting module to evaluate violations based on severity and occurrence dimensions.
Result: ROMAN surpassed state-of-the-art tools ABLE and LawBreaker, achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while maintaining greater scenario diversity. Only ROMAN successfully generated violation scenarios for every clause of input traffic laws.
Conclusion: ROMAN enables more thorough and targeted ADS evaluation by generating high-risk violation scenarios, addressing current limitations in autonomous vehicle testing and improving safety assessment capabilities.
Abstract: Automated Driving System (ADS) acts as the brain of autonomous vehicles, responsible for their safety and efficiency. Safe deployment requires thorough testing in diverse real-world scenarios and compliance with traffic laws like speed limits, signal obedience, and right-of-way rules. Violations like running red lights or speeding pose severe safety risks. However, current testing approaches face significant challenges: limited ability to generate complex and high-risk law-breaking scenarios, and failing to account for complex interactions involving multiple vehicles and critical situations. To address these challenges, we propose ROMAN, a novel scenario generation approach for ADS testing that combines a multi-head attention network with a traffic law weighting mechanism. ROMAN is designed to generate high-risk violation scenarios to enable more thorough and targeted ADS evaluation. The multi-head attention mechanism models interactions among vehicles, traffic signals, and other factors. The traffic law weighting mechanism implements a workflow that leverages an LLM-based risk weighting module to evaluate violations based on the two dimensions of severity and occurrence. We have evaluated ROMAN by testing the Baidu Apollo ADS within the CARLA simulation platform and conducting extensive experiments to measure its performance. Experimental results demonstrate that ROMAN surpassed state-of-the-art tools ABLE and LawBreaker by achieving 7.91% higher average violation count than ABLE and 55.96% higher than LawBreaker, while also maintaining greater scenario diversity. In addition, only ROMAN successfully generated violation scenarios for every clause of the input traffic laws, enabling it to identify more high-risk violations than existing approaches.
[164] UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei
Main category: cs.CV
TL;DR: UniSurg is a video-native foundation model for surgical video analysis that shifts from pixel-level reconstruction to latent motion prediction, achieving state-of-the-art performance across multiple surgical understanding tasks.
Details
Motivation: Current surgical video analysis models waste capacity on low-level visual details (smoke, reflections, fluid motion) rather than focusing on semantic structures essential for surgical understanding. There's a need for models that prioritize motion and semantic information over pixel-level reconstruction.Method: Built on Video Joint Embedding Predictive Architecture (V-JEPA), with three key innovations: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation for relational consistency, and 3) feature diversity regularization to prevent representation collapse. Pretrained on UniSurg-15M, the largest surgical video dataset (3,658 hours from 50 sources across 13 anatomical regions).
Result: Significantly outperforms state-of-the-art methods: +14.6% F1 on EgoSurgery, +10.3% on PitVis for workflow recognition; 39.54% mAP-IVT on CholecT50 for action triplet recognition; and strong performance on skill assessment, polyp segmentation, and depth estimation.
Conclusion: UniSurg establishes a new standard for universal, motion-oriented surgical video understanding by shifting the learning paradigm from pixel-level reconstruction to latent motion prediction, demonstrating superior performance across diverse surgical tasks.
Abstract: While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details - such as smoke, specular reflections, and fluid motion - rather than semantic structures essential for surgical understanding. We present UniSurg, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), UniSurg introduces three key technical innovations tailored to surgical videos: 1) motion-guided latent prediction to prioritize semantically meaningful regions, 2) spatiotemporal affinity self-distillation to enforce relational consistency, and 3) feature diversity regularization to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate UniSurg-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that UniSurg significantly outperforms state-of-the-art methods on surgical workflow recognition (+14.6% F1 on EgoSurgery, +10.3% on PitVis), action triplet recognition (39.54% mAP-IVT on CholecT50), skill assessment, polyp segmentation, and depth estimation. These results establish UniSurg as a new standard for universal, motion-oriented surgical video understanding.
[165] Enhancing Personality Recognition by Comparing the Predictive Power of Traits, Facets, and Nuances
Amir Ansari, Jana Subirana, Bruna Silva, Sergio Escalera, David Gallardo-Pujol, Cristina Palmero
Main category: cs.CV
TL;DR: Personality recognition from audiovisual data improves when using granular personality nuances rather than broad trait scores, with transformer models achieving up to 74% error reduction.
Details
Motivation: Current personality recognition models rely on broad trait scores as ground truth, which limits generalization because similar trait scores can manifest through diverse, context-dependent behaviors. There's a need to explore more granular hierarchical levels of personality assessment.Method: Used the UDIVA v0.5 dataset and trained a transformer-based model with cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms to predict personality at different hierarchical levels (traits, facets, and nuances).
Result: Nuance-level models consistently outperformed facet and trait-level models, reducing mean squared error by up to 74% across different interaction scenarios.
Conclusion: Granular personality assessment at the nuance level significantly improves personality recognition from audiovisual interaction data compared to using broad trait scores.
Abstract: Personality is a complex, hierarchical construct typically assessed through item-level questionnaires aggregated into broad trait scores. Personality recognition models aim to infer personality traits from different sources of behavioral data. However, reliance on broad trait scores as ground truth, combined with limited training data, poses challenges for generalization, as similar trait scores can manifest through diverse, context-dependent behaviors. In this work, we explore the predictive impact of the more granular hierarchical levels of the Big-Five Personality Model, facets and nuances, to enhance personality recognition from audiovisual interaction data. Using the UDIVA v0.5 dataset, we trained a transformer-based model including cross-modal (audiovisual) and cross-subject (dyad-aware) attention mechanisms. Results show that nuance-level models consistently outperform facet and trait-level models, reducing mean squared error by up to 74% across interaction scenarios.
[166] ShapeUP: Scalable Image-Conditioned 3D Editing
Inbar Gat, Dana Cohen-Bar, Guy Levy, Elad Richardson, Daniel Cohen-Or
Main category: cs.CV
TL;DR: ShapeUP is a scalable 3D editing framework that uses image-conditioned latent-to-latent translation within a 3D Diffusion Transformer to enable precise 3D manipulation while maintaining structural consistency.
Details
Motivation: Existing 3D editing methods face trade-offs between visual controllability, geometric consistency, and scalability - optimization-based methods are slow, multi-view 2D propagation suffers from visual drift, and training-free latent manipulation is limited by frozen priors.Method: ShapeUP formulates editing as supervised latent-to-latent translation using a 3D Diffusion Transformer (DiT). It’s trained on triplets of source 3D shape, edited 2D image, and corresponding edited 3D shape, learning direct mapping within native 3D representation.
Result: ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering fine-grained visual control over local/global edits with implicit mask-free localization while maintaining structural consistency.
Conclusion: ShapeUP presents a robust and scalable paradigm for native 3D content creation that leverages pretrained 3D foundation models while adapting them to editing through supervised training, overcoming limitations of existing approaches.
Abstract: Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
[167] Poster: Camera Tampering Detection for Outdoor IoT Systems
Shadi Attarha, Kanaga Shanmugi, Anna Förster
Main category: cs.CV
TL;DR: Two approaches (rule-based and deep learning) for detecting camera tampering in still images, with deep learning offering higher accuracy but rule-based being better for resource-constrained scenarios.
Details
Motivation: Smart cameras in outdoor surveillance are vulnerable to tampering (vandalism/environmental damage), but detecting tampering in still images is challenging without temporal video frames.Method: Proposed two methods: 1) Rule-based approach using heuristic rules, and 2) Deep-learning-based method using neural networks for tampering detection in still images.
Result: Deep learning model provides higher accuracy, while rule-based method is more suitable for resource-limited scenarios where prolonged calibration is impractical.
Conclusion: Both methods have trade-offs between accuracy and computational requirements; publicly available datasets (normal, blurred, rotated images) are provided to support further research.
Abstract: Recently, the use of smart cameras in outdoor settings has grown to improve surveillance and security. Nonetheless, these systems are susceptible to tampering, whether from deliberate vandalism or harsh environmental conditions, which can undermine their monitoring effectiveness. In this context, detecting camera tampering is more challenging when a camera is capturing still images rather than video as there is no sequence of continuous frames over time. In this study, we propose two approaches for detecting tampered images: a rule-based method and a deep-learning-based method. The aim is to evaluate how each method performs in terms of accuracy, computational demands, and the data required for training when applied to real-world scenarios. Our results show that the deep-learning model provides higher accuracy, while the rule-based method is more appropriate for scenarios where resources are limited and a prolonged calibration phase is impractical. We also offer publicly available datasets with normal, blurred, and rotated images to support the development and evaluation of camera tampering detection methods, addressing the need for such resources.
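As an illustration of what the rule-based branch might look like, the sketch below flags blur via Laplacian variance and large view changes via a histogram comparison against a reference image; the specific rules, thresholds, and function names are assumptions, not the paper's method.

```python
import cv2

def is_blurred(gray, threshold=100.0):
    # Low variance of the Laplacian indicates a defocused or smeared lens.
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def view_changed(gray, reference_gray, min_correlation=0.6):
    # Coarse check: a large drop in histogram correlation suggests occlusion or a moved camera.
    h1 = cv2.calcHist([gray], [0], None, [64], [0, 256])
    h2 = cv2.calcHist([reference_gray], [0], None, [64], [0, 256])
    cv2.normalize(h1, h1)
    cv2.normalize(h2, h2)
    return cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL) < min_correlation

def check_tampering(image_path, reference_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    ref = cv2.imread(reference_path, cv2.IMREAD_GRAYSCALE)
    return {"blurred": is_blurred(gray), "view_changed": view_changed(gray, ref)}
```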
[168] Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization
Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, Qingming Huang
Main category: cs.CV
TL;DR: A multi-task learning framework for point-supervised temporal action localization that introduces three self-supervised temporal understanding tasks to improve model’s temporal reasoning capabilities.
Details
Motivation: Existing point-supervised temporal action localization approaches only use snippet-level classification without explicit modeling of temporal relationships among frames, which is crucial for understanding how actions are defined and localizing full action frames.Method: Proposes a multi-task learning framework with three self-supervised temporal understanding tasks: (1) Action Completion, (2) Action Order Understanding, and (3) Action Regularity Understanding to help models understand temporal consistency of actions across videos.
Result: Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
Conclusion: This is the first attempt to explicitly explore temporal consistency for point supervision action localization, showing that understanding temporal relationships among frames significantly improves action localization performance.
Abstract: Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (\textit{i.e.}, labeling only a single frame per action instance) to train a model to effectively locate action instances within untrimmed videos. Most existing approaches design the task head with only a point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding the temporal relationships of frames is crucial because it can help a model understand how an action is defined and therefore benefits localizing the full extent of an action. To this end, in this paper, we design a multi-task learning framework that fully utilizes point supervision to boost the model’s temporal understanding capability for action localization. Specifically, we design three self-supervised temporal understanding tasks: (i) Action Completion, (ii) Action Order Understanding, and (iii) Action Regularity Understanding. These tasks help a model understand the temporal consistency of actions across videos. To the best of our knowledge, this is the first attempt to explicitly explore temporal consistency for point-supervised action localization. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of the proposed method compared to several state-of-the-art approaches.
[169] Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification
Lexiang Hu, Youze Xue, Dian Li, Gang Liu, Zhouchen Lin
Main category: cs.CV
TL;DR: AGFF-Embed is a novel multimodal embedding method that adaptively fuses global and fine-grained semantic information from MLLMs, achieving SOTA performance on multimodal benchmarks.
Details
Motivation: Current multimodal embeddings (CLIP-based and MLLM-based) only capture global semantics, but complex scenarios require both global and fine-grained understanding. Existing methods lack proper fusion mechanisms for these hybrid perceptual patterns.Method: Proposes AGFF-Embed which prompts MLLMs to generate multiple embeddings focusing on different semantic dimensions, then adaptively aggregates them. Uses Explicit Gradient Amplification (EGA) for in-batch hard negatives enhancement without dataset editing.
Result: Achieves state-of-the-art performance on MMEB and MMVP-VLM benchmarks for both general and fine-grained multimodal understanding compared to other embedding models.
Conclusion: AGFF-Embed effectively addresses the limitation of current multimodal embeddings by adaptively fusing global and fine-grained information, demonstrating superior performance on complex multimodal tasks.
Abstract: Multimodal embeddings serve as a bridge for aligning vision and language, with the two primary implementations – CLIP-based and MLLM-based embedding models – both limited to capturing only global semantic information. Although numerous studies have focused on fine-grained understanding, we observe that complex scenarios currently targeted by MLLM embeddings often involve a hybrid perceptual pattern of both global and fine-grained elements, thus necessitating a compatible fusion mechanism. In this paper, we propose Adaptive Global and Fine-grained perceptual Fusion for MLLM Embeddings (AGFF-Embed), a method that prompts the MLLM to generate multiple embeddings focusing on different dimensions of semantic information, which are then adaptively and smoothly aggregated. Furthermore, we adapt AGFF-Embed with the Explicit Gradient Amplification (EGA) technique to achieve in-batch hard negatives enhancement without requiring fine-grained editing of the dataset. Evaluation on the MMEB and MMVP-VLM benchmarks shows that AGFF-Embed comprehensively achieves state-of-the-art performance in both general and fine-grained understanding compared to other multimodal embedding models.
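A small sketch of the adaptive-fusion idea: several embeddings produced for the same input (one global, several fine-grained) are combined with input-dependent softmax weights. The gating network, dimensions, and class name are assumptions; the paper's aggregation and the EGA training technique are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveEmbeddingFusion(nn.Module):
    """Aggregate K embeddings (global + fine-grained) with learned, input-dependent weights."""
    def __init__(self, dim=1024, num_views=4):
        super().__init__()
        self.gate = nn.Linear(dim * num_views, num_views)

    def forward(self, embeddings):
        # embeddings: (B, K, D), one row per semantic focus prompted from the MLLM
        weights = F.softmax(self.gate(embeddings.flatten(1)), dim=-1)   # (B, K)
        fused = (weights.unsqueeze(-1) * embeddings).sum(dim=1)         # (B, D)
        return F.normalize(fused, dim=-1)                               # unit norm for retrieval

fusion = AdaptiveEmbeddingFusion()
print(fusion(torch.randn(8, 4, 1024)).shape)  # torch.Size([8, 1024])
```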
[170] Depth as Prior Knowledge for Object Detection
Moussa Kassem Sbeyti, Nadja Klein
Main category: cs.CV
TL;DR: DepthPrior improves small object detection using depth information as prior knowledge without modifying detector architectures, achieving significant performance gains across multiple benchmarks.
Details
Motivation: Small and distant objects are challenging to detect due to scale variation, low resolution, and background clutter. While depth information can help, existing approaches require complex architectural modifications. The paper aims to provide a framework that uses depth as prior knowledge without changing detector architectures.Method: DepthPrior uses depth as prior knowledge rather than fused features. It consists of: 1) Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training to focus on distant/small objects, and 2) Depth-Aware Confidence Thresholding (DCT) during inference to adjust detection confidence based on depth. Only requires initial depth estimation overhead.
Result: Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) show DepthPrior achieves up to +9% mAP_S and +7% mAR_S for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). Works without additional sensors or architectural changes.
Conclusion: DepthPrior effectively improves small object detection using depth as prior knowledge, providing significant performance benefits without requiring architectural modifications or additional sensors. The framework is generalizable across different detectors and datasets.
Abstract: Detecting small and distant objects remains challenging for object detectors due to scale variation, low resolution, and background clutter. Safety-critical applications require reliable detection of these objects for safe planning. Depth information can improve detection, but existing approaches require complex, model-specific architectural modifications. We provide a theoretical analysis followed by an empirical investigation of the depth-detection relationship. Together, they explain how depth causes systematic performance degradation and why depth-informed supervision mitigates it. We introduce DepthPrior, a framework that uses depth as prior knowledge rather than as a fused feature, providing comparable benefits without modifying detector architectures. DepthPrior consists of Depth-Based Loss Weighting (DLW) and Depth-Based Loss Stratification (DLS) during training, and Depth-Aware Confidence Thresholding (DCT) during inference. The only overhead is the initial cost of depth estimation. Experiments across four benchmarks (KITTI, MS COCO, VisDrone, SUN RGB-D) and two detectors (YOLOv11, EfficientDet) demonstrate the effectiveness of DepthPrior, achieving up to +9% mAP$_S$ and +7% mAR$_S$ for small objects, with inference recovery rates as high as 95:1 (true vs. false detections). DepthPrior offers these benefits without additional sensors, architectural changes, or performance costs. Code is available at https://github.com/mos-ks/DepthPrior.
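The loss-weighting component can be illustrated in a few lines: per-object detection losses are up-weighted as a function of object depth, so distant (typically small) objects contribute more to training. The weighting function below is a hypothetical stand-in for the paper's DLW/DLS formulation.

```python
import torch

def depth_loss_weights(depths, alpha=1.0, max_weight=3.0):
    # Normalize depths to [0, 1] and up-weight far objects; clamp to avoid extreme weights.
    d = depths / (depths.max() + 1e-6)
    return (1.0 + alpha * d).clamp(max=max_weight)

def weighted_detection_loss(per_object_losses, depths):
    w = depth_loss_weights(depths)
    return (w * per_object_losses).sum() / w.sum()

# Example: three objects; the farthest one contributes most to the loss.
losses = torch.tensor([0.8, 0.5, 0.9])
depths = torch.tensor([5.0, 20.0, 60.0])
print(weighted_detection_loss(losses, depths))
```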
[171] Neuro-Inspired Visual Pattern Recognition via Biological Reservoir Computing
Luca Ciampi, Ludovico Iannello, Fabrizio Tonelli, Gabriele Lagani, Angelo Di Garbo, Federico Cremisi, Giuseppe Amato
Main category: cs.CV
TL;DR: Biological reservoir computing using living cortical neurons as physical reservoir for visual pattern recognition tasks, achieving accurate classification despite biological variability.
Details
Motivation: To develop a neuro-inspired computing system that leverages actual biological neural circuits rather than artificial models, integrating living neural substrates into neuromorphic computing frameworks and incorporating biological principles into machine learning.Method: Uses in vitro cultured cortical neurons as physical reservoir, with high-density multi-electrode array for stimulation and readout across hundreds of channels. Input patterns delivered through selected electrodes while others capture neural responses. Linear readout layer (single-layer perceptron) trained to classify reservoir states for visual pattern recognition tasks.
Result: System consistently generates high-dimensional representations supporting accurate classification across tasks of increasing difficulty (pointwise stimuli, oriented bars, clock-digit-like shapes, MNIST handwritten digits), despite biological variability from noise, spontaneous activity, and inter-session differences.
Conclusion: In vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing and informing biologically grounded computational models.
Abstract: In this paper, we present a neuro-inspired approach to reservoir computing (RC) in which a network of in vitro cultured cortical neurons serves as the physical reservoir. Rather than relying on artificial recurrent models to approximate neural dynamics, our biological reservoir computing (BRC) system leverages the spontaneous and stimulus-evoked activity of living neural circuits as its computational substrate. A high-density multi-electrode array (HD-MEA) provides simultaneous stimulation and readout across hundreds of channels: input patterns are delivered through selected electrodes, while the remaining ones capture the resulting high-dimensional neural responses, yielding a biologically grounded feature representation. A linear readout layer (single-layer perceptron) is then trained to classify these reservoir states, enabling the living neural network to perform static visual pattern-recognition tasks within a computer-vision framework. We evaluate the system across a sequence of tasks of increasing difficulty, ranging from pointwise stimuli to oriented bars, clock-digit-like shapes, and handwritten digits from the MNIST dataset. Despite the inherent variability of biological neural responses (arising from noise, spontaneous activity, and inter-session differences), the system consistently generates high-dimensional representations that support accurate classification. These results demonstrate that in vitro cortical networks can function as effective reservoirs for static visual pattern recognition, opening new avenues for integrating living neural substrates into neuromorphic computing frameworks. More broadly, this work contributes to the effort to incorporate biological principles into machine learning and supports the goals of neuro-inspired vision by illustrating how living neural systems can inform the design of efficient and biologically grounded computational models.
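Only the readout stage lends itself to a code sketch, since the reservoir here is a living culture; below, a random synthetic matrix stands in for the HD-MEA channel responses, and a linear classifier plays the role of the single-layer perceptron readout. The synthetic data and classifier choice are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stimuli, n_channels, n_classes = 400, 256, 4
labels = rng.integers(0, n_classes, size=n_stimuli)

# Pretend each stimulus class evokes a characteristic high-dimensional reservoir response plus noise.
class_templates = rng.normal(size=(n_classes, n_channels))
reservoir_states = class_templates[labels] + rng.normal(scale=1.0, size=(n_stimuli, n_channels))

X_train, X_test, y_train, y_test = train_test_split(
    reservoir_states, labels, test_size=0.25, random_state=0)
readout = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # linear readout layer
print("readout accuracy:", readout.score(X_test, y_test))
```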
[172] FMPose3D: monocular 3D pose estimation via flow matching
Ti Wang, Xiaohang Yu, Mackenzie Weygandt Mathis
Main category: cs.CV
TL;DR: FMPose3D: A flow matching framework for efficient 3D pose estimation that generates multiple plausible pose hypotheses from 2D inputs using ODE-based sampling with few integration steps.
Details
Motivation: Monocular 3D pose estimation is ill-posed due to depth ambiguity and occlusions, requiring probabilistic methods. Diffusion models work well but are computationally expensive due to many denoising steps. Flow matching offers a more efficient alternative.Method: Uses Flow Matching to learn a velocity field defined by an ODE, enabling efficient generation of 3D pose samples with few integration steps. Formulates pose estimation as conditional distribution transport from Gaussian prior to plausible 3D poses conditioned on 2D inputs. Includes Reprojection-based Posterior Expectation Aggregation (RPEA) to select best hypothesis.
Result: Surpasses existing methods on Human3.6M and MPI-INF-3DHP benchmarks, achieves SOTA on Animal3D and CtrlAni3D datasets, demonstrating strong performance across both human and animal 3D pose domains.
Conclusion: FMPose3D provides an efficient flow matching framework for probabilistic 3D pose estimation that generates diverse hypotheses with few integration steps, outperforming diffusion-based approaches while being computationally efficient.
Abstract: Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
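The few-step sampling loop at the heart of flow matching is compact enough to sketch: a learned velocity field is integrated with a handful of Euler steps from Gaussian noise to a 3D pose, and different noise seeds yield different hypotheses. The placeholder MLP, joint counts, and step count are illustrative, not the paper's architecture, and the RPEA aggregation step is omitted.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Placeholder velocity network v(x, t, cond) over flattened 3D joints."""
    def __init__(self, pose_dim=17 * 3, cond_dim=17 * 2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, x, t, cond):
        return self.net(torch.cat([x, cond, t], dim=-1))

@torch.no_grad()
def sample_hypotheses(model, cond_2d, num_hypotheses=10, steps=4, pose_dim=17 * 3):
    cond = cond_2d.expand(num_hypotheses, -1)
    x = torch.randn(num_hypotheses, pose_dim)      # different noise seeds -> different hypotheses
    for i in range(steps):                         # few-step Euler integration of the ODE
        t = torch.full((num_hypotheses, 1), i / steps)
        x = x + model(x, t, cond) / steps
    return x.view(num_hypotheses, 17, 3)

model = VelocityField()
poses = sample_hypotheses(model, torch.randn(1, 17 * 2))   # conditioned on 2D keypoints
print(poses.shape)  # torch.Size([10, 17, 3])
```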
[173] ReText: Text Boosts Generalization in Image-Based Person Re-identification
Timur Mamedov, Karina Kvanchiani, Anton Konushin, Vadim Konushin
Main category: cs.CV
TL;DR: ReText: A multimodal person re-identification method that combines multi-camera Re-ID data with single-camera data enriched by textual descriptions, using joint optimization of Re-ID, image-text matching, and text-guided image reconstruction tasks.
Details
Motivation: Generalizable person Re-ID across unseen domains is challenging due to domain gaps. While single-camera data is easy to collect, it lacks cross-view variation complexity. The paper aims to enrich single-camera data with textual descriptions to improve generalization.Method: ReText trains on a mixture of multi-camera Re-ID data and single-camera data with textual descriptions. It jointly optimizes three tasks: Re-ID on multi-camera data, image-text matching, and text-guided image reconstruction on single-camera data.
Result: ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks, demonstrating the effectiveness of multimodal joint learning.
Conclusion: This is the first work to explore multimodal joint learning on mixed multi-camera and single-camera data in person Re-ID, showing that textual descriptions can effectively enrich semantic cues for better generalization.
Abstract: Generalizable image-based person re-identification (Re-ID) aims to recognize individuals across cameras in unseen domains without retraining. While multiple existing approaches address the domain gap through complex architectures, recent findings indicate that better generalization can be achieved by stylistically diverse single-camera data. Although this data is easy to collect, it lacks complexity due to minimal cross-view variation. We propose ReText, a novel method trained on a mixture of multi-camera Re-ID data and single-camera data, where the latter is complemented by textual descriptions to enrich semantic cues. During training, ReText jointly optimizes three tasks: (1) Re-ID on multi-camera data, (2) image-text matching, and (3) image reconstruction guided by text on single-camera data. Experiments demonstrate that ReText achieves strong generalization and significantly outperforms state-of-the-art methods on cross-domain Re-ID benchmarks. To the best of our knowledge, this is the first work to explore multimodal joint learning on a mixture of multi-camera and single-camera data in image-based person Re-ID.
[174] Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
Main category: cs.CV
TL;DR: Allocentric Perceiver is a training-free method that uses geometric experts to extract 3D states from images and transforms them into target-centric reference frames, enabling VLMs to better handle allocentric spatial reasoning tasks.
Details
Motivation: VLMs struggle with allocentric spatial queries requiring perspective shifts, where answers depend on reasoning in target-centric frames rather than observed camera views. There's a growing need for spatially grounded capabilities in tasks like Vision-Language Navigation/Action.Method: Uses off-the-shelf geometric experts to recover metric 3D states from images, instantiates query-conditioned allocentric reference frames aligned with instruction intent, transforms reconstructed geometry into target frames, and prompts backbone VLMs with structured, geometry-grounded representations.
Result: Achieves consistent ~10% gains on allocentric tasks while maintaining strong egocentric performance, surpassing both spatial-perception-finetuned models and state-of-the-art open-source/proprietary models across multiple backbone families.
Conclusion: Allocentric Perceiver effectively offloads mental rotation from implicit reasoning to explicit computation, enabling VLMs to handle allocentric spatial reasoning without additional training.
Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction’s semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
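The frame-instantiation step can be illustrated with plain geometry: given metric 3D positions from a geometry expert, every object is re-expressed in a reference frame centred on the target object and aligned with its facing direction, so "left of the chair" becomes a sign test instead of mental rotation. The axis convention and the toy scene below are assumptions.

```python
import numpy as np

def to_allocentric(points_cam, target_pos, target_forward, up=np.array([0.0, 0.0, 1.0])):
    """Express camera-frame points in a frame centred on the target, with y pointing forward."""
    forward = target_forward / np.linalg.norm(target_forward)
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    R = np.stack([right, forward, true_up])   # rows: target-frame axes in camera coordinates
    return (points_cam - target_pos) @ R.T

objects = {"cup": np.array([1.2, 3.0, 0.8]), "lamp": np.array([-0.5, 2.5, 1.5])}
chair_pos, chair_forward = np.array([0.0, 2.0, 0.0]), np.array([0.0, -1.0, 0.0])
for name, p in objects.items():
    x, y, z = to_allocentric(p[None], chair_pos, chair_forward)[0]
    side = "left of" if x < 0 else "right of"
    depth = "in front of" if y > 0 else "behind"
    print(f"{name}: {side} and {depth} the chair")
```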
[175] Focus-Scan-Refine: From Human Visual Perception to Efficient Visual Token Pruning
Enwei Tong, Yuanchao Bai, Yao Zhu, Junjun Jiang, Xianming Liu
Main category: cs.CV
TL;DR: FSR is a plug-and-play pruning framework for vision-language models that mimics human visual question answering: focus on key evidence, scan globally if needed, then refine context without increasing token budget.
Details
Motivation: Vision-language models generate massive visual tokens that increase inference latency and memory footprint. Existing training-free token pruning methods struggle to balance local evidence and global context under aggressive compression.Method: FSR uses a three-step human-inspired approach: 1) Focus on key evidence by combining visual importance with instruction relevance, 2) Scan for complementary context conditioned on the focused set, selecting tokens most different from focused evidence, 3) Refine scanned context by aggregating nearby informative tokens via similarity-based assignment and score-weighted merging.
Result: Extensive experiments across multiple VLM backbones and vision-language benchmarks show FSR consistently improves accuracy-efficiency trade-off over state-of-the-art pruning methods.
Conclusion: FSR provides an effective plug-and-play pruning framework that mimics human visual processing to reduce visual tokens while maintaining performance, addressing efficiency challenges in vision-language models.
Abstract: Vision-language models (VLMs) often generate massive visual tokens that greatly increase inference latency and memory footprint; while training-free token pruning offers a practical remedy, existing methods still struggle to balance local evidence and global context under aggressive compression. We propose Focus-Scan-Refine (FSR), a human-inspired, plug-and-play pruning framework that mimics how humans answer visual questions: focus on key evidence, then scan globally if needed, and refine the scanned context by aggregating relevant details. FSR first focuses on key evidence by combining visual importance with instruction relevance, avoiding the bias toward visually salient but query-irrelevant regions. It then scans for complementary context conditioned on the focused set, selecting tokens that are most different from the focused evidence. Finally, FSR refines the scanned context by aggregating nearby informative tokens into the scan anchors via similarity-based assignment and score-weighted merging, without increasing the token budget. Extensive experiments across multiple VLM backbones and vision-language benchmarks show that FSR consistently improves the accuracy-efficiency trade-off over existing state-of-the-art pruning methods. The source code can be found at https://github.com/ILOT-code/FSR.
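A toy version of the three stages on pre-computed token features: focus keeps the tokens most relevant to an instruction embedding, scan adds the tokens least similar to the focused set, and refine merges every remaining token into its nearest kept anchor. Scoring by cosine similarity to a single query vector and plain averaging in the merge are simplifying assumptions, not the paper's exact scoring or score-weighted merging.

```python
import torch
import torch.nn.functional as F

def focus_scan_refine(tokens, query, keep_focus=32, keep_scan=16):
    # tokens: (N, D) visual tokens; query: (D,) instruction embedding
    t = F.normalize(tokens, dim=-1)
    relevance = t @ F.normalize(query, dim=-1)               # (N,)
    focus_idx = relevance.topk(keep_focus).indices           # Focus: query-relevant evidence

    sim_to_focus = (t @ t[focus_idx].T).max(dim=-1).values   # similarity to nearest focused token
    sim_to_focus[focus_idx] = float("inf")                   # never re-select focused tokens
    scan_idx = (-sim_to_focus).topk(keep_scan).indices       # Scan: most complementary context

    kept_idx = torch.cat([focus_idx, scan_idx])
    kept_set = set(kept_idx.tolist())
    rest = torch.tensor([i for i in range(tokens.shape[0]) if i not in kept_set])
    assign = (t[rest] @ t[kept_idx].T).argmax(dim=-1)        # Refine: assign leftovers to anchors
    kept = tokens[kept_idx].clone()
    kept.index_add_(0, assign, tokens[rest])                 # accumulate assigned tokens
    counts = torch.ones(len(kept_idx)).index_add_(0, assign, torch.ones(len(rest)))
    return kept / counts.unsqueeze(-1)                       # mean-merged kept tokens

pruned = focus_scan_refine(torch.randn(576, 256), torch.randn(256))
print(pruned.shape)  # torch.Size([48, 256])
```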
[176] NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects
Musawar Ali, Manuel Carranza-García, Nicola Fioraio, Samuele Salti, Luigi Di Stefano
Main category: cs.CV
TL;DR: NVS-HO is the first benchmark for novel view synthesis of handheld objects using only RGB inputs, featuring two complementary sequences (handheld manipulation and board-mounted) to evaluate NVS models under real-world unconstrained conditions.
Details
Motivation: Current novel view synthesis methods lack robust benchmarks for real-world handheld object scenarios where objects are manipulated in unconstrained conditions with only RGB inputs, creating a need for challenging real-world evaluation.Method: Create benchmark with two RGB sequences per object: (1) handheld manipulation with static camera, (2) board-mounted with ChArUco board for accurate camera poses. Use SfM pipeline and VGGT as pose estimators, train NVS models based on NeRF and Gaussian Splatting for baseline evaluation.
Result: Experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, demonstrating the benchmark’s ability to highlight limitations of existing approaches in real-world scenarios.
Conclusion: NVS-HO provides a challenging real-world benchmark that exposes weaknesses in current RGB-based novel view synthesis methods for handheld objects, driving need for more robust approaches in this domain.
Abstract: We propose NVS-HO, the first benchmark designed for novel view synthesis of handheld objects in real-world environments using only RGB inputs. Each object is recorded in two complementary RGB sequences: (1) a handheld sequence, where the object is manipulated in front of a static camera, and (2) a board sequence, where the object is fixed on a ChArUco board to provide accurate camera poses via marker detection. The goal of NVS-HO is to learn an NVS model that captures the full appearance of an object from (1), whereas (2) provides the ground-truth images used for evaluation. To establish baselines, we consider both a classical SfM pipeline and a state-of-the-art pre-trained feed-forward neural network (VGGT) as pose estimators, and train NVS models based on NeRF and Gaussian Splatting. Our experiments reveal significant performance gaps in current methods under unconstrained handheld conditions, highlighting the need for more robust approaches. NVS-HO thus offers a challenging real-world benchmark to drive progress in RGB-based novel view synthesis of handheld objects.
[177] Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li
Main category: cs.CV
TL;DR: SparseVideoNav introduces video generation models for Beyond-the-View Navigation, enabling autonomous navigation with sparse high-level instructions rather than detailed step-by-step guidance, achieving 27x speed-up and superior performance in real-world zero-shot scenarios.
Details
Motivation: Current vision-language navigation relies on detailed language instructions, which contradicts real-world navigation needs where agents should navigate autonomously with simple high-level intents. The challenge is Beyond-the-View Navigation (BVN) - locating distant unseen targets without dense guidance.Method: Proposes SparseVideoNav that introduces video generation models for BVN tasks. Video generation models inherently benefit from long-horizon supervision for language alignment. The method generates sparse future trajectories spanning 20-second horizons for sub-second inference, achieving 27x speed-up over unoptimized video generation.
Result: Achieves 2.5x success rate compared to state-of-the-art LLM baselines on BVN tasks. Successfully demonstrates capability in challenging night scenes. Realizes 27x speed-up for trajectory inference.
Conclusion: Video generation models are uniquely suitable for BVN tasks due to their long-horizon supervision benefits. SparseVideoNav enables practical real-world deployment by overcoming latency issues through sparse trajectory generation, marking significant advancement in autonomous navigation with minimal language guidance.
Abstract: Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align with language instructions, rendering them uniquely suitable for BVN tasks. Capitalizing on this insight, we propose introducing the video generation model into this field for the first time. Yet, the prohibitive latency for generating videos spanning tens of seconds makes real-world deployment impractical. To bridge this gap, we propose SparseVideoNav, achieving sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon. This yields a remarkable 27x speed-up compared to the unoptimized counterpart. Extensive real-world zero-shot experiments demonstrate that SparseVideoNav achieves 2.5x the success rate of state-of-the-art LLM baselines on BVN tasks and marks the first realization of such capability in challenging night scenes.
[178] Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning
Yudi Shi, Shangzhe Di, Qirui Chen, Qinian Wang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie
Main category: cs.CV
TL;DR: Weaver: A multimodal reasoning agentic system that dynamically invokes tools during video reasoning, using reinforcement learning to explore tool usage strategies, improving performance on complex video reasoning benchmarks.
Details
Motivation: Current text-centric Chain-of-Thought approaches for video reasoning suffer from representational mismatch and limited perceptual acuity, failing to fully leverage multimodal information in videos.Method: End-to-end trainable multimodal reasoning agentic system that empowers policy models to dynamically invoke diverse tools throughout reasoning process, acquiring visual cues progressively. Integrates reinforcement learning for exploring tool usage strategies with trajectory-free data.
Result: Extensive experiments show Weaver enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
Conclusion: Weaver addresses limitations of text-centric approaches by enabling dynamic tool invocation and multimodal reasoning trajectories, advancing video reasoning capabilities.
Abstract: Video reasoning constitutes a comprehensive assessment of a model’s capabilities, as it demands robust perceptual and interpretive skills, thereby serving as a means to explore the boundaries of model performance. While recent research has leveraged text-centric Chain-of-Thought reasoning to augment these capabilities, such approaches frequently suffer from representational mismatch and are restricted by limited perceptual acuity. To address these limitations, we propose Weaver, a novel, end-to-end trainable multimodal reasoning agentic system. Weaver empowers its policy model to dynamically invoke diverse tools throughout the reasoning process, enabling progressive acquisition of crucial visual cues and construction of authentic multimodal reasoning trajectories. Furthermore, we integrate a reinforcement learning algorithm to allow the system to freely explore strategies for employing and combining these tools with trajectory-free data. Extensive experiments demonstrate that our system, Weaver, enhances performance on several complex video reasoning benchmarks, particularly those involving long videos.
[179] UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
Han Xiao, Guozhi Wang, Hao Wang, Shilong Liu, Yuxiang Chai, Yue Pan, Yufeng Zhou, Xiaoxin Chen, Yafei Wen, Hongsheng Li
Main category: cs.CV
TL;DR: UI-Mem enhances online RL for GUI agents with a hierarchical experience memory, improving credit assignment and cross-task transfer.
Details
Motivation: Online RL for GUI agents faces inefficient credit assignment in long-horizon tasks and lacks experience transfer across tasks, limiting effectiveness.Method: Proposes UI-Mem framework with Hierarchical Experience Memory storing structured knowledge (workflows, subtask skills, failure patterns) as parameterized templates. Uses Stratified Group Sampling to inject varying guidance levels and Self-Evolving Loop to abstract novel strategies.
Result: Significantly outperforms traditional RL baselines and static reuse strategies on online GUI benchmarks, with strong generalization to unseen applications
Conclusion: UI-Mem effectively addresses credit assignment and transfer learning challenges in GUI online RL through hierarchical memory and guided exploration
Abstract: Online Reinforcement Learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. However, its effectiveness is severely hindered by inefficient credit assignment in long-horizon tasks and repetitive errors across tasks due to the lack of experience transfer. To address these challenges, we propose UI-Mem, a novel framework that enhances GUI online RL with a Hierarchical Experience Memory. Unlike traditional replay buffers, our memory accumulates structured knowledge, including high-level workflows, subtask skills, and failure patterns. These experiences are stored as parameterized templates that enable cross-task and cross-application transfer. To effectively integrate memory guidance into online RL, we introduce Stratified Group Sampling, which injects varying levels of guidance across trajectories within each rollout group to maintain outcome diversity, driving the unguided policy toward internalizing guided behaviors. Furthermore, a Self-Evolving Loop continuously abstracts novel strategies and errors to keep the memory aligned with the agent’s evolving policy. Experiments on online GUI benchmarks demonstrate that UI-Mem significantly outperforms traditional RL baselines and static reuse strategies, with strong generalization to unseen applications. Project page: https://ui-mem.github.io
[180] Self-Supervised Learning with a Multi-Task Latent Space Objective
Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans
Main category: cs.CV
TL;DR: Multi-crop strategy causes instability in predictor-based SSL methods; solution uses separate predictors per view type and adds cutout views, creating stable multi-task framework that improves ResNet and ViT performance on ImageNet.
Details
Motivation: Multi-crop strategy enhances SSL frameworks but causes instability in predictor-based architectures like BYOL, SimSiam, and MoCo v3. The authors aim to solve this instability problem and improve SSL performance.Method: Assign separate predictors to each view type instead of sharing one predictor across all views. Extend this by treating each spatial transformation as distinct alignment task and add cutout views (masked images). Create multi-task formulation combining global, local, and masked views.
Result: Stabilizes multi-crop training, yields significant performance gains. Approach is stable, generally applicable across backbones, and consistently improves ResNet and ViT models on ImageNet.
Conclusion: Separate predictors per view type stabilize multi-crop SSL training. Multi-task formulation with global, local, and masked views creates effective framework that works across different architectures.
Abstract: Self-supervised learning (SSL) methods based on Siamese networks learn visual representations by aligning different views of the same image. The multi-crop strategy, which incorporates small local crops to global ones, enhances many SSL frameworks but causes instability in predictor-based architectures such as BYOL, SimSiam, and MoCo v3. We trace this failure to the shared predictor used across all views and demonstrate that assigning a separate predictor to each view type stabilizes multi-crop training, resulting in significant performance gains. Extending this idea, we treat each spatial transformation as a distinct alignment task and add cutout views, where part of the image is masked before encoding. This yields a simple multi-task formulation of asymmetric Siamese SSL that combines global, local, and masked views into a single framework. The approach is stable, generally applicable across backbones, and consistently improves the performance of ResNet and ViT models on ImageNet.
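The core fix, one predictor head per view type in an asymmetric (BYOL-style) setup, can be sketched as below. Module sizes and names are assumptions; this is not the authors' implementation.
```python
# Minimal sketch: per-view-type predictors for multi-crop Siamese SSL (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(in_dim, hidden=512, out=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out))

class MultiPredictorSSL(nn.Module):
    def __init__(self, backbone, feat_dim, view_types=("global", "local", "cutout")):
        super().__init__()
        # backbone: any encoder mapping an image batch to (B, feat_dim) features.
        self.online_encoder = nn.Sequential(backbone, mlp(feat_dim))
        # One predictor per view type instead of a single shared predictor.
        self.predictors = nn.ModuleDict({v: mlp(256) for v in view_types})

    def loss(self, online_views, target_feats):
        """online_views: dict view_type -> batch of crops of the same images;
        target_feats: (B, 256) projections of the global view from the EMA/stop-grad target branch."""
        total = 0.0
        for vtype, x in online_views.items():
            p = self.predictors[vtype](self.online_encoder(x))
            # BYOL-style loss: 2 - 2 * cosine similarity to the detached target.
            total = total + (2 - 2 * F.cosine_similarity(p, target_feats.detach(), dim=-1)).mean()
        return total / len(online_views)
```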
[181] Pathwise Test-Time Correction for Autoregressive Long Video Generation
Xunzhi Xiang, Zixuan Duan, Guiyu Zhang, Haiyu Zhang, Zhe Gao, Junta Wu, Shaofeng Zhang, Tengfei Wang, Qi Fan, Chunchao Guo
Main category: cs.CV
TL;DR: TTC (Test-Time Correction) addresses error accumulation in distilled autoregressive diffusion models for long video generation by using initial frames as anchors to calibrate intermediate states during sampling.
Details
Motivation: Distilled autoregressive diffusion models enable real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. Existing Test-Time Optimization (TTO) methods work for images/short clips but fail for extended sequences due to unstable reward landscapes and hypersensitive distilled parameters.
Method: Introduces Test-Time Correction (TTC), a training-free method that uses the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. This approach integrates with various distilled models without additional training.
Result: Extensive experiments show TTC seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching quality of resource-intensive training-based methods on 30-second benchmarks.
Conclusion: TTC provides an effective training-free solution to mitigate error accumulation in long video generation with distilled autoregressive diffusion models, overcoming limitations of existing TTO methods.
Abstract: Distilled autoregressive diffusion models facilitate real-time short video synthesis but suffer from severe error accumulation during long-sequence generation. While existing Test-Time Optimization (TTO) methods prove effective for images or short clips, we identify that they fail to mitigate drift in extended sequences due to unstable reward landscapes and the hypersensitivity of distilled parameters. To overcome these limitations, we introduce Test-Time Correction (TTC), a training-free alternative. Specifically, TTC utilizes the initial frame as a stable reference anchor to calibrate intermediate stochastic states along the sampling trajectory. Extensive experiments demonstrate that our method seamlessly integrates with various distilled models, extending generation lengths with negligible overhead while matching the quality of resource-intensive training-based methods on 30-second benchmarks.
[182] Contour Refinement using Discrete Diffusion in Low Data Regime
Fei Yu Guan, Ian Keefe, Sophie Wilkinson, Daniel D. B. Perrakis, Steven Waslander
Main category: cs.CV
TL;DR: Lightweight discrete diffusion contour refinement pipeline for boundary detection of irregular/translucent objects in low-data regimes, outperforming SOTA on medical imaging datasets with 3.5X faster inference.
Details
Motivation: Boundary detection of irregular and translucent objects is important for medical imaging, environmental monitoring, and manufacturing, but faces challenges with scarce labeled data and low computational resources. While recent segmentation studies focus on mask alignment, boundary detection remains understudied in low-data regimes.
Method: Proposes a lightweight discrete diffusion contour refinement pipeline using a CNN with self-attention layers. Conditions on segmentation masks and iteratively denoises sparse contour representations. Introduces novel adaptations: simplified diffusion process, customized model architecture, and minimal post-processing for datasets with <500 training images.
Result: Outperforms several SOTA baselines on medical imaging dataset KVASIR, competitive on HAM10K and custom wildfire dataset (Smoke), while improving inference framerate by 3.5X.
Conclusion: The proposed lightweight discrete diffusion contour refinement pipeline effectively addresses boundary detection for irregular/translucent objects in low-data regimes, achieving strong performance with improved computational efficiency.
Abstract: Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, where many of these applications are plagued with scarce labeled data and low in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground-truth, the task of boundary detection remains understudied, especially in the low-data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low-data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including using a simplified diffusion process, a customized model architecture, and minimal post-processing to produce a dense, isolated contour given a dataset of size <500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR, is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5X.
[183] EoCD: Encoder only Remote Sensing Change Detection
Mubashir Noman, Mustansar Fiaz, Hiyam Debary, Abdul Hannan, Shah Nawaz, Fahad Shahbaz Khan, Salman Khan
Main category: cs.CV
TL;DR: EoCD is an encoder-only change detection method that uses early fusion of temporal data and replaces complex decoders with parameter-free multiscale feature fusion, reducing model complexity while maintaining performance.
Details
Motivation: Existing change detection methods rely on Siamese encoders with complex decoders, increasing computational cost and model complexity. Early fusion methods exist but have inferior performance and still use sophisticated decoders.
Method: Proposes encoder-only change detection (EoCD) that performs early fusion of temporal data and replaces decoder with parameter-free multiscale feature fusion module, significantly reducing model complexity.
Result: EoCD achieves optimal balance between change detection performance and prediction speed across various encoder architectures, showing performance depends predominantly on encoder rather than decoder.
Conclusion: EoCD demonstrates that decoder is an additional component rather than essential, providing effective change detection with reduced complexity across four challenging datasets.
Abstract: Being a cornerstone of temporal analysis, change detection plays a pivotal role in modern earth observation. Existing change detection methods rely on a Siamese encoder to individually extract temporal features followed by temporal fusion. Subsequently, these methods design sophisticated decoders to improve change detection performance without taking the complexity of the model into consideration. These issues increase both the overall computational cost and the network's complexity, which is undesirable. Alternatively, a few methods utilize an early fusion scheme to combine the temporal images. These methods avoid the extra overhead of a Siamese encoder; however, they still rely on sophisticated decoders for better performance. In addition, these methods demonstrate inferior performance compared to late-fusion-based methods. To bridge these gaps, we introduce encoder-only change detection (EoCD), a simple and effective method for the change detection task. The proposed method performs early fusion of the temporal data and replaces the decoder with a parameter-free multiscale feature fusion module, thereby significantly reducing the overall complexity of the model. EoCD achieves an optimal balance between change detection performance and prediction speed across a variety of encoder architectures. Additionally, EoCD demonstrates that the performance of the model is predominantly dependent on the encoder network, making the decoder an additional component. Extensive experimentation on four challenging change detection datasets reveals the effectiveness of the proposed method.
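Under the assumptions below (a backbone whose first convolution accepts six channels and which returns one feature map per stage), the encoder-only design reduces to early channel concatenation plus a parameter-free resize-and-concatenate fusion feeding a 1x1 classifier. This is a sketch of the idea, not the released EoCD code.
```python
# Hedged sketch of encoder-only change detection with parameter-free multiscale fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderOnlyChangeDetector(nn.Module):
    def __init__(self, encoder, stage_channels=(64, 128, 256, 512), num_classes=2):
        super().__init__()
        self.encoder = encoder                  # assumed: 6-channel input -> list of stage features
        self.head = nn.Conv2d(sum(stage_channels), num_classes, kernel_size=1)

    def forward(self, img_t1, img_t2):
        x = torch.cat([img_t1, img_t2], dim=1)  # early temporal fusion (6 channels)
        feats = self.encoder(x)                 # e.g. feature maps at strides 4/8/16/32
        size = feats[0].shape[-2:]
        # Parameter-free fusion: resize every stage to the finest resolution and concatenate;
        # only the 1x1 classification head below carries learnable weights.
        fused = torch.cat([F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                           for f in feats], dim=1)
        logits = self.head(fused)
        # Assumes the finest stage has stride 4 relative to the input images.
        return F.interpolate(logits, scale_factor=4, mode="bilinear", align_corners=False)
```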
[184] Neural Implicit 3D Cardiac Shape Reconstruction from Sparse CT Angiography Slices Mimicking 2D Transthoracic Echocardiography Views
Gino E. Jansen, Carolina Brás, R. Nils Planken, Mark J. Schuuring, Berto J. Bouma, Ivana Išgum
Main category: cs.CV
TL;DR: 3D cardiac shape reconstruction from sparse 2D planes using neural implicit functions, applied to echocardiography views.
Details
Motivation: Accurate 3D cardiac representations enable quantitative analysis of anatomy and function, but 2D transthoracic echocardiography (TTE) provides only sparse views. Current clinical methods like Simpson's biplane rule have limitations in accuracy.
Method: Uses neural implicit functions to reconstruct 3D cardiac shapes from sparse CT angiography (CTA) planes that mimic standard apical 2D TTE views. A multi-layer perceptron learns shape priors from 3D segmentations during training, then at test time jointly optimizes latent codes and rigid transforms to map observed planes into 3D space.
Result: Achieves average Dice coefficient of 0.86 ± 0.04 across all structures on held-out CTA segmentations. Significantly outperforms clinical standard: left ventricle volume errors of 4.88 ± 4.26 mL vs 8.14 ± 6.04 mL, and left atrium errors of 6.40 ± 7.37 mL vs 37.76 ± 22.96 mL.
Conclusion: The approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography, potentially improving clinical cardiac assessment.
Abstract: Accurate 3D representations of cardiac structures allow quantitative analysis of anatomy and function. In this work, we propose a method for reconstructing complete 3D cardiac shapes from segmentations of sparse planes in CT angiography (CTA) for application in 2D transthoracic echocardiography (TTE). Our method uses a neural implicit function to reconstruct the 3D shape of the cardiac chambers and left-ventricle myocardium from sparse CTA planes. To investigate the feasibility of achieving 3D reconstruction from 2D TTE, we select planes that mimic the standard apical 2D TTE views. During training, a multi-layer perceptron learns shape priors from 3D segmentations of the target structures in CTA. At test time, the network reconstructs 3D cardiac shapes from segmentations of TTE-mimicking CTA planes by jointly optimizing the latent code and the rigid transforms that map the observed planes into 3D space. For each heart, we simulate four realistic apical views, and we compare reconstructed multi-class volumes with the reference CTA volumes. On a held-out set of CTA segmentations, our approach achieves an average Dice coefficient of 0.86 ± 0.04 across all structures. Our method also achieves markedly lower volume errors than the clinical standard, Simpson's biplane rule: 4.88 ± 4.26 mL vs. 8.14 ± 6.04 mL, respectively, for the left ventricle; and 6.40 ± 7.37 mL vs. 37.76 ± 22.96 mL, respectively, for the left atrium. This suggests that our approach offers a viable route to more accurate 3D chamber quantification in 2D transthoracic echocardiography.
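The test-time step, jointly optimizing a latent shape code and a rigid transform that places the observed plane segmentations into the canonical space of the shape prior, might look roughly as follows. The decoder interface, the 6D rotation parameterization, and the use of a single shared transform (the paper fits transforms for the observed planes) are assumptions made for brevity.
```python
# Hedged sketch of test-time fitting of a neural implicit shape to sparse plane segmentations.
import torch
import torch.nn.functional as F

def fit_shape(decoder, plane_points, plane_labels, latent_dim=256, steps=500, lr=1e-2):
    """decoder(latent, xyz) -> (P, num_classes) logits at queried 3D points (frozen shape prior).
    plane_points: (P, 3) 3D coordinates of pixels sampled on the observed planes (view space).
    plane_labels: (P,) class indices from the TTE-mimicking segmentations."""
    latent = torch.zeros(latent_dim, requires_grad=True)
    rot6d = torch.randn(6, requires_grad=True)        # continuous 6D rotation parameterization
    trans = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([latent, rot6d, trans], lr=lr)
    for _ in range(steps):
        a, b = rot6d[:3], rot6d[3:]
        r1 = F.normalize(a, dim=0)
        r2 = F.normalize(b - (b @ r1) * r1, dim=0)    # Gram-Schmidt orthogonalization
        R = torch.stack([r1, r2, torch.linalg.cross(r1, r2)], dim=1)
        xyz = plane_points @ R.T + trans              # map observed planes into canonical space
        loss = F.cross_entropy(decoder(latent, xyz), plane_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return latent.detach(), R.detach(), trans.detach()
```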
[185] CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression
Kangjie Zhang, Wenxuan Huang, Xin Zhou, Boxiang Zhou, Dejia Song, Yuan Xie, Baochang Zhang, Lizhuang Ma, Nemo Chen, Xu Tang, Yao Hu, Shaohui Lin
Main category: cs.CV
TL;DR: CLIP-Map: A novel mapping-based compression framework for CLIP that uses learnable matrices with Kronecker factorization to preserve original weight information, outperforming select-based compression methods especially at high compression ratios.
Details
Motivation: CLIP has high memory and computation costs that limit its use in resource-constrained scenarios. Existing compression methods use select-based weight inheritance which compromises feature representation ability, especially under extreme compression.
Method: Proposes CLIP-Map, a mapping-based compression framework that uses learnable matrices to map and combine pretrained weights via Full-Mapping with Kronecker Factorization. Introduces Diagonal Inheritance Initialization to reduce distribution shifting for efficient mapping learning.
Result: Extensive experiments show CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains under high compression settings.
Conclusion: CLIP-Map provides an effective mapping-based approach for CLIP compression that better preserves original weight information compared to selection-based methods, enabling more efficient deployment in resource-limited scenarios.
Abstract: Contrastive Language-Image Pre-training (CLIP) has found wide application in various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibit its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as weight inheritance for further retraining via mask optimization or important-weight measurement. However, such select-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pretrained weights by Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization to reduce the distribution-shift problem for efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains observed under high compression settings.
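Based on the summary above, the mapping idea might be sketched as follows: a frozen pretrained weight is multiplied on both sides by learnable maps built as Kronecker products of small factors, initialized as partial identities so that the compressed weight starts out as an inherited sub-block of the original. The factor shapes, the two-sided placement of the maps, and the initialization details are assumptions, not the paper's exact formulation.
```python
# Hedged sketch of Kronecker-factorized weight mapping with a diagonal-style initialization.
import torch
import torch.nn as nn

def eye_pad(rows, cols):
    """Rectangular 'diagonal' matrix: identity in the top-left corner, zeros elsewhere."""
    m = torch.zeros(rows, cols)
    k = min(rows, cols)
    m[:k, :k] = torch.eye(k)
    return m

class KroneckerWeightMap(nn.Module):
    """Produces a compressed weight L @ W @ R.T from a frozen pretrained weight W,
    where L = kron(L1, L2) and R = kron(R1, R2) are learnable."""
    def __init__(self, W, left_factors, right_factors):
        super().__init__()
        self.register_buffer("W", W.detach())
        (p1, q1), (p2, q2) = left_factors     # p1*p2 = compressed rows, q1*q2 = W.shape[0]
        (r1, s1), (r2, s2) = right_factors    # r1*r2 = compressed cols, s1*s2 = W.shape[1]
        assert q1 * q2 == W.shape[0] and s1 * s2 == W.shape[1]
        # Diagonal-inheritance-style init: the maps start as row/column selections,
        # so the initial compressed weight is a sub-block of the pretrained weight.
        self.L1, self.L2 = nn.Parameter(eye_pad(p1, q1)), nn.Parameter(eye_pad(p2, q2))
        self.R1, self.R2 = nn.Parameter(eye_pad(r1, s1)), nn.Parameter(eye_pad(r2, s2))

    def forward(self):
        L = torch.kron(self.L1, self.L2)
        R = torch.kron(self.R1, self.R2)
        return L @ self.W @ R.T

# Example (hypothetical sizes): compress a 768x768 projection to 384x384.
mapper = KroneckerWeightMap(torch.randn(768, 768), ((16, 32), (24, 24)), ((16, 32), (24, 24)))
w_small = mapper()   # shape (384, 384); the factors are trained with the rest of the model
```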
[186] Multi-Scale Global-Instance Prompt Tuning for Continual Test-time Adaptation in Medical Image Segmentation
Lingrui Li, Yanfeng Zhou, Nan Pu, Xin Chen, Zhun Zhong
Main category: cs.CV
TL;DR: MGIPT introduces multi-scale global-instance prompt tuning for continual test-time adaptation in medical image segmentation, addressing domain shift challenges without catastrophic forgetting.
Details
Motivation: Distribution shift in medical images from different clinical centers hinders deployment of pre-trained segmentation models. Existing CTTA methods suffer from error accumulation and catastrophic forgetting, while prompt-based approaches lack multi-scale diversity and instance-specific knowledge.
Method: Proposes Multi-scale Global-Instance Prompt Tuning (MGIPT) with two components: Adaptive-scale Instance Prompt (AIP) for lightweight instance-specific prompts with adaptive scale selection, and Multi-scale Global-level Prompt (MGP) for domain-level knowledge across scales. Uses weighted ensemble approach for dual-level adaptation.
Result: Extensive experiments on medical image segmentation benchmarks show MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
Conclusion: MGIPT effectively addresses limitations of existing CTTA methods by combining multi-scale global and instance-level prompt tuning, enabling robust adaptation without catastrophic forgetting or privacy leakage concerns.
Abstract: Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT) to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
[187] Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching
Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim
Main category: cs.CV
TL;DR: Learning condition-dependent source distributions for flow matching in text-to-image generation improves performance and convergence speed.
Details
Motivation: Current flow matching approaches for text-to-image generation typically use standard Gaussian source distributions inherited from diffusion models, without optimizing the source distribution itself. The authors argue that principled design of source distributions can significantly improve conditional generation performance.
Method: Propose learning condition-dependent source distributions under flow matching objective. Address failure modes like distributional collapse and instability through variance regularization and directional alignment between source and target. Analyze impact of target representation space on flow matching with structured sources.
Result: Extensive experiments across multiple text-to-image benchmarks show consistent improvements, including up to 3x faster convergence in FID metrics.
Conclusion: Principled design of source distributions for conditional flow matching is both feasible and beneficial for modern text-to-image systems, offering practical advantages over standard Gaussian assumptions.
Abstract: Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.
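A compact sketch of the training objective implied by the summary: sample the source from a learned N(mu(c), diag(sigma(c)^2)), apply the standard linear-path flow-matching loss, and add a variance floor plus a source-target direction term. The function interfaces, the exact regularizer forms, and the loss weights are assumptions.
```python
# Hedged sketch of conditional flow matching with a learned, condition-dependent source.
import torch
import torch.nn.functional as F

def cfm_loss(velocity_net, source_net, x1, cond, min_logvar=-4.0, lam_var=0.1, lam_dir=0.1):
    """x1: target latents (B, D); cond: conditioning embeddings (B, C).
    source_net(cond) -> (mu, logvar), each (B, D); velocity_net(xt, t, cond) -> (B, D)."""
    mu, logvar = source_net(cond)
    x0 = mu + torch.randn_like(x1) * (0.5 * logvar).exp()      # sample the learned source
    t = torch.rand(x1.size(0), 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                                  # linear interpolation path
    fm = F.mse_loss(velocity_net(xt, t, cond), x1 - x0)         # regress the constant velocity x1 - x0
    var_reg = F.relu(min_logvar - logvar).mean()                # discourage distributional collapse
    dir_reg = (1 - F.cosine_similarity(mu, x1, dim=-1)).mean()  # align source mean with the target
    return fm + lam_var * var_reg + lam_dir * dir_reg
```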
[188] LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation
Mirlan Karimov, Teodora Spasojevic, Markus Braun, Julian Wiederer, Vasileios Belagiannis, Marc Pollefeys
Main category: cs.CV
TL;DR: LSA framework fine-tunes video generation models to improve temporal consistency by aligning semantic features between ground-truth and generated videos around dynamic objects, eliminating need for control signals at inference.
Details
Motivation: Existing controllable video generation methods for autonomous driving require control signals at inference time, limiting their scalability and generalizability as data engines. There's a need for temporally consistent video generation without external guidance during inference.
Method: Proposes Localized Semantic Alignment (LSA) framework that fine-tunes pre-trained video generation models by aligning semantic features between ground-truth and generated video clips localized around dynamic objects. Uses off-the-shelf feature extraction model to compute semantic feature consistency loss, combined with standard diffusion loss for fine-tuning.
Result: Model fine-tuned for single epoch with LSA outperforms baselines in video generation metrics. Adapted object detection metrics (mAP and mIoU) show improved temporal consistency. Extensive experiments on nuScenes and KITTI datasets demonstrate effectiveness without computational overhead or control signals at inference.
Conclusion: LSA effectively enhances temporal consistency in video generation for autonomous driving scenarios without requiring external control signals during inference, making it more scalable and generalizable as a data engine.
Abstract: Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines in common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any additional computational overhead.
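The localized loss itself could look roughly like the following sketch: crop the same dynamic-object regions from the ground-truth and generated clips, compare features from a frozen off-the-shelf extractor, and add the result to the standard diffusion loss during fine-tuning. The box format, the 224x224 resize, and the cosine form of the loss are assumptions.
```python
# Hedged sketch of a localized semantic feature consistency loss.
import torch
import torch.nn.functional as F

def localized_semantic_loss(feature_extractor, gt_clip, gen_clip, boxes, weight=1.0):
    """gt_clip, gen_clip: (T, 3, H, W) videos; boxes: iterable of (t, x1, y1, x2, y2)
    crops around dynamic objects (e.g. from tracked detections)."""
    losses = []
    for t, x1, y1, x2, y2 in boxes:
        gt_patch = F.interpolate(gt_clip[t:t + 1, :, y1:y2, x1:x2], size=(224, 224),
                                 mode="bilinear", align_corners=False)
        gen_patch = F.interpolate(gen_clip[t:t + 1, :, y1:y2, x1:x2], size=(224, 224),
                                  mode="bilinear", align_corners=False)
        with torch.no_grad():
            target_feat = feature_extractor(gt_patch)        # frozen, off-the-shelf features
        pred_feat = feature_extractor(gen_patch)
        losses.append((1 - F.cosine_similarity(pred_feat, target_feat, dim=-1)).mean())
    sem = torch.stack(losses).mean() if losses else gen_clip.new_zeros(())
    return weight * sem   # added to the standard diffusion loss during fine-tuning
```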
[189] RISE-Video: Can Video Generators Decode Implicit World Rules?
Mingxin Liu, Shuran Ma, Shibei Meng, Xiangyu Zhao, Zicheng Zhang, Shaofeng Zhang, Zhihang Zhong, Peixian Chen, Haoyu Cao, Xing Sun, Haodong Duan, Xue Yang
Main category: cs.CV
TL;DR: RISE-Video is a reasoning-oriented benchmark for Text-Image-to-Video synthesis that evaluates models’ ability to understand and reason about implicit world rules, going beyond visual quality to assess cognitive reasoning capabilities.
Details
Motivation: Current generative video models focus on visual fidelity but lack capacity to internalize and reason over implicit world rules. There's a need to shift evaluation from surface-level aesthetics to deep cognitive reasoning to advance world-simulating generative models.
Method: Created RISE-Video benchmark with 467 human-annotated samples across 8 reasoning categories. Introduced multi-dimensional evaluation protocol with 4 metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. Developed automated pipeline using Large Multimodal Models to emulate human assessment for scalable evaluation.
Result: Extensive experiments on 11 state-of-the-art TI2V models revealed pervasive deficiencies in simulating complex scenarios under implicit constraints, showing current models struggle with reasoning capabilities despite visual quality improvements.
Conclusion: RISE-Video provides a structured testbed for probing model intelligence across diverse reasoning dimensions, offering critical insights for advancing future world-simulating generative models beyond visual fidelity to cognitive reasoning.
Abstract: While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
[190] VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation
Jie Deng, Kaichun Yao, Libo Zhang
Main category: cs.CV
TL;DR: VisRefiner: A training framework for screenshot-to-code generation that enables models to learn from visual differences between rendered predictions and reference designs, improving both single-step generation quality and self-refinement ability.
Details
Motivation: Existing multimodal LLMs for screenshot-to-code generation directly map screenshots to code without observing the visual outcomes of their generated code, unlike human developers who iteratively render, compare, and learn from visual differences.
Method: Proposes VisRefiner with two key components: 1) Difference-aligned supervision that associates visual discrepancies with corresponding code edits, and 2) Reinforcement learning stage for self-refinement where models improve code by observing rendered output, target design, and their visual differences.
Result: VisRefiner substantially improves single-step generation quality and layout fidelity, while endowing models with strong self-refinement ability through visual difference learning.
Conclusion: Learning from visual differences is effective for advancing screenshot-to-code generation, enabling models to better understand how appearance variations arise from implementation changes.
Abstract: Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
[191] GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
Ruihang Li, Leigang Qu, Jingxu Zhang, Dongnan Gui, Mengde Xu, Xiaosong Zhang, Han Hu, Wenjie Wang, Jiaqi Wang
Main category: cs.CV
TL;DR: GenArena introduces a pairwise comparison framework for evaluating visual generation models, addressing limitations of traditional pointwise scoring methods by improving stability and human alignment.
Details
Motivation: Traditional evaluation approaches for visual generation models are inadequate as they rely on absolute pointwise scoring, which suffers from stochastic inconsistency and poor alignment with human perception. The rapid advancement of visual generation models requires more reliable evaluation methods.
Method: GenArena uses a pairwise comparison paradigm instead of absolute pointwise scoring. It leverages Vision-Language Models as surrogate judges to compare generated images in pairs, creating a more stable and human-aligned evaluation framework across diverse visual generation tasks.
Result: The pairwise approach enables off-the-shelf open-source models to outperform top-tier proprietary models. It boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, compared to only 0.36 for pointwise methods.
Conclusion: GenArena provides a rigorous and automated evaluation standard for visual generation models, addressing fundamental limitations of current evaluation paradigms and offering a more reliable benchmark for the community.
Abstract: The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
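The pairwise protocol reduces to a small loop: a VLM judge compares two generations for the same prompt and the outcomes are aggregated into ratings before the resulting ranking is correlated with human leaderboards. The Elo aggregation, the judge interface, and the K-factor below are assumptions, not the paper's exact specification.
```python
# Hedged sketch of pairwise judging with Elo aggregation.
import itertools
import random

def elo_leaderboard(models, prompts, judge, k=16, base=1000.0, rounds=3):
    """models: objects with a hypothetical .generate(prompt) -> image method.
    judge(prompt, img_a, img_b) -> 1.0 if A wins, 0.0 if B wins, 0.5 for a tie (e.g. a VLM judge)."""
    ratings = {m: base for m in models}
    matches = [(a, b, p) for a, b in itertools.combinations(models, 2) for p in prompts] * rounds
    random.shuffle(matches)
    for a, b, prompt in matches:
        score_a = judge(prompt, a.generate(prompt), b.generate(prompt))
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1 - score_a) - (1 - expected_a))
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
```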
[192] MambaVF: State Space Model for Efficient Video Fusion
Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler
Main category: cs.CV
TL;DR: MambaVF is an efficient video fusion framework using state space models that eliminates optical flow estimation, achieving SOTA performance with significantly reduced computation and parameters.
Details
Motivation: Current video fusion methods rely heavily on optical flow estimation and feature warping, which causes high computational overhead and limits scalability. There's a need for more efficient temporal modeling approaches.
Method: MambaVF reformulates video fusion as a sequential state update process using state space models (SSMs). It replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism that captures long-range temporal dependencies with linear complexity.
Result: Achieves state-of-the-art performance across multiple benchmarks (multi-exposure, multi-focus, infrared-visible, and medical video fusion). Reduces up to 92.25% parameters, 88.79% computational FLOPs, and provides 2.1x speedup compared to existing methods.
Conclusion: MambaVF demonstrates that SSM-based approaches can efficiently handle video fusion tasks without explicit motion estimation, offering a scalable and computationally efficient alternative to flow-based methods.
Abstract: Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods heavily rely on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment via a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that our MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. We highlight that MambaVF enjoys high efficiency, reducing up to 92.25% of parameters and 88.79% of computational FLOPs and a 2.1x speedup compared to existing methods. Project page: https://mambavf.github.io
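The "sequential state update" view can be illustrated with a toy diagonal SSM over the frame axis: each fused frame is produced from a running state in one linear-time pass, with no optical flow or warping. The real MambaVF uses selective state-space blocks with bidirectional spatio-temporal scanning; this is only a conceptual sketch.
```python
# Toy diagonal SSM over frames (conceptual sketch, not the MambaVF architecture).
import torch
import torch.nn as nn

class TemporalSSMFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(channels))   # per-channel state decay
        self.b = nn.Parameter(torch.ones(channels))        # input gain
        self.c = nn.Parameter(torch.ones(channels))        # output gain

    def forward(self, frames):
        """frames: (T, C, H, W) per-frame features (e.g. the two source videos interleaved).
        Returns (T, C, H, W) fused features with cost linear in T."""
        a = torch.sigmoid(self.log_a).view(1, -1, 1, 1)
        b = self.b.view(1, -1, 1, 1)
        c = self.c.view(1, -1, 1, 1)
        state = torch.zeros_like(frames[:1])
        outputs = []
        for t in range(frames.shape[0]):                   # h_t = a*h_{t-1} + b*x_t ; y_t = c*h_t
            state = a * state + b * frames[t:t + 1]
            outputs.append(c * state)
        return torch.cat(outputs, dim=0)
```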
[193] Context Forcing: Consistent Autoregressive Video Generation with Long Context
Shuo Chen, Cong Wei, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen
Main category: cs.CV
TL;DR: Context Forcing: A novel framework for real-time long video generation that trains long-context student models using long-context teachers to eliminate student-teacher mismatch, enabling generation with context lengths exceeding 20 seconds.
Details
Motivation: Existing real-time long video generation methods suffer from student-teacher mismatch where short-context teachers (limited to 5-second windows) cannot guide students on global temporal dependencies, effectively capping context length and limiting long-term consistency.
Method: Proposes Context Forcing framework with long-context teacher aware of full generation history, plus a Slow-Fast Memory architecture for context management that transforms linearly growing context to reduce visual redundancy, making extreme durations (e.g., 2 minutes) computationally feasible.
Result: Enables effective context lengths exceeding 20 seconds (2-10x longer than SOTA methods like LongLive and Infinite-RoPE), preserves superior consistency across long durations, and surpasses SOTA baselines on various long video evaluation metrics.
Conclusion: Context Forcing successfully resolves the student-teacher mismatch problem in long video generation, enabling robust training of models with extended context awareness and superior long-term consistency through computationally efficient context management.
Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds – 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
[194] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation
David Shavin, Sagie Benaim
Main category: cs.CV
TL;DR: Splat and Distill framework enhances 2D Vision Foundation Models with 3D awareness by using feed-forward 3D Gaussian reconstruction to lift 2D features into 3D and then projecting them to novel viewpoints for student supervision.
Details
Motivation: Current Vision Foundation Models (VFMs) excel at 2D tasks but lack 3D awareness, limiting their understanding of spatial relationships and geometry. The authors aim to bridge this gap by instilling robust 3D understanding into 2D VFMs.
Method: The framework uses a teacher model to extract 2D features, then lifts these features into an explicit 3D Gaussian representation via feed-forward 3D reconstruction. These 3D features are projected (“splatted”) onto novel viewpoints to create novel 2D feature maps, which supervise a student model through distillation, creating a dynamic learning loop where both teacher and student improve.
Result: The method significantly outperforms prior works on multiple downstream tasks including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. It achieves substantial gains in 3D awareness while also enhancing the semantic richness of 2D features.
Conclusion: Splat and Distill successfully instills 3D awareness into 2D Vision Foundation Models through a novel feed-forward 3D reconstruction and distillation approach, overcoming limitations of previous methods and improving performance across various 3D-aware tasks.
Abstract: Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feed-forward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at https://davidshavin4.github.io/Splat-and-Distill/
[195] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
Dongyang Chen, Chaoyang Wang, Dezhao SU, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Ka
Main category: cs.CV
TL;DR: V-Retrver is an evidence-driven multimodal retrieval framework that enables MLLMs to actively gather visual evidence during reasoning using external tools, improving accuracy by 23% on average across benchmarks.
Details
Motivation: Current multimodal retrieval approaches using Chain-of-Thought reasoning are largely language-driven with static visual encodings, lacking active visual verification which leads to speculative reasoning in visually ambiguous cases.
Method: Proposes V-Retrver framework that reformulates multimodal retrieval as agentic reasoning with selective visual evidence acquisition via external tools, using curriculum-based learning combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with evidence-aligned objectives.
Result: Experiments across multiple multimodal retrieval benchmarks show consistent improvements: 23.0% average improvement in retrieval accuracy, enhanced perception-driven reasoning reliability, and better generalization.
Conclusion: V-Retrver demonstrates that evidence-driven, active visual verification in multimodal retrieval significantly outperforms static visual encoding approaches, enabling more reliable and accurate multimodal reasoning.
Abstract: Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.
[196] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Sirui Xu, Samuel Schulter, Morteza Ziyadi, Xialin He, Xiaohan Fei, Yu-Xiong Wang, Liangyan Gui
Main category: cs.CV
TL;DR: InterPrior: A scalable framework for humanoid whole-body interaction learning through imitation pretraining and RL finetuning that generalizes to unseen objects and contexts.
Details
Motivation: Humans plan interactions at the affordance level rather than explicit movements, with coordination emerging from physical priors. Scaling such priors is needed for humanoids to generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination.
Method: InterPrior learns a unified generative controller through: 1) Large-scale imitation pretraining to distill a full-reference imitation expert into a goal-conditioned variational policy, 2) Data augmentation with physical perturbations, and 3) Reinforcement learning finetuning to improve generalization to unseen goals and initializations.
Result: The framework yields a motion prior that generalizes beyond training data, enabling interactions with unseen objects. It demonstrates effectiveness for user-interactive control and potential for real robot deployment.
Conclusion: InterPrior provides a scalable approach for learning whole-body interaction priors that can generalize across diverse contexts, bridging imitation learning and reinforcement learning for humanoid control.
Abstract: Humans rarely plan whole-body interactions with objects at the level of explicit whole-body movements. High-level intentions, such as affordance, define the goal, while coordinated balance, contact, and manipulation can emerge naturally from underlying physical and motor priors. Scaling such priors is key to enabling humanoids to compose and generalize loco-manipulation skills across diverse contexts while maintaining physically coherent whole-body coordination. To this end, we introduce InterPrior, a scalable framework that learns a unified generative controller through large-scale imitation pretraining and post-training by reinforcement learning. InterPrior first distills a full-reference imitation expert into a versatile, goal-conditioned variational policy that reconstructs motion from multimodal observations and high-level intent. While the distilled policy reconstructs training behaviors, it does not generalize reliably due to the vast configuration space of large-scale human-object interactions. To address this, we apply data augmentation with physical perturbations, and then perform reinforcement learning finetuning to improve competence on unseen goals and initializations. Together, these steps consolidate the reconstructed latent skills into a valid manifold, yielding a motion prior that generalizes beyond the training data, e.g., it can incorporate new behaviors such as interactions with unseen objects. We further demonstrate its effectiveness for user-interactive control and its potential for real robot deployment.
[197] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang
Main category: cs.CV
TL;DR: GeoThinker is an active perception framework for MLLMs that enables selective retrieval of geometric evidence based on internal reasoning demands, achieving state-of-the-art spatial intelligence performance.
Details
Motivation: Current MLLMs for spatial reasoning passively fuse geometric priors from 3D encoders as global streams, leading to semantic-geometry misalignment and redundant signals. The authors aim to shift from passive fusion to active perception.
Method: GeoThinker uses Spatial-Grounded Fusion at selected VLM layers where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, calibrated by Importance Gating that biases attention toward task-relevant structures.
Result: Achieves state-of-the-art spatial intelligence with peak score of 72.6 on VSI-Bench, demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios including embodied referring and autonomous driving.
Conclusion: Active integration of spatial structures is essential for next-generation spatial intelligence. The ability to selectively retrieve geometric evidence based on reasoning demands outperforms passive fusion approaches.
Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.
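A minimal reading of "frame-strict cross-attention calibrated by importance gating" is sketched below: for each frame independently, semantic tokens query that frame's geometric tokens, and a learned scalar gate controls how much geometry is injected back. The tensor layout, the gate input, and the residual form are assumptions.
```python
# Hedged sketch of frame-strict cross-attention with per-frame importance gating.
import torch
import torch.nn as nn

class SpatialGroundedFusion(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())   # per-frame importance

    def forward(self, vis_tokens, geo_tokens):
        """vis_tokens, geo_tokens: (B, F, N, D) semantic / geometric tokens per frame."""
        B, num_frames, N, D = vis_tokens.shape
        out = []
        for f in range(num_frames):                        # frame-strict: no cross-frame mixing
            q, kv = vis_tokens[:, f], geo_tokens[:, f]
            fused, _ = self.attn(q, kv, kv)
            g = self.gate(q.mean(dim=1, keepdim=True))     # scalar gate from the frame summary
            out.append(q + g * fused)                      # inject geometry only where it matters
        return torch.stack(out, dim=1)
```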
[198] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
Jintao Tong, Shilin Yan, Hongwei Xue, Xiaojun Tang, Kunyu Shi, Guannan Zhang, Ruixuan Li, Yixiong Zou
Main category: cs.CV
TL;DR: SwimBird is a reasoning-switchable MLLM that dynamically chooses among text-only, vision-only, and interleaved vision-text reasoning modes based on input queries, improving performance on vision-dense tasks while preserving textual logic.
Details
Motivation: Current MLLMs primarily use textual Chain-of-Thought reasoning, which limits effectiveness on vision-intensive tasks. Recent approaches using fixed visual thoughts improve visual performance but degrade text-based logical reasoning. The core limitation is rigid, pre-defined reasoning patterns that cannot adaptively choose suitable thinking modalities for different queries.
Method: SwimBird uses a hybrid autoregressive formulation unifying next-token prediction for textual thoughts with next-embedding prediction for visual thoughts. It dynamically switches among three reasoning modes: text-only, vision-only (continuous hidden states as visual thoughts), and interleaved vision-text reasoning. A systematic reasoning-mode curation strategy constructs SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns.
Result: SwimBird achieves state-of-the-art results across diverse benchmarks covering textual reasoning and challenging visual understanding. It demonstrates robust gains over prior fixed-pattern multimodal reasoning methods, preserving strong textual logic while substantially improving performance on vision-dense tasks.
Conclusion: SwimBird’s adaptive reasoning-switchable architecture enables flexible, query-appropriate mode selection, overcoming limitations of fixed reasoning patterns in MLLMs and achieving superior performance on both textual and visual reasoning tasks.
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in multimodal perception and reasoning by bridging vision and language. However, most existing MLLMs perform reasoning primarily with textual CoT, which limits their effectiveness on vision-intensive tasks. Recent approaches inject a fixed number of continuous hidden states as “visual thoughts” into the reasoning process and improve visual performance, but often at the cost of degraded text-based logical reasoning. We argue that the core limitation lies in a rigid, pre-defined reasoning pattern that cannot adaptively choose the most suitable thinking modality for different user queries. We introduce SwimBird, a reasoning-switchable MLLM that dynamically switches among three reasoning modes conditioned on the input: (1) text-only reasoning, (2) vision-only reasoning (continuous hidden states as visual thoughts), and (3) interleaved vision-text reasoning. To enable this capability, we adopt a hybrid autoregressive formulation that unifies next-token prediction for textual thoughts with next-embedding prediction for visual thoughts, and design a systematic reasoning-mode curation strategy to construct SwimBird-SFT-92K, a diverse supervised fine-tuning dataset covering all three reasoning patterns. By enabling flexible, query-adaptive mode selection, SwimBird preserves strong textual logic while substantially improving performance on vision-dense tasks. Experiments across diverse benchmarks covering textual reasoning and challenging visual understanding demonstrate that SwimBird achieves state-of-the-art results and robust gains over prior fixed-pattern multimodal reasoning methods.
[199] Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning
Xuejun Zhang, Aditi Tiwari, Zhenhailong Wang, Heng Ji
Main category: cs.CV
TL;DR: CAMCUE is a pose-aware multi-image framework for 3D spatial reasoning that uses camera pose as geometric anchor for cross-view fusion and novel-view reasoning, enabling perspective taking from language-specified viewpoints.
Details
Motivation: Current multimodal LLMs struggle with multi-image spatial reasoning, particularly perspective taking - building coherent 3D understanding from multi-view observations and reasoning from new language-specified viewpoints.
Method: CAMCUE injects per-view camera pose into visual tokens, grounds natural-language viewpoint descriptions to target camera poses, and synthesizes pose-conditioned imagined target views to support answering perspective-shift questions.
Result: CAMCUE improves overall accuracy by 9.06%, predicts target poses from language descriptions with >90% rotation accuracy within 20° and translation accuracy within 0.5 error threshold, reducing inference time from 256.6s to 1.45s per example.
Conclusion: CAMCUE enables efficient multi-view spatial reasoning by explicitly incorporating camera geometry, demonstrating significant improvements in perspective taking while enabling fast, interactive use in real-world scenarios.
Abstract: Multi-image spatial reasoning remains challenging for current multimodal large language models (MLLMs). While single-view perception is inherently 2D, reasoning over multiple views requires building a coherent scene understanding across viewpoints. In particular, we study perspective taking, where a model must build a coherent 3D understanding from multi-view observations and use it to reason from a new, language-specified viewpoint. We introduce CAMCUE, a pose-aware multi-image framework that uses camera pose as an explicit geometric anchor for cross-view fusion and novel-view reasoning. CAMCUE injects per-view pose into visual tokens, grounds natural-language viewpoint descriptions to a target camera pose, and synthesizes a pose-conditioned imagined target view to support answering. To support this setting, we curate CAMCUE-DATA with 27,668 training and 508 test instances pairing multi-view images and poses with diverse target-viewpoint descriptions and perspective-shift questions. We also include human-annotated viewpoint descriptions in the test split to evaluate generalization to human language. CAMCUE improves overall accuracy by 9.06% and predicts target poses from natural-language viewpoint descriptions with over 90% rotation accuracy within 20° and translation accuracy within a 0.5 error threshold. This direct grounding avoids expensive test-time search-and-match, reducing inference time from 256.6s to 1.45s per example and enabling fast, interactive use in real-world scenarios.
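The pose-injection step can be sketched as a small addition to the visual tokens of each view, with the camera extrinsics encoded by an MLP. The flattened 3x4-extrinsics encoding and where the embedding is added are assumptions based on the summary above.
```python
# Hedged sketch of per-view camera-pose injection into visual tokens.
import torch
import torch.nn as nn

class PoseInjection(nn.Module):
    def __init__(self, token_dim, pose_dim=12):
        super().__init__()
        # Flattened 3x4 extrinsics -> token-sized embedding (encoding choice is an assumption).
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, token_dim), nn.GELU(),
                                      nn.Linear(token_dim, token_dim))

    def forward(self, view_tokens, extrinsics):
        """view_tokens: (B, V, N, D) visual tokens per view;
        extrinsics: (B, V, 3, 4) camera poses per view."""
        pose_emb = self.pose_mlp(extrinsics.flatten(-2))   # (B, V, D)
        return view_tokens + pose_emb.unsqueeze(2)         # broadcast over the N tokens of each view
```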
[200] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin
Main category: cs.CV
TL;DR: Vision-R1: A multimodal reasoning MLLM trained with RL using a novel cold-start dataset and progressive training strategy to enhance multimodal math reasoning capabilities.
Details
Motivation: Inspired by DeepSeek-R1-Zero's success in RL-based reasoning emergence in LLMs, the authors aim to extend RL training to MLLMs for enhanced multimodal reasoning, but face challenges due to lack of high-quality multimodal reasoning data and optimization difficulties.
Method: 1) Construct Vision-R1-cold dataset (200K multimodal CoT) using existing MLLM and DeepSeek-R1 via modality bridging and filtering; 2) Use Progressive Thinking Suppression Training (PTST) strategy; 3) Employ Group Relative Policy Optimization (GRPO) with hard formatting result reward on 10K multimodal math dataset to refine reasoning processes.
Result: Average ~6% improvement across multimodal math reasoning benchmarks; Vision-R1-7B achieves 73.5% on MathVista (0.4% below OpenAI O1); Vision-R1-32B gets 76.4%; Vision-R1-72B achieves 78.2% on MathVista.
Conclusion: RL can effectively enhance multimodal reasoning in MLLMs when combined with proper cold-start data and progressive training strategies, achieving competitive performance with state-of-the-art reasoning models.
Abstract: DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model’s ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. Scaling up the amount of multimodal math data in the RL training, Vision-R1-32B and Vision-R1-72B achieve 76.4% and 78.2% MathVista benchmark scores, respectively. The datasets and code will be released at https://github.com/Osilly/Vision-R1.
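The RL recipe combines a hard-formatting reward with GRPO's group-relative advantages. The sketch below illustrates both pieces; the <think>/<answer> tag format and the exact-match check are simplified assumptions, while the group normalization follows the standard GRPO formulation.

```python
import re
import torch

def hard_format_reward(response: str, gold_answer: str) -> float:
    """Simplified hard-formatting reward (assumption): the response must wrap its
    reasoning in <think>...</think> and its answer in <answer>...</answer>, and the
    extracted answer must match the gold answer exactly."""
    m = re.fullmatch(r"\s*<think>.*</think>\s*<answer>(.*)</answer>\s*", response, re.S)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold_answer.strip() else 0.0

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each reward by the mean/std of its group
    (all completions sampled for the same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

responses = [
    "<think>2+2=4</think><answer>4</answer>",
    "<think>guess</think><answer>5</answer>",
    "the answer is 4",
]
rewards = torch.tensor([hard_format_reward(r, "4") for r in responses])
print(grpo_advantages(rewards))
```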
[201] Imperceptible Protection against Style Imitation from Diffusion Models
Namhyuk Ahn, Wonhyuk Ahn, KiYoon Yoo, Daesik Kim, Seung-Hun Nam
Main category: cs.CV
TL;DR: A method for protecting artworks from style imitation by diffusion models while maintaining visual quality through perceptual maps, instance-aware refinement, difficulty-aware protection, and perceptual constraints.
Details
Motivation: As diffusion models improve image generation fidelity, they raise copyright concerns for artworks. Existing protection methods degrade visual quality while preventing style imitation, so there's a need for protection that maintains artwork quality.
Method: 1) Perceptual map to highlight human-sensitive areas with instance-aware refinement; 2) Difficulty-aware protection predicting protection difficulty and adjusting intensity dynamically; 3) Perceptual constraints bank to improve imperceptibility.
Result: The method substantially elevates the quality of protected images without compromising protection efficacy against style imitation by diffusion models.
Conclusion: The proposed approach successfully protects artworks from style imitation while maintaining visual quality, addressing both copyright concerns and aesthetic preservation.
Abstract: Recent progress in diffusion models has profoundly enhanced the fidelity of image generation, but it has raised concerns about copyright infringements. While prior methods have introduced adversarial perturbations to prevent style imitation, most are accompanied by the degradation of artworks’ visual quality. Recognizing the importance of maintaining this, we introduce a visually improved protection method while preserving its protection capability. To this end, we devise a perceptual map to highlight areas sensitive to human eyes, guided by instance-aware refinement, which refines the protection intensity accordingly. We also introduce a difficulty-aware protection by predicting how difficult the artwork is to protect and dynamically adjusting the intensity based on this. Lastly, we integrate a perceptual constraints bank to further improve the imperceptibility. Results show that our method substantially elevates the quality of the protected image without compromising on protection efficacy.
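The interplay of the perceptual map and difficulty-aware intensity can be sketched as scaling a protective perturbation per pixel and per image before clamping; the specific scaling rule below is an assumption, not the authors' implementation.

```python
import torch

def apply_protection(image, perturbation, perceptual_map, base_budget=8 / 255, difficulty=1.0):
    """Scale a protective perturbation by a per-pixel perceptual map and a per-image
    difficulty factor, then clamp the result back to a valid image (sketch only).

    perceptual_map in [0, 1]: high where the eye is sensitive, so less perturbation is
    placed there; `difficulty` would come from a difficulty predictor (assumption)."""
    budget = base_budget * difficulty * (1.0 - perceptual_map)
    protected = image + perturbation.clamp(-1.0, 1.0) * budget
    return protected.clamp(0.0, 1.0)

img = torch.rand(1, 3, 64, 64)
delta = torch.randn(1, 3, 64, 64)
pmap = torch.rand(1, 1, 64, 64)
print(apply_protection(img, delta, pmap).shape)  # torch.Size([1, 3, 64, 64])
```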
[202] MVGS: Multi-view Regulated Gaussian Splatting for Novel View Synthesis
Xiaobiao Du, Yida Wang, Xin Yu
Main category: cs.CV
TL;DR: A novel 3D Gaussian Splatting optimization method with multi-view training, cross-intrinsic guidance, cross-ray densification, and multi-view augmented densification to improve novel-view synthesis and 3D geometry accuracy.
Details
Motivation: Current 3D Gaussian Splatting methods suffer from overfitting to certain training views due to single-view supervision per iteration, leading to poor novel-view synthesis and imprecise 3D geometries.
Method: Four key contributions: 1) Multi-view training strategy with regulation to prevent overfitting, 2) Cross-intrinsic guidance for coarse-to-fine training across resolutions, 3) Cross-ray densification in ray-intersect regions from multiple views, 4) Multi-view augmented densification that adapts to dramatic view differences.
Result: Improved overall accuracy across various scenarios and different Gaussian variants, with enhanced reconstruction accuracy and better novel-view synthesis performance.
Conclusion: The proposed multi-view optimization framework significantly improves 3D Gaussian Splatting by addressing overfitting issues and enhancing geometric reconstruction through novel training strategies.
Abstract: Recent works in volume rendering, e.g., NeRF and 3D Gaussian Splatting (3DGS), significantly advance the rendering quality and efficiency with the help of the learned implicit neural radiance field or 3D Gaussians. Rendering on top of an explicit representation, the vanilla 3DGS and its variants deliver real-time efficiency by optimizing the parametric model with single-view supervision per iteration during training, a scheme adopted from NeRF. Consequently, certain views are overfitted, leading to unsatisfying appearance in novel-view synthesis and imprecise 3D geometries. To solve the aforementioned problems, we propose a new 3DGS optimization method embodying four key novel contributions: 1) We transform the conventional single-view training paradigm into a multi-view training strategy. With our proposed multi-view regulation, 3D Gaussian attributes are further optimized without overfitting certain training views. As a general solution, we improve the overall accuracy in a variety of scenarios and different Gaussian variants. 2) Inspired by the benefit introduced by additional views, we further propose a cross-intrinsic guidance scheme, leading to a coarse-to-fine training procedure concerning different resolutions. 3) Built on top of our multi-view regulated training, we further propose a cross-ray densification strategy, densifying more Gaussian kernels in the ray-intersect regions from a selection of views. 4) By further investigating the densification strategy, we found that the effect of densification should be enhanced when certain views are dramatically distinct. As a solution, we propose a novel multi-view augmented densification strategy, where 3D Gaussians are encouraged to get densified to a sufficient number accordingly, resulting in improved reconstruction accuracy.
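The central change from single-view to multi-view regulated training amounts to accumulating the photometric loss over several sampled views before each Gaussian update. A minimal sketch, assuming a render_fn(gaussians, camera) callable; it illustrates only the multi-view regulation idea, not the densification strategies.

```python
import random
import torch.nn.functional as F

def multi_view_loss(gaussians, cameras, gt_images, render_fn, views_per_step: int = 4):
    """Accumulate a photometric (L1) loss over several sampled views before one
    optimizer step, instead of supervising with a single view per iteration.

    render_fn(gaussians, camera) is assumed to return the rendered image for a view."""
    idxs = random.sample(range(len(cameras)), k=min(views_per_step, len(cameras)))
    loss = 0.0
    for i in idxs:
        rendered = render_fn(gaussians, cameras[i])
        loss = loss + F.l1_loss(rendered, gt_images[i])
    return loss / len(idxs)
```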
[203] EUGens: Efficient, Unified, and General Dense Layers
Sang Min Kim, Byeongchan Kim, Arijit Sehanobish, Somnath Basu Roy Chowdhury, Rahul Kidambi, Dongseok Shim, Avinava Dubey, Snigdha Chaturvedi, Min-hwan Oh, Krzysztof Choromanski
Main category: cs.CV
TL;DR: EUGens: Efficient, Unified, General dense layers that generalize fully-connected layers using random features and input norm dependence, reducing quadratic to linear complexity while maintaining expressive power.
Details
Motivation: Fully-connected feedforward layers create computation and parameter bottlenecks in neural networks, limiting scalability for real-time applications and resource-constrained environments.
Method: Propose EUGens that leverage random features to approximate standard FFLs and incorporate direct dependence on input norms, unifying existing efficient FFL extensions and enabling linear-time inference.
Result: EUGens integrated into Transformers and MLPs achieve up to 27% faster inference speed and 30% better memory efficiency across image classification, language model pre-training, and 3D scene reconstruction tasks.
Conclusion: EUGens offer a scalable solution for deploying large-scale neural networks in real-world scenarios by significantly improving efficiency while preserving model performance.
Abstract: Efficient neural networks are essential for scaling machine learning models to real-time applications and resource-constrained environments. Fully-connected feedforward layers (FFLs) introduce computation and parameter count bottlenecks within neural network architectures. To address this challenge, in this work, we propose a new class of dense layers that generalize standard fully-connected feedforward layers, Efficient, Unified and General dense layers (EUGens). EUGens leverage random features to approximate standard FFLs and go beyond them by incorporating a direct dependence on the input norms in their computations. The proposed layers unify existing efficient FFL extensions and improve efficiency by reducing inference complexity from quadratic to linear time. They also lead to the first unbiased algorithms approximating FFLs with arbitrary polynomial activation functions. Furthermore, EUGens reduce the parameter count and computational overhead while preserving the expressive power and adaptability of FFLs. We also present a layer-wise knowledge transfer technique that bypasses backpropagation, enabling efficient adaptation of EUGens to pre-trained models. Empirically, we observe that integrating EUGens into Transformers and MLPs yields substantial improvements in inference speed (up to 27%) and memory efficiency (up to 30%) across a range of tasks, including image classification, language model pre-training, and 3D scene reconstruction. Overall, our results highlight the potential of EUGens for the scalable deployment of large-scale neural networks in real-world scenarios.
[204] Efficient Scene Modeling via Structure-Aware and Region-Prioritized 3D Gaussians
Guangchi Fang, Bing Wang
Main category: cs.CV
TL;DR: Mini-Splatting2 improves 3D Gaussian Splatting with geometry-aware distribution and optimization for more compact, faster, and higher-quality 3D scene reconstruction.
Details
Motivation: Current 3D Gaussian Splatting (3DGS) methods rely heavily on photometric supervision, leading to irregular spatial distribution and indiscriminate primitive adjustments that ignore geometric context. There's a need for more geometry-regulated approaches to improve efficiency and quality.
Method: Introduces two key mechanisms: 1) Structure-aware distribution that enforces spatial regularity through structured reorganization and representation sparsity for compact organization, and 2) Region-prioritized optimization that improves training discrimination through geometric saliency and computational selectivity for faster convergence.
Result: Achieves up to 4× fewer Gaussians and 3× faster optimization while maintaining state-of-the-art visual quality. Demonstrates improved representation compactness, convergence acceleration, and rendering fidelity.
Conclusion: Mini-Splatting2 successfully drives 3DGS into a geometry-regulated paradigm, alleviating tensions between compactness, speed, and quality, paving the way for more structured and efficient 3D Gaussian modeling.
Abstract: Reconstructing 3D scenes with high fidelity and efficiency remains a central pursuit in computer vision and graphics. Recent advances in 3D Gaussian Splatting (3DGS) enable photorealistic rendering with Gaussian primitives, yet the modeling process remains governed predominantly by photometric supervision. This reliance often leads to irregular spatial distribution and indiscriminate primitive adjustments that largely ignore underlying geometric context. In this work, we rethink Gaussian modeling from a geometric standpoint and introduce Mini-Splatting2, an efficient scene modeling framework that couples structure-aware distribution and region-prioritized optimization, driving 3DGS into a geometry-regulated paradigm. The structure-aware distribution enforces spatial regularity through structured reorganization and representation sparsity, ensuring balanced structural coverage for compact organization. The region-prioritized optimization improves training discrimination through geometric saliency and computational selectivity, fostering appropriate structural emergence for fast convergence. These mechanisms alleviate the long-standing tension among representation compactness, convergence acceleration, and rendering fidelity. Extensive experiments demonstrate that Mini-Splatting2 achieves up to 4× fewer Gaussians and 3× faster optimization while maintaining state-of-the-art visual quality, paving the way towards structured and efficient 3D Gaussian modeling.
[205] RAD: Region-Aware Diffusion Models for Image Inpainting
Sora Kim, Sungho Suh, Minsik Lee
Main category: cs.CV
TL;DR: RAD (Region-Aware Diffusion) is a novel diffusion model approach for image inpainting that uses pixel-specific noise schedules for asynchronous local generation while maintaining global context, achieving 100x faster inference than SOTA methods.
Details
Motivation: Existing diffusion-based inpainting methods either hijack pretrained models' reverse processes or use complex conditioning frameworks, requiring nested loops or additional components, leading to slow inference times.
Method: RAD reformulates vanilla diffusion models by assigning different noise schedules to each pixel, enabling asynchronous local region generation while considering global context. Uses plain reverse process without extra components and employs LoRA for efficient fine-tuning.
Result: Achieves state-of-the-art results on FFHQ, LSUN Bedroom, and ImageNet datasets with up to 100x faster inference than existing approaches, while maintaining high quality both qualitatively and quantitatively.
Conclusion: RAD provides an effective and efficient diffusion-based inpainting solution that balances local region generation with global context awareness, significantly improving inference speed without compromising quality.
Abstract: Diffusion models have achieved remarkable success in image generation, with applications broadening across various domains. Inpainting is one such application that can benefit significantly from diffusion models. Existing methods either hijack the reverse process of a pretrained diffusion model or cast the problem into a larger framework, i.e., conditioned generation. However, these approaches often require nested loops in the generation process or additional components for conditioning. In this paper, we present region-aware diffusion models (RAD) for inpainting with a simple yet effective reformulation of the vanilla diffusion models. RAD utilizes a different noise schedule for each pixel, which allows local regions to be generated asynchronously while considering the global image context. A plain reverse process requires no additional components, enabling RAD to achieve inference time up to 100 times faster than the state-of-the-art approaches. Moreover, we employ low-rank adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models, reducing computational burdens in training as well. Experiments demonstrated that RAD provides state-of-the-art results both qualitatively and quantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.
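The key reformulation is a per-pixel noise schedule, so the region to inpaint can sit at a different effective noise level than the known region while both share one image. A minimal sketch of such a pixel-wise forward process; the schedule values and mask handling are assumptions for illustration.

```python
import torch

def forward_diffuse_per_pixel(x0: torch.Tensor, alpha_bar_map: torch.Tensor) -> torch.Tensor:
    """Per-pixel forward diffusion q(x_t | x_0).

    x0:            (B, C, H, W) clean image
    alpha_bar_map: (B, 1, H, W) cumulative signal level per pixel in [0, 1]; in an
                   inpainting setting the hole region would typically get a lower
                   alpha_bar (more noise) than the known region (assumption)."""
    noise = torch.randn_like(x0)
    return alpha_bar_map.sqrt() * x0 + (1.0 - alpha_bar_map).sqrt() * noise

x0 = torch.rand(1, 3, 64, 64)
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0  # region to inpaint
alpha_bar = torch.where(mask.bool(), torch.tensor(0.1), torch.tensor(0.9))
xt = forward_diffuse_per_pixel(x0, alpha_bar)
print(xt.shape)  # torch.Size([1, 3, 64, 64])
```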
[206] Human Body Restoration with One-Step Diffusion Model and A New Benchmark
Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Main category: cs.CV
TL;DR: Proposes PERSONA dataset for human body restoration using automated pipeline, and OSDHuman one-step diffusion model with HFIE for better guidance.
Details
Motivation: Human body restoration lacks benchmark datasets; existing datasets have quality and content limitations, hindering thorough research in this important practical application.
Method: 1) HQ-ACF pipeline to automatically crop/filter high-quality human images from existing datasets; 2) PERSONA dataset construction; 3) OSDHuman one-step diffusion model with High-Fidelity Image Embedder (HFIE) for better guidance.
Result: PERSONA dataset surpasses others in quality/content richness; OSDHuman outperforms existing methods in both visual quality and quantitative metrics.
Conclusion: Proposed dataset and model advance human body restoration research; PERSONA provides valuable benchmark, OSDHuman shows state-of-the-art performance.
Abstract: Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (PERSONA) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at https://github.com/gobunu/OSDHuman.
[207] TextOCVP: Object-Centric Video Prediction with Language Guidance
Angel Villar-Corrales, Gjergj Plepi, Sven Behnke
Main category: cs.CV
TL;DR: TextOCVP: Object-centric video prediction model guided by textual descriptions that parses scenes into object slots and uses text-conditioned transformers for controllable future forecasting.
Details
Motivation: Existing object-centric models struggle to scale beyond simple synthetic datasets and lack integration of external guidance, limiting their applicability in robotics for scene understanding and forecasting.
Method: Parses observed scenes into object representations (slots), uses text-conditioned transformer predictor to forecast future object states and video frames, jointly models object dynamics and interactions with textual guidance.
Result: Outperforms several video prediction baselines on two datasets, provides superior robustness to novel scene configurations, improved controllability and interpretability through structured object-centric representations.
Conclusion: TextOCVP enables accurate and controllable video predictions with structured latent spaces, offering precise control of forecasting process and more understandable predictions for autonomous agents.
Abstract: Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and predicting future scene states, but often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP’s structured latent space offers a more precise control of the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability, enabling more precise and understandable predictions. Videos and code are available at https://play-slot.github.io/TextOCVP.
[208] Hidden in Plain Sight – Class Competition Focuses Attribution Maps
Nils Philipp Walter, Jilles Vreeken, Jonas Fischer
Main category: cs.CV
TL;DR: The paper proposes using distributions of attributions over multiple classes instead of single-class logits to improve specificity and fine-grained attribution in neural network interpretability methods.
Details
Motivation: Current attribution methods often highlight both important and irrelevant features, appearing unspecific. The authors identify that using logits as attribution targets is a main cause of this problem, and seek to improve attribution specificity.
Method: The authors revisit the common attribution pipeline and propose considering distributions of attributions over multiple classes using existing attribution methods. This approach leverages existing attribution techniques but applies them across multiple classes rather than focusing on single-class logits.
Result: The method improves the ability of 18 attribution methods across 7 architectures up to 2x on common benchmarks including the grid-pointing game and randomization-based sanity checks. The improvement is agnostic to model architecture.
Conclusion: Using attribution distributions over multiple classes yields specific and fine-grained attributions, solving the common problem of unspecific attributions in neural network interpretability methods.
Abstract: Attribution methods reveal which input features a neural network uses for a prediction, adding transparency to their decisions. A common problem is that these attributions seem unspecific, highlighting both important and irrelevant features. We revisit the common attribution pipeline and observe that using logits as attribution target is a main cause of this phenomenon. We show that the solution is in plain sight: considering distributions of attributions over multiple classes using existing attribution methods yields specific and fine-grained attributions. On common benchmarks, including the grid-pointing game and randomization-based sanity checks, this improves the ability of 18 attribution methods across 7 architectures up to 2x, agnostic to model architecture.
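The core idea of attributing with respect to a distribution over classes rather than a single-class logit can be sketched by taking the attribution target after the softmax, so classes compete for evidence; the gradient-times-input choice below is just one of the many attribution methods the paper covers, and the exact formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def class_competitive_attribution(model, x: torch.Tensor, target: int) -> torch.Tensor:
    """Gradient-times-input attribution taken on the post-softmax probability of the
    target class, so that all classes compete, instead of on the raw target logit.

    Minimal sketch; the paper applies the idea to many attribution methods."""
    x = x.clone().requires_grad_(True)
    probs = F.softmax(model(x), dim=-1)   # class competition via softmax
    probs[:, target].sum().backward()
    return (x.grad * x).detach()          # gradient x input heatmap

# toy usage with a random linear classifier over flattened 8x8 "images"
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 10))
heatmap = class_competitive_attribution(model, torch.rand(2, 1, 8, 8), target=3)
print(heatmap.shape)  # torch.Size([2, 1, 8, 8])
```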
[209] CMD-HAR: Cross-Modal Disentanglement for Wearable Human Activity Recognition
Ying Yu, Siyao Li, Yixuan Jiang, Hang Xiao, Jingxi Long, Haotian Tang, Hanyu Liu, Chao Li
Main category: cs.CV
TL;DR: A multimodal sensor-based human activity recognition method using spatiotemporal attention and modal decomposition alignment fusion to address data mixing, activity heterogeneity, and deployment challenges.
Details
Motivation: To solve key challenges in sensor-based human activity recognition including multimodal data mixing, activity heterogeneity, and complex model deployment for real-world applications.
Method: Proposes spatiotemporal attention modal decomposition alignment fusion strategy, cross-modal spatio-temporal disentangled representation for key feature capture, gradient modulation for data heterogeneity, and wearable deployment simulation system.
Result: Experiments on numerous public datasets demonstrate the effectiveness of the proposed model in addressing the identified challenges.
Conclusion: The proposed approach effectively tackles multimodal data mixing, activity heterogeneity, and deployment issues in sensor-based human activity recognition.
Abstract: Human Activity Recognition (HAR) is a fundamental technology for numerous human-centered intelligent applications. Although deep learning methods have been utilized to accelerate feature extraction, issues such as multimodal data mixing, activity heterogeneity, and complex model deployment remain largely unresolved. The aim of this paper is to address issues such as multimodal data mixing, activity heterogeneity, and complex model deployment in sensor-based human activity recognition. We propose a spatiotemporal attention modal decomposition alignment fusion strategy to tackle the problem of the mixed distribution of sensor data. Key discriminative features of activities are captured through cross-modal spatio-temporal disentangled representation, and gradient modulation is combined to alleviate data heterogeneity. In addition, a wearable deployment simulation system is constructed. We conducted experiments on a large number of public datasets, demonstrating the effectiveness of the model.
[210] LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation
Hengyu Shi, Junhao Su, Tianyang Han, Junfeng Luo, Jialin Gao
Main category: cs.CV
TL;DR: LayoutCoT: A training-free approach using LLMs with Retrieval-Augmented Generation and Chain-of-Thought reasoning for conditional layout generation, achieving SOTA without fine-tuning.
Details
Motivation: Existing layout generation methods require substantial training data or fine-tuning, while training-free LLM approaches have limited reasoning capabilities and simplistic ranking mechanisms that restrict quality.
Method: Transforms layout representations into serialized format for LLMs, uses Layout-aware RAG for coarse layout generation, then employs a specialized CoT reasoning module for iterative refinement using exemplars.
Result: Achieves state-of-the-art performance on five datasets across three conditional layout generation tasks without training or fine-tuning. CoT reasoning enables standard LLMs to outperform specialized deep-reasoning models.
Conclusion: LayoutCoT demonstrates the potential of unleashing deep reasoning capabilities of LLMs for layout generation through RAG and CoT techniques, offering a versatile and practical training-free solution.
Abstract: Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints. While recent methods based on generative models have shown promising results, they typically require substantial amounts of training data or extensive fine-tuning, limiting their versatility and practical applicability. Alternatively, some training-free approaches leveraging in-context learning with Large Language Models (LLMs) have emerged, but they often suffer from limited reasoning capabilities and overly simplistic ranking mechanisms, which restrict their ability to generate consistently high-quality layouts. To this end, we propose LayoutCoT, a novel approach that leverages the reasoning capabilities of LLMs through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques. Specifically, LayoutCoT transforms layout representations into a standardized serialized format suitable for processing by LLMs. A Layout-aware RAG is used to facilitate effective retrieval and generate a coarse layout by LLMs. This preliminary layout, together with the selected exemplars, is then fed into a specially designed CoT reasoning module for iterative refinement, significantly enhancing both semantic coherence and visual quality. We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks. Experimental results demonstrate that LayoutCoT achieves state-of-the-art performance without requiring training or fine-tuning. Notably, our CoT reasoning module enables standard LLMs, even those without explicit deep reasoning abilities, to outperform specialized deep-reasoning models such as deepseek-R1, highlighting the potential of our approach in unleashing the deep reasoning capabilities of LLMs for layout generation tasks.
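Because the approach is training-free, the pipeline is essentially prompt construction: serialize layouts, retrieve exemplars, draft a coarse layout, and refine it with chain-of-thought. The serialization format and prompt wording in this sketch are illustrative assumptions, not the paper's exact templates.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Element:
    category: str
    x: int; y: int; w: int; h: int  # assumed integer coordinates on a fixed canvas

def serialize(layout: List[Element]) -> str:
    """Turn a layout into a line-per-element text form an LLM can read (assumed format)."""
    return "\n".join(f"{e.category}: ({e.x}, {e.y}, {e.w}, {e.h})" for e in layout)

def build_refinement_prompt(constraints: str,
                            exemplars: List[List[Element]],
                            coarse_layout: List[Element]) -> str:
    """Compose the CoT refinement prompt from retrieved exemplars and a coarse draft."""
    exemplar_block = "\n\n".join(serialize(ex) for ex in exemplars)
    return (
        "You are a layout designer. Reason step by step about alignment, overlap and "
        "reading order, then output the final layout in the same format.\n\n"
        f"Constraints:\n{constraints}\n\n"
        f"Retrieved exemplar layouts:\n{exemplar_block}\n\n"
        f"Coarse draft to refine:\n{serialize(coarse_layout)}\n"
    )

prompt = build_refinement_prompt(
    "poster, title above image, logo in a corner",
    [[Element("title", 10, 5, 80, 10), Element("image", 10, 20, 80, 60)]],
    [Element("title", 12, 4, 70, 12), Element("image", 8, 25, 84, 55), Element("logo", 2, 90, 10, 8)],
)
print(prompt)
```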
[211] Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation
Yunxuan Cai, Sitao Xiang, Zongjian Li, Haiwei Chen, Yajie Zhao
Main category: cs.CV
TL;DR: A system for generating and editing high-quality 3D face assets using a GAN-based generator trained on synthetic data from diffusion models, with semantic attribute control and web-based interactive editing.
Details
Motivation: Current digital face modeling faces limitations in diversity, expressiveness, and control due to requirements for specialized capture devices, manual labor, and suitable actors. The paper aims to overcome these limitations through generative approaches.
Method: 1) Created a novel data generation pipeline using pre-trained diffusion models to synthesize 44,000 high-quality 3D face models; 2) Developed normalization module to convert diffusion outputs to scanned data quality; 3) Built efficient GAN-based generator that accepts semantic attributes to produce geometry and albedo; 4) Implemented asset refinement for physically-based facial assets; 5) Created web-based interactive tool.
Result: Successfully generated 44,000 diverse 3D face models, developed a controllable GAN generator with semantic attribute input, created comprehensive system for face asset creation/editing, and built interactive web tool. Extensive experiments validated quality and control capabilities.
Conclusion: The proposed system demonstrates that semantically controllable generative networks can significantly enhance control and diversity in digital face modeling, overcoming limitations of traditional capture-based approaches through synthetic data generation and interactive editing tools.
Abstract: Digital modeling and reconstruction of human faces serve various applications. However, its availability is often hindered by the requirements of data capturing devices, manual labor, and suitable actors. This situation restricts the diversity, expressiveness, and control over the resulting models. This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. To enhance diversity beyond the limited human faces scanned in a controlled setting, we introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. Our proposed normalization module converts synthesized data from the diffusion model into high-quality scanned data. Using the 44,000 face models we obtained, we further developed an efficient GAN-based generator. This generator accepts semantic attributes as input, and generates geometry and albedo. It also allows continuous post-editing of attributes in the latent space. Our asset refinement component subsequently creates physically-based facial assets. We introduce a comprehensive system designed for creating and editing high-quality face assets. Our proposed model has undergone extensive experiment, comparison and evaluation. We also integrate everything into a web-based interactive tool. We aim to make this tool publicly available with the release of the paper.
[212] StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
Cailin Zhuang, Yaoqi Hu, Xuanyang Zhang, Wei Cheng, Jiacheng Bao, Shengqi Liu, Yiying Yang, Xianfang Zeng, Gang Yu, Ming Li
Main category: cs.CV
TL;DR: StyleMe3D: A hierarchical framework for high-fidelity 3D Gaussian Splatting stylization using multi-level style representations, dynamic score distillation, and CLIP-based alignment to preserve geometric details while achieving coherent artistic styles.
Details
Motivation: Current 3D Gaussian Splatting stylization methods are limited in representing diverse artistic styles, often producing low-level texture replacements or semantically inconsistent outputs. There's a need for comprehensive stylization that preserves geometric fidelity while achieving high-quality artistic transfer.
Method: Proposes StyleMe3D with three key components: 1) Dynamic Style Score Distillation (DSSD) using style-aware diffusion models for high-level semantic guidance, 2) Multi-modal CLIP alignment with Contrastive Style Descriptor (middle-level style similarity) and 3D Gaussian Quality Assessment (global regularization), and 3) VGG-based Simultaneously Optimized Scale module for fine-grained texture refinement.
Result: Extensive experiments show the method consistently preserves intricate geometric details and achieves coherent stylistic effects across entire scenes, significantly surpassing state-of-the-art baselines in both qualitative and quantitative evaluations.
Conclusion: StyleMe3D provides a comprehensive hierarchical framework for 3D Gaussian Splatting stylization that successfully disentangles multi-level style representations while maintaining geometric fidelity, offering superior artistic style transfer capabilities.
Abstract: Current 3D Gaussian Splatting stylization approaches are limited in their ability to represent diverse artistic styles, frequently defaulting to low-level texture replacement or yielding semantically inconsistent outputs. In this paper, we introduce StyleMe3D, a novel hierarchical framework that achieves comprehensive, high-fidelity stylization by disentangling multi-level style representations while preserving geometric fidelity. The cornerstone of StyleMe3D is Dynamic Style Score Distillation (DSSD), which harnesses latent priors from a style-aware diffusion model to provide high-level semantic guidance, ensuring robust and expressive style transfer. To further refine this distillation process, we propose a multi-modal alignment strategy using the CLIP latent space: a CLIP-based style stream evaluator (Contrastive Style Descriptor) that enforces middle-level stylistic similarity, and a CLIP-based content stream evaluator (3D Gaussian Quality Assessment) that acts as a global regularizer to mitigate typical GS quality degradation. Finally, a VGG-based Simultaneously Optimized Scale module is integrated to refine fine-grained texture details at the low-level. Extensive experiments demonstrate that our method consistently preserves intricate geometric details and achieves coherent stylistic effects across entire scenes, significantly surpassing state-of-the-art baselines in both qualitative and quantitative evaluations.
[213] Event2Vec: Processing Neuromorphic Events Directly by Representations in Vector Space
Wei Fang, Priyadarshini Panda
Main category: cs.CV
TL;DR: Event2vec: A novel representation method that enables Transformers to directly process sparse, asynchronous neuromorphic event camera data with high efficiency and accuracy.
Details
Motivation: Neuromorphic event cameras have superior temporal resolution and efficiency but their sparse, asynchronous data format is incompatible with conventional deep learning methods. Existing approaches either lose event characteristics during conversion to dense representations or use irregular models that can't leverage GPU acceleration effectively.
Method: Inspired by word-to-vector models, the authors draw an analogy between words and events to create event2vec, a novel representation that allows neural networks to process events directly. This approach is fully compatible with Transformer architectures and GPU parallel processing.
Result: Event2vec achieves high accuracy on DVS Gesture, ASL-DVS, and DVS-Lip benchmarks. It is remarkably parameter-efficient, features high throughput and low latency, and maintains performance even with extremely low numbers of events or low spatial resolutions.
Conclusion: Event2vec introduces a novel paradigm for neuromorphic vision by enabling direct integration of sparse event data into high-throughput Transformer architectures, resolving the conflict between maintaining data sparsity and maximizing GPU efficiency for real-time applications.
Abstract: Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Existing methods either convert the events into dense synchronous frame representations for processing by powerful CNNs or Transformers, but lose the asynchronous, sparse and high temporal resolution characteristics of events during the conversion process; or adopt irregular models such as sparse convolution, spiking neural networks, or graph neural networks to process the irregular event representations but fail to take full advantage of GPU acceleration. Inspired by word-to-vector models, we draw an analogy between words and events to introduce event2vec, a novel representation that allows neural networks to process events directly. This approach is fully compatible with the parallel processing capabilities of Transformers. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks, showing that event2vec is remarkably parameter-efficient, features high throughput and low latency, and achieves high accuracy even with an extremely low number of events or low spatial resolutions. Event2vec introduces a novel paradigm by demonstrating for the first time that sparse, irregular event data can be directly integrated into high-throughput Transformer architectures. This breakthrough resolves the long-standing conflict between maintaining data sparsity and maximizing GPU efficiency, offering a promising balance for real-time, low-latency neuromorphic vision tasks. The code is provided in https://github.com/Intelligent-Computing-Lab-Panda/event2vec.
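The word-to-event analogy can be sketched by giving each discrete event attribute its own learned embedding table, the same way a word id indexes an embedding row; the table layout, the summation rule, and the timestamp projection below are assumptions for illustration, not the released event2vec code.

```python
import torch
import torch.nn as nn

class Event2VecSketch(nn.Module):
    """Embed raw events (x, y, polarity, timestamp) directly into vectors.

    Sketch under assumptions: separate embedding tables for x, y and polarity are
    summed, and the normalized timestamp enters through a small linear layer."""

    def __init__(self, width: int, height: int, dim: int = 256):
        super().__init__()
        self.x_emb = nn.Embedding(width, dim)
        self.y_emb = nn.Embedding(height, dim)
        self.p_emb = nn.Embedding(2, dim)
        self.t_proj = nn.Linear(1, dim)

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (N, 4) columns = x, y, polarity, timestamp in [0, 1]
        x, y, p = events[:, 0].long(), events[:, 1].long(), events[:, 2].long()
        t = events[:, 3:4].float()
        return self.x_emb(x) + self.y_emb(y) + self.p_emb(p) + self.t_proj(t)

events = torch.tensor([[10.0, 20.0, 1.0, 0.05], [11.0, 20.0, 0.0, 0.06]])
vectors = Event2VecSketch(width=128, height=128)(events)
print(vectors.shape)  # torch.Size([2, 256]) -- ready to feed a Transformer
```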
[214] DPMambaIR: All-in-One Image Restoration via Degradation-Aware Prompt State Space Model
Zhanwen Liu, Sai Zhou, Yuchao Dai, Yang Wang, Yisheng An, Xiangmo Zhao
Main category: cs.CV
TL;DR: DPMambaIR is a unified image restoration framework that handles multiple degradation types using fine-grained degradation extraction and a state space model with degradation-aware prompts.
Details
Motivation: Existing all-in-one image restoration approaches lack fine-grained degradation modeling and struggle with multi-task conflicts. Current methods use degradation-specific models or coarse-grained prompts, limiting their effectiveness across diverse degradation types.
Method: Proposes DPMambaIR with: 1) Fine-grained degradation extractor to capture detailed degradation information, 2) Degradation-Aware Prompt State Space Model (DP-SSM) that incorporates degradation features as dynamic prompts into state space modeling, and 3) Complementary High-Frequency Enhancement Block (HEB) to recover local details.
Result: Achieves state-of-the-art performance on mixed dataset with 7 degradation types: 27.69dB PSNR and 0.893 SSIM, demonstrating superiority over existing approaches.
Conclusion: DPMambaIR provides an effective unified solution for all-in-one image restoration by addressing fine-grained degradation modeling and multi-task balancing through degradation-aware state space modeling and high-frequency enhancement.
Abstract: All-in-One image restoration aims to address multiple image degradation problems using a single model, offering a more practical and versatile solution compared to designing dedicated models for each degradation type. Existing approaches typically rely on Degradation-specific models or coarse-grained degradation prompts to guide image restoration. However, they lack fine-grained modeling of degradation information and face limitations in balancing multi-task conflicts. To overcome these limitations, we propose DPMambaIR, a novel All-in-One image restoration framework that introduces a fine-grained degradation extractor and a Degradation-Aware Prompt State Space Model (DP-SSM). The DP-SSM leverages the fine-grained degradation features captured by the extractor as dynamic prompts, which are then incorporated into the state space modeling process. This enhances the model’s adaptability to diverse degradation types, while a complementary High-Frequency Enhancement Block (HEB) recovers local high-frequency details. Extensive experiments on a mixed dataset containing seven degradation types show that DPMambaIR achieves the best performance, with 27.69dB and 0.893 in PSNR and SSIM, respectively. These results highlight the potential and superiority of DPMambaIR as a unified solution for All-in-One image restoration.
[215] Histo-Miner: Deep learning based tissue features extraction pipeline from H&E whole slide images of cutaneous squamous cell carcinoma
Lucas Sancéré, Carina Lorenz, Doris Helbig, Oana-Diana Persa, Sonja Dengler, Alexander Kreuter, Martim Laimer, Roland Lang, Anne Fröhlich, Jennifer Landsberg, Johannes Brägelmann, Katarzyna Bozek
Main category: cs.CV
TL;DR: Histo-Miner is a deep learning pipeline for analyzing skin whole-slide images, generating datasets for nuclei/tumor segmentation, and predicting immunotherapy response in cutaneous squamous cell carcinoma patients.
Details
Motivation: There's a lack of labeled datasets and open-source pipelines specifically for skin tissue analysis in digital pathology, particularly for cutaneous squamous cell carcinoma (cSCC) which is a frequent non-melanoma skin cancer.
Method: Developed a pipeline using convolutional neural networks and vision transformers for nucleus segmentation/classification and tumor region segmentation on two datasets (47,392 annotated cell nuclei and 144 tumor-segmented WSIs). Generated compact feature vectors summarizing tissue morphology and cellular interactions for downstream tasks.
Result: Achieved mPQ of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification, and mIoU of 0.907 for tumor segmentation. Successfully predicted cSCC patient response to immunotherapy using pre-treatment WSIs from 45 patients, identifying key predictive features.
Conclusion: Histo-Miner provides an effective pipeline for skin tissue analysis with clinical applications, offering interpretable features for immunotherapy response prediction and insights into underlying biology.
Abstract: Recent advancements in digital pathology have enabled comprehensive analysis of Whole-Slide Images (WSI) from tissue samples, leveraging high-resolution microscopy and computational capabilities. Despite this progress, there is a lack of labeled datasets and open source pipelines specifically tailored for analysis of skin tissue. Here we propose Histo-Miner, a deep learning-based pipeline for analysis of skin WSIs and generate two datasets with labeled nuclei and tumor regions. We develop our pipeline for the analysis of patient samples of cutaneous squamous cell carcinoma (cSCC), a frequent non-melanoma skin cancer. Utilizing the two datasets, comprising 47,392 annotated cell nuclei and 144 tumor-segmented WSIs respectively, both from cSCC patients, Histo-Miner employs convolutional neural networks and vision transformers for nucleus segmentation and classification as well as tumor region segmentation. Performance of trained models positively compares to state of the art with multi-class Panoptic Quality (mPQ) of 0.569 for nucleus segmentation, macro-averaged F1 of 0.832 for nucleus classification and mean Intersection over Union (mIoU) of 0.907 for tumor region segmentation. From these predictions we generate a compact feature vector summarizing tissue morphology and cellular interactions, which can be used for various downstream tasks. Here, we use Histo-Miner to predict cSCC patient response to immunotherapy based on pre-treatment WSIs from 45 patients. Histo-Miner identifies percentages of lymphocytes, the granulocyte to lymphocyte ratio in tumor vicinity and the distances between granulocytes and plasma cells in tumors as predictive features for therapy response. This highlights the applicability of Histo-Miner to clinically relevant scenarios, providing direct interpretation of the classification and insights into the underlying biology.
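Two of the reported predictive features (lymphocyte percentage and the granulocyte-to-lymphocyte ratio) reduce to simple statistics over per-nucleus class labels; the label strings and input format in this sketch are assumed, and the distance-based feature is omitted for brevity.

```python
import numpy as np

def tissue_features(cell_types: np.ndarray) -> dict:
    """Compute simple composition features from per-nucleus class labels.

    cell_types: array of strings, one label per detected nucleus
                (e.g. 'lymphocyte', 'granulocyte', 'plasma', 'tumor'); format assumed."""
    n = len(cell_types)
    n_lymph = int(np.sum(cell_types == "lymphocyte"))
    n_gran = int(np.sum(cell_types == "granulocyte"))
    return {
        "pct_lymphocytes": n_lymph / n if n else 0.0,
        "granulocyte_to_lymphocyte_ratio": n_gran / n_lymph if n_lymph else float("inf"),
    }

labels = np.array(["lymphocyte", "tumor", "granulocyte", "lymphocyte", "plasma"])
print(tissue_features(labels))
```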
[216] Improved Bag-of-Words Image Retrieval with Geometric Constraints for Ground Texture Localization
Aaron Wilhelm, Nils Napp
Main category: cs.CV
TL;DR: Improved bag-of-words image retrieval system for ground texture localization using downward-facing cameras, with enhanced accuracy for global localization and better precision/recall for loop closure detection in SLAM.
Details
Motivation: Ground texture localization using downward-facing cameras offers cost-effective, high-precision localization that works in dynamic environments without environmental modifications. Existing bag-of-words systems for this application can be improved for better performance.
Method: Uses approximate k-means (AKM) vocabulary with soft assignment, and exploits orientation consistency and constant scale constraints inherent to ground texture localization. Presents both high-accuracy and high-speed versions tailored for global localization vs. loop closure detection needs.
Result: Achieves substantially higher accuracy for global localization and higher precision/recall for loop closure detection in SLAM. Ablation study validates each improvement, and the method can readily replace existing generic BoW systems in ground texture localization pipelines.
Conclusion: The improved BoW system provides significant performance gains for ground texture localization applications, with versions optimized for different SLAM needs (global localization vs. loop closure detection), making it a practical upgrade for existing systems.
Abstract: Ground texture localization using a downward-facing camera offers a low-cost, high-precision localization solution that is robust to dynamic environments and requires no environmental modification. We present a significantly improved bag-of-words (BoW) image retrieval system for ground texture localization, achieving substantially higher accuracy for global localization and higher precision and recall for loop closure detection in SLAM. Our approach leverages an approximate k-means (AKM) vocabulary with soft assignment, and exploits the consistent orientation and constant scale constraints inherent to ground texture localization. Identifying the different needs of global localization vs. loop closure detection for SLAM, we present both high-accuracy and high-speed versions of our algorithm. We test the effect of each of our proposed improvements through an ablation study and demonstrate our method’s effectiveness for both global localization and loop closure detection. With numerous ground texture localization systems already using BoW, our method can readily replace other generic BoW systems in their pipeline and immediately improve their results.
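One exploited constraint, that all correct matches between a query and a database image should agree on a single relative rotation (with scale held constant by the fixed camera height), can be sketched as an orientation-consistency filter over candidate matches; the histogram binning below is an assumption, not the paper's exact scheme.

```python
import numpy as np

def filter_by_orientation_consistency(angles_query: np.ndarray,
                                      angles_db: np.ndarray,
                                      bin_width_deg: float = 10.0) -> np.ndarray:
    """Keep only matches whose relative keypoint orientation agrees with the dominant one.

    angles_query, angles_db: per-match keypoint orientations in degrees.
    Returns a boolean mask over matches. Sketch of the constraint, not the full system."""
    diff = (angles_db - angles_query) % 360.0
    bins = np.arange(0.0, 360.0 + bin_width_deg, bin_width_deg)
    hist, _ = np.histogram(diff, bins=bins)
    dominant = bins[np.argmax(hist)]
    # accept matches within one bin of the dominant relative rotation (wrap-around aware)
    delta = np.minimum(np.abs(diff - dominant), 360.0 - np.abs(diff - dominant))
    return delta <= bin_width_deg

q = np.array([10.0, 50.0, 200.0, 95.0])
d = np.array([40.0, 80.0, 100.0, 125.0])  # mostly a consistent +30 degree rotation
print(filter_by_orientation_consistency(q, d))  # [ True  True False  True]
```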
[217] Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning
Zhongyu Chen, Rong Zhao, Xie Han, Xindong Guo, Song Wang, Zherui Qiao
Main category: cs.CV
TL;DR: A physics-driven approach combining elastic deformation modeling with data-driven implicit fields to learn point cloud representations that capture relationships between local features and global structure.
Details
Motivation: Existing point cloud methods focus only on spatial distribution and overlook relationships between local information and global structure, limiting representation accuracy. Real-world object deformation propagates from local to global, suggesting physics-driven mechanisms could improve structural modeling and generalization.
Method: Dual-task encoder-decoder framework combining data-driven implicit fields with physics-driven elastic deformation. Uses physics-based loss functions to predict localized deformation and capture correspondence between local structural changes and global shape variations.
Result: The method learns fine-grained features in point clouds and models structural relationships between local regions and whole shapes, enhancing generalization and interpretability for downstream tasks.
Conclusion: Incorporating physics-driven mechanisms into data-driven point cloud methods effectively addresses limitations in structural modeling by capturing topological relationships between local parts and global objects.
Abstract: Existing point cloud representation learning methods primarily rely on data-driven strategies to extract geometric information from large amounts of scattered data. However, most methods focus solely on the spatial distribution features of point clouds while overlooking the relationship between local information and the whole structure, which limits the accuracy of point cloud representation. Local information reflects the fine-grained variations of an object, while the whole structure is determined by the interaction and combination of these local features, collectively defining the object’s shape. In the real world, objects undergo deformation under external forces, and this deformation gradually affects the whole structure through the propagation of forces from local regions, thereby altering the object’s geometric features. Therefore, appropriately introducing a physics-driven mechanism to capture the topological relationships between local parts and the whole object can effectively mitigate the limitations of data-driven point cloud methods in structural modeling, and enhance the generalization and interpretability of point cloud representations for downstream tasks such as understanding and recognition. Inspired by this, we incorporate a physics-driven mechanism into the data-driven method to learn fine-grained features in point clouds and model the structural relationship between local regions and the whole shape. Specifically, we design a dual-task encoder-decoder framework that combines the geometric modeling capability of data-driven implicit fields with physics-driven elastic deformation. Through the integration of physics-based loss functions, the framework is guided to predict localized deformation and explicitly capture the correspondence between local structural changes and whole shape variations.
[218] Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning
Qiang Zhu, Kuan Lu, Menghao Huo, Yuxiao Li
Main category: cs.CV
TL;DR: Diffusion Transformers (DiT) adapted for image-to-image translation using CLIP embeddings for conditioning, achieving high-quality translations without text or labels
Details
Motivation: To develop a diffusion-based framework for image-to-image translation that leverages the strengths of diffusion models and transformers, providing an alternative to GAN-based approaches while maintaining semantic consistency and visual fidelity.
Method: Adapts Diffusion Transformers (DiT) for image-to-image translation, conditions on CLIP image embeddings for guidance, and incorporates CLIP similarity loss for semantic consistency plus LPIPS perceptual loss for visual quality.
Result: Achieves high-quality, semantically faithful translations on face2comics (real faces to comic-style) and edges2shoes (edge maps to realistic shoes) benchmark datasets
Conclusion: DiT with CLIP-based conditioning offers a promising alternative to GANs for paired image-to-image translation, combining diffusion denoising with transformer global modeling for structurally consistent results
Abstract: Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.
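The auxiliary objectives sit on top of the usual denoising loss: a CLIP image-embedding similarity term and an LPIPS perceptual term. The sketch below assumes generic clip_encoder and lpips_fn callables with the stated shapes rather than any specific library API.

```python
import torch
import torch.nn.functional as F

def translation_losses(pred_image, target_image, clip_encoder, lpips_fn,
                       w_clip: float = 1.0, w_lpips: float = 1.0):
    """Auxiliary losses used alongside the diffusion denoising objective (sketch).

    clip_encoder(img) -> (B, D) image embedding and lpips_fn(a, b) -> (B,) distance
    are assumed callables; any CLIP/LPIPS implementation with these shapes would do."""
    emb_pred = F.normalize(clip_encoder(pred_image), dim=-1)
    emb_tgt = F.normalize(clip_encoder(target_image), dim=-1)
    clip_loss = 1.0 - (emb_pred * emb_tgt).sum(dim=-1).mean()  # 1 - cosine similarity
    lpips_loss = lpips_fn(pred_image, target_image).mean()
    return w_clip * clip_loss + w_lpips * lpips_loss

# toy stand-ins for the assumed encoders, just to show the call pattern
clip_stub = lambda img: img.flatten(1)[:, :64]
lpips_stub = lambda a, b: (a - b).abs().mean(dim=(1, 2, 3))
loss = translation_losses(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32), clip_stub, lpips_stub)
print(loss)
```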
[219] LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision
Anthony Fuller, Yousef Yassin, Junfeng Wen, Daniel G. Kyrollos, Tarek Ibrahim, James R. Green, Evan Shelhamer
Main category: cs.CV
TL;DR: LookWhere: Adaptive vision transformer method that uses a low-resolution selector to predict where to compute in high-resolution images, reducing computational cost without processing full high-resolution input.
Details
Motivation: Vision transformers are computationally expensive, especially at high resolutions where token count grows quadratically. Current methods either prune already-computed tokens (wasting computation) or require complex per-task optimization. Need for efficient adaptive computation that learns where to compute without full high-resolution processing.
Method: Two-stage approach: 1) Low-resolution selector predicts important regions, 2) High-resolution extractor processes only selected regions. Jointly pretrained without task supervision via distillation from self-supervised teacher. Learns both where and what to compute simultaneously.
Result: Achieves up to 34x FLOP reduction and 6x speedup on high-resolution Traffic Signs dataset while maintaining accuracy. Also improves accuracy on standard tasks (ImageNet classification, ADE20K segmentation) with 1.36x speedup.
Conclusion: LookWhere provides an economical and accurate method for adaptive computation in vision transformers, enabling efficient processing of high-resolution images while maintaining or improving performance on various vision tasks.
Abstract: Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts transferrable representations of images. We show that LookWhere excels at sparse recognition on high-resolution inputs (Traffic Signs), maintaining accuracy while reducing FLOPs by up to 34x and time by 6x. It also excels at standard recognition tasks that are global (ImageNet classification) or local (ADE20K segmentation), improving accuracy while reducing time by 1.36x. See https://github.com/antofuller/lookwhere for the code and weights.
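The division of labor, a cheap low-resolution selector scoring patches and a high-resolution extractor run only on the top-k selected patches, can be sketched as follows; the patch size, scoring head, and top-k rule are illustrative assumptions, not the released LookWhere architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookWhereSketch(nn.Module):
    """Two-stage sketch: a selector scores coarse patches at low resolution, then an
    extractor encodes only the top-k corresponding high-resolution patches."""

    def __init__(self, dim: int = 192, patch: int = 32, k: int = 16):
        super().__init__()
        self.patch, self.k = patch, k
        self.selector = nn.Sequential(nn.Conv2d(3, dim, 8, 8), nn.GELU(),
                                      nn.Conv2d(dim, 1, 1))   # low-res saliency head
        self.extractor = nn.Linear(3 * patch * patch, dim)    # per-patch encoder

    def forward(self, image_hr: torch.Tensor) -> torch.Tensor:
        B, C, H, W = image_hr.shape
        image_lr = F.interpolate(image_hr, scale_factor=0.25, mode="bilinear")
        # one selector score per high-resolution patch (H, W assumed divisible by patch)
        scores = F.adaptive_avg_pool2d(self.selector(image_lr),
                                       (H // self.patch, W // self.patch)).flatten(1)
        topk = scores.topk(self.k, dim=1).indices              # where to look
        patches = (image_hr
                   .unfold(2, self.patch, self.patch)
                   .unfold(3, self.patch, self.patch)          # (B, C, Ph, Pw, p, p)
                   .permute(0, 2, 3, 1, 4, 5)
                   .reshape(B, -1, C * self.patch * self.patch))
        selected = torch.gather(patches, 1,
                                topk.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
        return self.extractor(selected)                        # (B, k, dim) token features

tokens = LookWhereSketch()(torch.rand(2, 3, 256, 256))
print(tokens.shape)  # torch.Size([2, 16, 192])
```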
[220] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
Ruibin Li, Tao Yang, Yangming Shi, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
Main category: cs.CV
TL;DR: A unified many-for-many framework that trains a single diffusion model for multiple visual generation and manipulation tasks using joint image-video learning and lightweight adapters.
Details
Motivation: Training separate models for each visual generation task is costly and inefficient. Existing T2V models require expensive high-quality annotations, and most models are limited to single or few tasks.Method: Uses lightweight adapters to unify different task conditions, employs joint image-video learning from scratch, and incorporates depth maps as 3D spatial conditioning to improve perception.
Result: Two model versions (8B and 2B parameters) can perform over 10 different tasks. The 8B model shows competitive video generation performance compared to open-source and commercial engines.
Conclusion: The many-for-many framework successfully creates a unified model for multiple visual generation tasks with improved video generation capabilities through joint learning and depth conditioning.
Abstract: Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially, text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source codes are available at https://github.com/leeruibin/MfM.git.
[221] GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
Mateusz Michalkiewicz, Anekha Sokhal, Tadeusz Michalkiewicz, Piotr Pawlikowski, Mahsa Baktashmotlagh, Varun Jampani, Guha Balakrishnan
Main category: cs.CV
TL;DR: GIQ benchmark evaluates geometric reasoning in vision/VLMs using diverse 3D shapes, revealing significant shortcomings in current models’ understanding of basic geometric properties.
Details
Motivation: Recent work questions whether modern vision and vision-language models truly understand geometric properties despite impressive benchmark performance. There's a need for comprehensive evaluation of geometric reasoning capabilities.Method: Created GIQ benchmark with synthetic/real images and 3D meshes of diverse polyhedra (Platonic, Archimedean, Johnson, Catalan solids, stellations, compounds). Conducted systematic experiments: monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, zero-shot shape classification.
Result: State-of-the-art reconstruction algorithms struggle with basic Platonic solids. Foundation models capture some symmetry elements but fail at detailed geometric differentiation like mental rotation. Advanced VLMs (ChatGPT, Gemini, Claude) show low accuracy on basic shape properties.
Conclusion: Current vision and vision-language models have significant geometric reasoning gaps. GIQ provides structured platform to benchmark and improve geometry-aware representation learning.
Abstract: Modern monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet recent works cast doubt on their true understanding of geometric properties. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra covering varying levels of complexity and symmetry, from Platonic, Archimedean, Johnson, and Catalan solids to stellations and compound shapes. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric Platonic solids accurately. Next, although foundation models may be shown via linear and non-linear probing to capture specific 3D symmetry elements, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants such as ChatGPT, Gemini, and Claude exhibit remarkably low accuracy in interpreting basic shape properties such as face geometry, convexity, and compound structures of complex polyhedra. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.
[222] Feature Engineering is Not Dead: Reviving Classical Machine Learning with Entropy, HOG, and LBP Feature Fusion for Image Classification
Abhijit Sen, Giridas Maiti, Bikram K. Parida, Bhanu P. Mishra, Mahima Arya, Denys I. Bondar
Main category: cs.CV
TL;DR: Classical ML approach for image classification using Permutation Entropy extended to 2D images, combined with HOG and LBP features, achieving competitive results on benchmark datasets without deep learning.
Details
Motivation: To develop interpretable and computationally efficient image classification methods as alternatives to deep learning models with millions of parameters, focusing on feature engineering with classical machine learning approaches.Method: Extends Permutation Entropy (PE) to 2D images with multiscale, multi-orientation entropy-based feature extraction. Combines PE with Histogram of Oriented Gradients (HOG) for shape/edge structure and Local Binary Patterns (LBP) for micro-texture. Uses 780-dimensional feature set with SVM classifiers optimized via grid search.
Result: Achieves competitive classification performance on Fashion-MNIST, KMNIST, EMNIST, and CIFAR-10 datasets without deep architectures, demonstrating the effectiveness of the fused feature approach.
Conclusion: The fusion of PE with HOG and LBP provides a compact, interpretable, and effective alternative to computationally expensive deep learning models, showing potential for entropy-based descriptors in image classification.
Abstract: Feature engineering continues to play a critical role in image classification, particularly when interpretability and computational efficiency are prioritized over deep learning models with millions of parameters. In this study, we revisit classical machine learning based image classification through a novel approach centered on Permutation Entropy (PE), a robust and computationally lightweight measure traditionally used in time series analysis but rarely applied to image data. We extend PE to two-dimensional images and propose a multiscale, multi-orientation entropy-based feature extraction approach that characterizes spatial order and complexity along rows, columns, diagonals, anti-diagonals, and local patches of the image. To enhance the discriminatory power of the entropy features, we integrate two classic image descriptors: the Histogram of Oriented Gradients (HOG) to capture shape and edge structure, and Local Binary Patterns (LBP) to encode the micro-texture of an image. The resulting hand-crafted feature set, comprising 780 dimensions, is used to train Support Vector Machine (SVM) classifiers optimized through grid search. The proposed approach is evaluated on multiple benchmark datasets, including Fashion-MNIST, KMNIST, EMNIST, and CIFAR-10, where it delivers competitive classification performance without relying on deep architectures. Our results demonstrate that the fusion of PE with HOG and LBP provides a compact, interpretable, and effective alternative to computationally expensive deep learning models with limited interpretability. This demonstrates the potential of entropy-based descriptors and contributes a lightweight, generalizable solution to interpretable image classification in computer vision.
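Since the descriptor is built entirely from classical components, a toy version is easy to write down. The sketch below fuses a simple row/column permutation entropy with HOG and a uniform-LBP histogram and hands the result to an SVM; the PE variant and all hyperparameters are assumptions, and the paper's 780-dimensional descriptor is richer (multiscale and multi-orientation).

```python
import math
import numpy as np
from skimage.feature import hog, local_binary_pattern
from sklearn.svm import SVC

def permutation_entropy(x, order=3):
    """Normalized permutation entropy of a 1D signal (a row or column of an image)."""
    counts = {}
    for i in range(len(x) - order + 1):
        pattern = tuple(np.argsort(x[i:i + order]))
        counts[pattern] = counts.get(pattern, 0) + 1
    probs = np.array(list(counts.values()), dtype=float)
    probs /= probs.sum()
    return float(-(probs * np.log(probs)).sum() / np.log(math.factorial(order)))

def image_features(img):
    """Fuse row/column PE, HOG, and a uniform-LBP histogram for one grayscale image in [0, 1]."""
    pe = [np.mean([permutation_entropy(r) for r in img]),
          np.mean([permutation_entropy(c) for c in img.T])]
    hog_vec = hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    lbp = local_binary_pattern((img * 255).astype(np.uint8), P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)  # uniform LBP codes 0..9
    return np.concatenate([pe, hog_vec, lbp_hist])

# usage sketch: X = np.stack([image_features(im) for im in images]); SVC(kernel="rbf").fit(X, y)
```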
[223] Personalized Safety Alignment for Text-to-Image Diffusion Models
Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng, Hongcheng Gao, Xiao Zhang, Rex Ying
Main category: cs.CV
TL;DR: PSA framework enables personalized safety alignment for text-to-image diffusion models by adapting safety constraints to individual user preferences rather than using rigid uniform standards.
Details
Motivation: Current text-to-image diffusion models use rigid safety mechanisms that don't account for diverse user preferences shaped by age, culture, or personal beliefs, limiting their practical deployment.Method: Proposed Personalized Safety Alignment (PSA) framework with Sage dataset (1,000 simulated user profiles) and parameter-efficient cross-attention adapter to dynamically modulate generation based on individual safety boundaries.
Result: PSA achieves calibrated safety-quality trade-off: relaxes constraints for permissive profiles to enhance visual fidelity while enforcing state-of-the-art suppression for restrictive profiles, outperforming static baselines and showing superior instruction adherence.
Conclusion: Personalization is a vital direction for creating adaptive, user-centered, and responsible generative AI, with PSA establishing a framework for transitioning from static filtration to user-conditioned adaptation.
Abstract: Text-to-image diffusion models have revolutionized visual content generation, yet their deployment is hindered by a fundamental limitation: safety mechanisms enforce rigid, uniform standards that fail to reflect diverse user preferences shaped by age, culture, or personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that transitions generative safety from static filtration to user-conditioned adaptation. We introduce Sage, a large-scale dataset capturing diverse safety boundaries across 1,000 simulated user profiles, covering complex risks often missed by traditional datasets. By integrating these profiles via a parameter-efficient cross-attention adapter, PSA dynamically modulates generation to align with individual sensitivities. Extensive experiments demonstrate that PSA achieves a calibrated safety-quality trade-off: under permissive profiles, it relaxes over-cautious constraints to enhance visual fidelity, while under restrictive profiles, it enforces state-of-the-art suppression, significantly outperforming static baselines. Furthermore, PSA exhibits superior instruction adherence compared to prompt-engineering methods, establishing personalization as a vital direction for creating adaptive, user-centered, and responsible generative AI. Our code, data, and models are publicly available at https://github.com/M-E-AGI-Lab/PSAlign.
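One way to picture the parameter-efficient user conditioning is a small gated cross-attention adapter in which latent tokens attend to an embedding of the user profile. Everything below (dimensions, the single profile token, the zero-initialized gate) is an illustrative assumption rather than the released PSA adapter.

```python
import torch
import torch.nn as nn

class ProfileAdapter(nn.Module):
    """Residual cross-attention adapter conditioning latent tokens on a user profile vector."""
    def __init__(self, dim=320, profile_dim=64, heads=4):
        super().__init__()
        self.proj = nn.Linear(profile_dim, dim)       # embed the user profile vector
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))      # starts as identity, learns to modulate

    def forward(self, latent_tokens, profile_vec):
        ctx = self.proj(profile_vec).unsqueeze(1)     # (B, 1, dim) profile token
        out, _ = self.attn(latent_tokens, ctx, ctx)   # latents attend to the profile
        return latent_tokens + self.gate * out        # residual, parameter-efficient update
```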
[224] BioLite U-Net: Edge-Deployable Semantic Segmentation for In Situ Bioprinting Monitoring
Usman Haider, Lukasz Szemet, Daniel Kelly, Vasileios Sergis, Andrew C. Daly, Karl Mason
Main category: cs.CV
TL;DR: Lightweight semantic segmentation framework for real-time bioprinting monitoring using depthwise separable convolutions achieves high accuracy with minimal computational footprint.
Details
Motivation: Real-time monitoring of bioprinting processes is challenging due to limited imaging data and resource-constrained hardware, requiring efficient semantic segmentation to differentiate nozzle, bioink, and background for quality control.Method: Proposed BioLite U-Net architecture using depthwise separable convolutions, benchmarked against MobileNetV2/V3 baselines on a manually annotated dataset of 787 RGB bioprinting images with three classes.
Result: BioLite U-Net achieves 92.85% mIoU and 96.17% Dice score, is 1300x smaller than MobileNetV2-DeepLabV3+, and runs at 335 ms per frame on Raspberry Pi 4B for near real-time performance.
Conclusion: BioLite U-Net provides superior accuracy-efficiency tradeoff for real-time bioprinting monitoring, enabling intelligent closed-loop systems with practical deployability on embedded hardware.
Abstract: Bioprinting is a rapidly advancing field that offers a transformative approach to fabricating tissue and organ models through the precise deposition of cell-laden bioinks. Ensuring the fidelity and consistency of printed structures in real-time remains a core challenge, particularly under constraints imposed by limited imaging data and resource-constrained embedded hardware. Semantic segmentation of the extrusion process, differentiating between nozzle, extruded bioink, and surrounding background, enables in situ monitoring critical to maintaining print quality and biological viability. In this work, we introduce a lightweight semantic segmentation framework tailored for real-time bioprinting applications. We present a novel, manually annotated dataset comprising 787 RGB images captured during the bioprinting process, labeled across three classes: nozzle, bioink, and background. To achieve fast and efficient inference suitable for integration with bioprinting systems, we propose a BioLite U-Net architecture that leverages depthwise separable convolutions to drastically reduce computational load without compromising accuracy. Our model is benchmarked against MobileNetV2 and MobileNetV3-based segmentation baselines using mean Intersection over Union (mIoU), Dice score, and pixel accuracy. All models were evaluated on a Raspberry Pi 4B to assess real-world feasibility. The proposed BioLite U-Net achieves an mIoU of 92.85% and a Dice score of 96.17%, while being over 1300x smaller than MobileNetV2-DeepLabV3+. On-device inference takes 335 ms per frame, demonstrating near real-time capability. Compared to MobileNet baselines, BioLite U-Net offers a superior tradeoff between segmentation accuracy, efficiency, and deployability, making it highly suitable for intelligent, closed-loop bioprinting systems.
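The efficiency claim rests on depthwise separable convolutions, which split a standard convolution into a per-channel spatial filter and a 1x1 pointwise mix. Below is a minimal block of this kind; the channel widths and normalization choices are assumptions, not the released BioLite U-Net.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise 3x3 conv (per-channel spatial filter) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# A standard 3x3 conv from 64 to 64 channels has 36,864 weights; the separable version
# uses 64*9 + 64*64 = 4,672, roughly an 8x reduction at this width.
x = torch.randn(1, 3, 128, 128)
print(DSConv(3, 16)(x).shape)  # torch.Size([1, 16, 128, 128])
```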
[225] SurgLaVi: Large-Scale Hierarchical Dataset for Surgical Vision-Language Representation Learning
Alejandra Perez, Chinedu Nwoye, Ramtin Raji Kermani, Omid Mohareri, Muhammad Abdullah Jamal
Main category: cs.CV
TL;DR: SurgLaVi is the largest surgical vision-language dataset with 240k clip-caption pairs from 200+ procedures, featuring hierarchical annotations and automated pipeline for fine-grained video transcription, enabling improved surgical AI models.
Details
Motivation: Current surgical vision-language pre-training (VLP) is limited by small datasets lacking procedural diversity, semantic quality, and hierarchical structure, hindering progress in surgical AI.Method: Developed SurgLaVi dataset with automated pipeline for fine-grained surgical video transcription and segmentation, featuring dual-modality filtering for quality control and hierarchical annotations at coarse-, mid-, and fine-levels.
Result: SurgCLIP (CLIP-style video-text contrastive model) trained on SurgLaVi achieves state-of-the-art performance across phase, step, action, and tool recognition tasks, often by large margins.
Conclusion: Large-scale, semantically rich, hierarchically structured surgical VLP datasets directly translate to stronger and more generalizable representations, establishing SurgLaVi as a key resource for surgical foundation models.
Abstract: Vision-language pre-training (VLP) offers unique advantages for surgery by aligning language with surgical videos, enabling workflow understanding and transfer across tasks without relying on expert-labeled datasets. However, progress in surgical VLP remains constrained by the limited scale, procedural diversity, semantic quality, and hierarchical structure of existing datasets. In this work, we present SurgLaVi, the largest and most diverse surgical vision-language dataset to date, comprising nearly 240k clip-caption pairs from more than 200 procedures, and featuring hierarchical levels at coarse-, mid-, and fine-level. At the core of SurgLaVi lies a fully automated pipeline that systematically generates fine-grained transcriptions of surgical videos and segments them into coherent procedural units. To ensure high-quality annotations, it applies dual-modality filtering to remove irrelevant and noisy samples. Within this framework, the resulting captions are enriched with contextual detail, producing annotations that are both semantically rich and easy to interpret. To ensure accessibility, we release SurgLaVi-$\beta$, an open-source derivative of 113k clip-caption pairs constructed entirely from public data, which is over four times larger than existing surgical VLP datasets. To demonstrate the value of the SurgLaVi datasets, we introduce SurgCLIP, a CLIP-style video-text contrastive framework with dual encoders, as a representative base model. SurgCLIP achieves consistent improvements across phase, step, action, and tool recognition, surpassing prior state-of-the-art methods, often by large margins. These results validate that large-scale, semantically rich, and hierarchically structured datasets directly translate into stronger and more generalizable representations, establishing SurgLaVi as a key resource for developing surgical foundation models.
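SurgCLIP is described as a standard CLIP-style dual-encoder, so its training signal reduces to a symmetric video-text contrastive loss over clip-caption pairs. The sketch below shows that objective only; the encoders, projection heads, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D) embeddings of paired clips and captions."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                  # (B, B) clip-caption similarity matrix
    targets = torch.arange(v.shape[0], device=v.device)
    # each clip should match its own caption and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```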
[226] TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
Zhongyuan Bao, Lejun Zhang
Main category: cs.CV
TL;DR: TennisTV benchmark evaluates MLLMs on tennis video understanding, revealing challenges with fast sports and providing insights on frame sampling and temporal grounding.
Details
Motivation: MLLMs struggle with fast, high-frequency sports like tennis where rally clips are short but information-dense, creating a need for systematic evaluation in this challenging domain.Method: Created TennisTV benchmark modeling rallies as temporal-ordered stroke sequences, using automated pipelines for filtering and question generation, covering 8 tasks from stroke to rally level with 2527 human-verified questions.
Result: Evaluation of 17 MLLMs revealed two key insights: frame-sampling density needs task-specific balancing, and improving temporal grounding is essential for stronger reasoning in sports video understanding.
Conclusion: TennisTV provides the first comprehensive benchmark for tennis video understanding, highlighting specific challenges for MLLMs in fast sports and offering guidance for future improvements in temporal reasoning.
Abstract: Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks from the stroke level to the rally level and includes 2527 human-verified questions. Evaluating 17 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.
[227] Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Main category: cs.CV
TL;DR: The paper proposes a new evaluation framework called Emotion Statement Judgment to assess Multimodal Large Language Models’ ability to perceive emotions from images, addressing limitations in existing evaluation methods.
Details
Motivation: Current evaluation methods for MLLMs' emotion perception from images have several limitations: they overlook plausible responses, use limited emotional taxonomies, neglect contextual factors, and require labor-intensive annotations. There's inconsistency in reported performance across studies, particularly in zero-shot scenarios.Method: The authors propose an Emotion Statement Judgment task that overcomes existing constraints. They also develop an automated pipeline to efficiently construct emotion-centric statements with minimal human effort. This framework is used to systematically evaluate prevailing MLLMs.
Result: MLLMs show stronger performance in emotion interpretation and context-based emotion judgment, but have relative limitations in comprehending perception subjectivity. Even top-performing models like GPT4o demonstrate significant performance gaps compared to humans.
Conclusion: The paper contributes a fundamental evaluation framework for assessing emotional intelligence in MLLMs and identifies key areas for future improvement, particularly in understanding perception subjectivity and closing the gap with human performance.
Abstract: Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
[228] RefAM: Attention Magnets for Zero-Shot Referral Segmentation
Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
Main category: cs.CV
TL;DR: RefAM: A training-free framework that uses diffusion transformer attention features for zero-shot referring segmentation in images and videos, achieving state-of-the-art performance without fine-tuning or architectural modifications.
Details
Motivation: Existing referring segmentation methods require fine-tuning or multiple pre-trained models with additional training and architectural changes. Diffusion models contain rich semantic information but their attention features haven't been systematically exploited for vision-language grounding tasks.Method: Extracts attention features from diffusion transformers without modifications. Key insights: stop words act as attention magnets that can be filtered; global attention sinks (GAS) in deeper layers can be suppressed/redirected; attention redistribution via appended stop words partitions background activations. Combines these into RefAM framework.
Result: Achieves strong performance on zero-shot referring image and video segmentation benchmarks, surpassing prior methods on most datasets and establishing new state-of-the-art without fine-tuning, additional components, or complex reasoning.
Conclusion: Diffusion transformer attention features are highly effective for vision-language grounding tasks. The proposed training-free approach demonstrates that rich semantic information in diffusion models can be directly leveraged for referring segmentation without architectural changes or additional training.
Abstract: Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits attention scores from diffusion transformers as features for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach achieves strong performance and surpasses prior methods on most datasets, establishing a new state of the art without fine-tuning, additional components, or complex reasoning.
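The core trick, filtering out the stop-word "attention magnets" before building a grounding map, can be illustrated without the diffusion backbone. Given per-token cross-attention maps (the hook that extracts them from a DiT is model-specific and omitted), the sketch below drops stop-word tokens and averages the rest; the stop-word list and normalization are assumptions.

```python
import torch

STOP_WORDS = {"a", "an", "the", "of", "to", "is", "on", "in"}

def grounding_heatmap(cross_attn, tokens):
    """cross_attn: (num_tokens, H, W) attention from each prompt token to image patches.
    tokens: list of prompt token strings aligned with the first dimension."""
    keep = [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]
    heat = cross_attn[keep].mean(dim=0)              # average only over content words
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat                                       # (H, W) normalized map, ready for thresholding

# usage sketch: mask = grounding_heatmap(attn, ["the", "red", "mug"]) > 0.5
```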
[229] PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
Mouxiao Huang, Borui Jiang, Dehua Zheng, Hailin Hu, Kai Han, Xinghao Chen
Main category: cs.CV
TL;DR: PPE (Positional Preservation Embedding) is a parameter-free operator that preserves spatiotemporal structure during visual token compression in MLLMs by disentangling 3D positional encoding, improving performance across vision-language benchmarks.
Details
Motivation: Existing token merging methods in MLLMs reduce computational cost but often disrupt spatial layouts and temporal continuity by ignoring positional relationships between tokens, leading to performance degradation.Method: Proposes PPE - a parameter-free operator that explicitly encodes 3D positions (spatial and temporal) in the token dimension, allowing compressed tokens to encapsulate multiple original positions. Supports cascade clustering for progressive compression.
Result: PPE achieves 2-5% improvements across multiple benchmarks: MMBench (general vision understanding), TextVQA (layout understanding), and VideoMME (temporal understanding).
Conclusion: Preserving positional cues is critical for efficient and effective MLLM reasoning. PPE is a generic operator that can be integrated into existing token merging methods without adjustments.
Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed as \textbf{P}ositional \textbf{P}reservation \textbf{E}mbedding (\textbf{PPE}), which has the main hallmark of preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering – a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of $2\%\sim5\%$ across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning. Our code is available at https://github.com/MouxiaoHuang/PPE.
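A rough way to see what "preserving positions through compression" could look like: when a cluster of visual tokens is averaged into one token, also encode the (t, y, x) coordinates of its members and pack them into reserved channels of the merged token. This is a hedged illustration of the idea, not the PPE operator; the sinusoidal encoding and channel split are assumptions.

```python
import torch

def sinusoidal(pos, dim):
    """pos: (M,) integer positions -> (M, dim) sinusoidal features."""
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * (-4.0 / dim))
    ang = pos.unsqueeze(-1).float() * freqs
    return torch.cat([ang.sin(), ang.cos()], dim=-1)

def merge_with_positions(tokens, positions, cluster_ids, num_clusters, pos_dim=24):
    """tokens: (N, D) visual tokens; positions: (N, 3) their (t, y, x) grid coordinates;
    cluster_ids: (N,) assignment of each token to a merged token."""
    D = tokens.shape[1]
    merged = torch.zeros(num_clusters, D)
    pos_feat = torch.zeros(num_clusters, 3 * pos_dim)
    for c in range(num_clusters):
        members = cluster_ids == c
        if not members.any():
            continue
        merged[c] = tokens[members].mean(dim=0)
        # encode each axis separately and average over merged members, so one compressed
        # token still carries a summary of every position it came from
        enc = torch.cat([sinusoidal(positions[members][:, a], pos_dim) for a in range(3)], dim=-1)
        pos_feat[c] = enc.mean(dim=0)
    return torch.cat([merged, pos_feat], dim=-1)      # (num_clusters, D + 3*pos_dim)
```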
[230] All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Mahlagha Fazeli, Abolfazl Razi
Main category: cs.CV
TL;DR: Survey paper analyzing object detection in Autonomous Vehicles with focus on multimodal sensor fusion, emerging Vision-Language Models, LLMs, and Generative AI approaches rather than traditional techniques.
Details
Motivation: Autonomous Vehicles require reliable object detection in complex multimodal environments, but knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. The survey aims to bridge this gap by providing forward-looking analysis of emerging AI paradigms.Method: Systematic review of AV sensors (camera, ultrasonic, LiDAR, Radar) and fusion strategies, structured categorization of AV datasets (ego-vehicle, infrastructure-based, cooperative), and analysis of cutting-edge detection methodologies including 2D/3D pipelines, hybrid sensor fusion, and transformer-driven approaches with Vision Transformers, LLMs, and VLMs.
Result: Provides comprehensive analysis of current capabilities in AV object detection, highlighting integration potential with LLM/VLM-driven perception frameworks and emerging transformer-based approaches.
Conclusion: The survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities in AV object detection, emphasizing the transformative potential of Vision-Language Models, Large Language Models, and Generative AI for multimodal perception in autonomous driving.
Abstract: Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
[231] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan
Main category: cs.CV
TL;DR: VisMem introduces a cognitive memory framework for Vision-Language Models with short-term visual and long-term semantic memory modules to address visual processing bottlenecks during prolonged generation.
Details
Motivation: VLMs suffer from "visual processing bottlenecks" - losing visual grounding and contextual experience during extended generation tasks, inspired by human cognitive memory theory distinguishing short-term visual and long-term semantic memory.Method: Proposes VisMem framework with dynamic latent vision memories: short-term module for fine-grained perceptual retention and long-term module for abstract semantic consolidation, seamlessly invoked during inference.
Result: Extensive experiments across diverse visual benchmarks show 11.0% average performance boost relative to vanilla model, outperforming all counterparts in understanding, reasoning, and generation tasks.
Conclusion: VisMem establishes a new paradigm for latent-space memory enhancement in VLMs, maintaining both perceptual fidelity and semantic consistency across thinking and generation.
Abstract: Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a “visual processing bottleneck”: a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.0% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: https://github.com/YU-deep/VisMem.git.
[232] REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Di Wu, Liu Liu, Anran Huang, Yuyan Liu, Qiaojun Yu, Shaofan Liu, Liangtu Song, Cewu Lu
Main category: cs.CV
TL;DR: REArtGS++ improves articulated object reconstruction using planar Gaussian splatting with temporal geometry constraints and decoupled screw motion modeling for better generalization.
Details
Motivation: Existing methods like REArtGS struggle with screw-joint or multi-part objects and lack geometric constraints for unseen states, limiting generalizable articulated object reconstruction.Method: Proposes decoupled screw motion modeling without type priors, part-aware Gaussian optimization with motion blending, and introduces temporal geometry constraints via planar Gaussian splatting with temporally consistent regularization using Taylor expansion.
Result: Extensive experiments on synthetic and real-world articulated objects demonstrate superior performance in part-level surface reconstruction and joint parameter estimation compared to existing approaches.
Conclusion: REArtGS++ advances articulated object reconstruction by addressing limitations of previous methods through novel temporal geometry constraints and more flexible motion modeling.
Abstract: Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with a temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without a type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce a time-continuous geometric constraint for articulated modeling, we encourage the Gaussians to be planar and propose a temporally consistent regularization between planar normals and depth through a first-order Taylor expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home.
[233] UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Taixi Chen, Jingyun Chen, Nancy Guo
Main category: cs.CV
TL;DR: Unified Attention-Mamba (UAM) backbone combines attention and Mamba modules in a flexible architecture for multimodal cell classification and segmentation tasks, achieving state-of-the-art performance.
Details
Motivation: Inspired by Mamba's success in vision and language, the authors aim to create a unified architecture that flexibly combines attention and Mamba capabilities without manual ratio tuning, improving encoding capabilities for multimodal tasks.Method: Developed a Unified Attention-Mamba (UAM) backbone with two variants that flexibly integrate attention and Mamba modules within a single cohesive architecture. Built a multimodal UAM framework for joint cell-level classification and image segmentation.
Result: UAM achieves state-of-the-art performance on public benchmarks, improving cell classification accuracy from 74% to 78% (n=349,882 cells) and tumor segmentation precision from 75% to 80% (n=406 patches), surpassing leading image-based foundation models.
Conclusion: The unified Attention-Mamba design provides a flexible and effective backbone for multimodal tasks, eliminating manual tuning requirements while achieving superior performance in biomedical image analysis tasks.
Abstract: Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encoding capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells) and tumor segmentation precision from 75% to 80% ($n$=406 patches).
[234] MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding
Benjamin Beilharz, Thomas S. A. Wallis
Main category: cs.CV
TL;DR: MRD uses differentiable rendering to find 3D scenes that produce identical activations in vision models, probing their implicit 3D understanding through physically-grounded metamers.
Details
Motivation: To understand what 3D scene properties vision models implicitly represent, moving beyond pixel-based analysis to physically-grounded scene understanding.Method: Uses physically-based differentiable rendering to find 3D scene parameters (geometry, materials) that produce the same model activation as target scenes, creating model metamers.
Result: High similarity in model activation between target and optimized scenes, with varying visual reconstructions, showing models’ sensitivity to different physical scene attributes.
Conclusion: MRD enables analysis of how physical scene parameters drive model responses, advancing understanding of both computer and human vision.
Abstract: While deep learning methods have achieved impressive success in many vision benchmarks, it remains difficult to understand and explain the representations and decisions of these models. Though vision models are typically trained on 2D inputs, they are often assumed to develop an implicit representation of the underlying 3D scene (for example, showing tolerance to partial occlusion, or the ability to reason about relative depth). Here, we introduce MRD (metamers rendered differentiably), an approach that uses physically based differentiable rendering to probe vision models’ implicit understanding of generative 3D scene properties, by finding 3D scene parameters that are physically different but produce the same model activation (i.e. are model metamers). Unlike previous pixel-based methods for evaluating model representations, these reconstruction results are always grounded in physical scene descriptions. This means we can, for example, probe a model’s sensitivity to object shape while holding material and lighting constant. As a proof-of-principle, we assess multiple models in their ability to recover scene parameters of geometry (shape) and bidirectional reflectance distribution function (material). The results show high similarity in model activation between target and optimized scenes, with varying visual results. Qualitatively, these reconstructions help investigate the physical scene attributes to which models are sensitive or invariant. MRD holds promise for advancing our understanding of both computer and human vision by enabling analysis of how physical scene parameters drive changes in model responses.
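The metamer search itself is a plain gradient loop once a differentiable renderer is available: optimize scene parameters until the rendered image matches the target's activation at some model layer. In the sketch below, `render` and `model_layer` are placeholder callables (assumptions); MRD's physically based renderer and choice of activation are described in the paper.

```python
import torch

def find_metamer(render, model_layer, target_image, init_params, steps=500, lr=1e-2):
    """Optimize scene parameters (geometry, material, ...) so the rendered image produces
    the same activation as the target image, i.e. a physically grounded model metamer."""
    params = {k: v.clone().requires_grad_(True) for k, v in init_params.items()}
    opt = torch.optim.Adam(params.values(), lr=lr)
    with torch.no_grad():
        target_act = model_layer(target_image)        # activation to be matched
    for _ in range(steps):
        opt.zero_grad()
        image = render(**params)                      # differentiable: gradients flow to scene params
        loss = torch.nn.functional.mse_loss(model_layer(image), target_act)
        loss.backward()
        opt.step()
    return params, loss.item()
```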
[235] MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning
Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Hongwei Xie, Bing Wang, Guang Chen, Dingkang Liang, Xiang Bai
Main category: cs.CV
TL;DR: MindDrive: A Vision-Language-Action framework using LLMs with dual LoRA parameters for autonomous driving, enabling online reinforcement learning through discrete linguistic decisions instead of continuous actions.
Details
Motivation: Current VLA paradigms in autonomous driving rely on Imitation Learning which suffers from distribution shift and causal confusion. Online RL could address these issues but faces inefficient exploration in continuous action spaces.Method: Uses a single LLM with two sets of LoRA parameters: one as Decision Expert for scenario reasoning and driving decisions, another as Action Expert mapping linguistic decisions to trajectories. Enables trial-and-error learning over discrete linguistic decisions by feeding trajectory-level rewards back to reasoning space.
Result: Achieves Driving Score of 78.04 and Success Rate of 55.09% on Bench2Drive benchmark using lightweight Qwen-0.5B LLM, demonstrating effectiveness of online RL for VLA models in autonomous driving.
Conclusion: MindDrive successfully balances optimal decision-making, human-like driving behavior, and efficient exploration in online RL by operating over discrete linguistic decisions rather than continuous action spaces.
Abstract: Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. To overcome this limitation, we propose MindDrive, a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One set configures the LLM as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. By feeding trajectory-level rewards back into the reasoning space, MindDrive enables trial-and-error learning over a finite set of discrete linguistic driving decisions, instead of operating directly in a continuous action space. This approach effectively balances optimal decision-making in complex scenarios, human-like driving behavior, and efficient exploration in online reinforcement learning. Using the lightweight Qwen-0.5B LLM, MindDrive achieves a Driving Score (DS) of 78.04 and a Success Rate (SR) of 55.09% on the challenging Bench2Drive benchmark. To the best of our knowledge, this is the first work to demonstrate the effectiveness of online reinforcement learning for VLA models in autonomous driving.
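The "one LLM, two experts" arrangement can be pictured as each linear layer carrying two LoRA adapters over frozen base weights, switched by name depending on whether the model is reasoning or emitting a trajectory. The sketch below shows one such layer; ranks, dimensions, and the switching scheme are assumptions, not MindDrive's implementation.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """One frozen base linear layer with two named low-rank adapters ("decision" / "action")."""
    def __init__(self, in_dim, out_dim, rank=8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        for p in self.base.parameters():
            p.requires_grad = False                   # backbone weights stay frozen
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(in_dim, rank, bias=False),
                                nn.Linear(rank, out_dim, bias=False))
            for name in ("decision", "action")
        })
        self.active = "decision"

    def forward(self, x):
        return self.base(x) + self.adapters[self.active](x)

layer = DualLoRALinear(64, 64)
layer.active = "action"   # route the same frozen backbone through the Action Expert adapter
```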
[236] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes
Yuncheng Lu, Yucen Shi, Aobo Li, Zehao Li, Junying Li, Bo Wang, Tony Tae-Hyoung Kim
Main category: cs.CV
TL;DR: Energy-efficient anti-UAV system using hybrid frame-based and event-driven tracking with custom NPU for low-power drone detection
Details
Motivation: Need for energy-efficient systems to detect small, fast-moving drones (UAVs) at long ranges, requiring robust tracking that balances accuracy with power consumptionMethod: Hybrid tracking system with frame-based and event-driven modes, adaptive switching based on object size/velocity, custom NPU with run-length encoding, zero-skipping MAC architecture, and Fast Object Tracking Unit with adaptive thresholding
Result: 98.2% recognition accuracy on UAV datasets (50-400m range, 5-80 pixels/sec), 96 pJ/frame/pixel, 61 pJ/event at 0.8V, 97% reduction in redundant neural computations
Conclusion: Demonstrates state-of-the-art energy efficiency for anti-UAV systems through hardware-software co-design of hybrid vision processing
Abstract: We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.
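The event-frame path starts from run-length encoding of binary event rows, which is simple enough to show directly; the (value, run-length) pair layout below is an illustrative assumption, since the chip's actual bitstream format is not described in the summary.

```python
import numpy as np

def rle_encode_row(row):
    """row: 1-D array of 0/1 events -> list of (value, run_length) pairs."""
    runs, start = [], 0
    for i in range(1, len(row) + 1):
        if i == len(row) or row[i] != row[start]:
            runs.append((int(row[start]), i - start))
            start = i
    return runs

frame_row = np.array([0, 0, 1, 1, 1, 0, 1, 0, 0, 0])
print(rle_encode_row(frame_row))   # [(0, 2), (1, 3), (0, 1), (1, 1), (0, 3)]
```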
[237] VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
Main category: cs.CV
TL;DR: VisionDirector is a training-free vision-language supervisor that improves image generation/editing for complex multi-goal prompts by extracting structured goals, dynamically planning edit trajectories, and using semantic verification with rollback.
Details
Motivation: Current generative models struggle with long, multi-goal prompts that professional designers use, often missing localized edits and failing to satisfy complex requirements spanning layout, object placement, typography, and logo fidelity.Method: VisionDirector extracts structured goals from long instructions, dynamically decides between one-shot generation vs staged edits, runs micro-grid sampling with semantic verification and rollback after every edit, and logs goal-level rewards. Fine-tuned with Group Relative Policy Optimization for shorter edit trajectories.
Result: Achieves new SOTA on GenEval (+7% overall) and ImgEdit (+0.07 absolute), reduces edit trajectories from 4.2 to 3.1 steps, and shows qualitative improvements on typography, multi-object scenes, and pose editing.
Conclusion: VisionDirector addresses the brittleness of current generative pipelines for complex multi-goal tasks through structured goal extraction and dynamic planning with verification, enabling better performance on real-world design prompts.
Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models’ performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
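Stripped of the model specifics, the closed loop is: apply one staged edit per extracted goal, verify it semantically, and roll back if verification fails. The sketch below captures that control flow; `edit_model` and `verifier` are placeholder callables, and the micro-grid sampling and reward logging are omitted.

```python
def closed_loop_edit(image, goals, edit_model, verifier, max_retries=2):
    """Apply one edit per goal, verify it, and keep the previous image when the check fails."""
    satisfied = []
    for goal in goals:                                # goals extracted from the long prompt
        for _ in range(max_retries + 1):
            candidate = edit_model(image, goal)
            if verifier(candidate, goal):             # semantic check of this single goal
                image = candidate                     # commit the edit
                satisfied.append(goal)
                break
            # otherwise roll back: keep `image` unchanged and retry (or give up on this goal)
    return image, satisfied
```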
[238] See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
Main category: cs.CV
TL;DR: Bi-directional Perceptual Shaping (BiPS) improves vision-language models by using bidirectional where-to-look signals to shape perception during training, enhancing visual reliance and reducing text-only shortcuts.
Details
Motivation: Current vision-language models often rely on intermediate visual cues but overlook fine-grained visual evidence, generalize poorly across domains, and incur high inference costs. There's a need for better mechanisms that enforce visual reliance and prevent text-only shortcuts.Method: BiPS transforms question-conditioned masked views into bidirectional signals: 1) KL-consistency constraint between original image and evidence-preserving view (keeping only question-relevant regions) for coarse but complete coverage, and 2) KL-separation constraint between original and evidence-ablated view (masking critical pixels) to discourage text-only shortcuts and enforce fine-grained visual reliance.
Result: BiPS boosts Qwen2.5-VL-7B by 8.2% on average across eight benchmarks and shows strong out-of-domain generalization to unseen datasets and image types.
Conclusion: BiPS effectively shapes perception in vision-language models by using bidirectional constraints to enhance visual reliance and reduce text-only reasoning, leading to improved performance and better generalization.
Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
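The two shaping terms are both KL divergences against the full-image prediction: pull the evidence-preserving view toward it, push the evidence-ablated view away from it. The sketch below writes them out; the hinge/margin form of the separation term and any loss weights are assumptions, and the exact objective is defined in the paper.

```python
import torch
import torch.nn.functional as F

def bips_losses(logits_full, logits_preserved, logits_ablated, margin=1.0):
    """Logits from the original image, the evidence-preserving view, and the evidence-ablated view."""
    p_full = F.softmax(logits_full, dim=-1).detach()
    # consistency: the masked-but-sufficient view should predict like the full image
    kl_consist = F.kl_div(F.log_softmax(logits_preserved, dim=-1), p_full, reduction="batchmean")
    # separation: once the critical pixels are removed, the prediction should diverge
    kl_sep = F.kl_div(F.log_softmax(logits_ablated, dim=-1), p_full, reduction="batchmean")
    return kl_consist + F.relu(margin - kl_sep)
```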
[239] Active Perception Agent for Omnimodal Audio-Video Understanding
Keda Tao, Wenjie Du, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang
Main category: cs.CV
TL;DR: OmniAgent is the first fully active perception agent that dynamically orchestrates specialized unimodal tools for fine-grained omnimodal reasoning, using audio-guided perception to localize temporal events and achieve state-of-the-art performance on audio-video understanding benchmarks.
Details
Motivation: Current omnimodal LLMs struggle with fine-grained cross-modal understanding and multimodal alignment, relying on rigid workflows and dense frame-captioning. There's a need for active multimodal inquiry rather than passive response generation.Method: OmniAgent employs dynamic planning to autonomously orchestrate tool invocation, strategically focusing perceptual attention on task-relevant cues. It introduces a novel coarse-to-fine audio-guided perception paradigm that uses audio cues to localize temporal events and guide subsequent reasoning.
Result: Extensive evaluations on three audio-video understanding benchmarks show OmniAgent achieves state-of-the-art performance, surpassing leading open-source and closed-source models by 10-20% accuracy without training.
Conclusion: OmniAgent represents a paradigm shift from passive response generation to active multimodal inquiry, demonstrating superior fine-grained omnimodal reasoning through dynamic tool orchestration and audio-guided perception.
Abstract: Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often face challenges in fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, to our best knowledge, the first fully active perception agent that dynamically orchestrates specialized unimodal tools to achieve more fine-grained omnimodal reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, we demonstrate a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and closed-source models by substantial margins of 10% - 20% accuracy without training.
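The coarse-to-fine audio-guided step amounts to letting audio event detection decide where to spend visual compute: localize candidate events on the audio track, then sample frames only around those timestamps. The sketch below shows that routing; `detect_audio_events` and `sample_frames` are placeholder callables, not OmniAgent's tools.

```python
def audio_guided_frames(video_path, detect_audio_events, sample_frames, window=2.0, fps=4):
    """Return video frames sampled densely only around audio-localized events."""
    events = detect_audio_events(video_path)          # e.g. [(t_start, t_end, "dog barking"), ...]
    frames = []
    for t_start, t_end, label in events:
        # densely sample only inside a small window around each audio-localized event
        frames.extend(sample_frames(video_path, max(0.0, t_start - window), t_end + window, fps))
    return frames
```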
[240] Deep Probabilistic Supervision for Image Classification
Anton Adelöw, Matteo Gamba, Atsuto Maki
Main category: cs.CV
TL;DR: DPS is a probabilistic supervision framework that constructs sample-specific target distributions via statistical inference on model predictions, improving accuracy, calibration, and robustness without hard targets.
Details
Motivation: Hard targets in supervised training promote overconfidence and limit calibration, generalization, and robustness. Existing self-distillation methods still rely on hard targets without explicitly modeling predictive uncertainty.Method: Deep Probabilistic Supervision (DPS) constructs sample-specific target distributions via statistical inference on the model’s own predictions, remaining independent of hard targets after initialization.
Result: DPS yields higher test accuracy (+2.0% for DenseNet-264 on ImageNet) and significantly lower Expected Calibration Error (-40% ResNet-50, CIFAR-100) than existing self-distillation methods. Combined with contrastive loss, achieves SOTA robustness under label noise.
Conclusion: DPS provides a principled framework for probabilistic supervision that improves model accuracy, calibration, and robustness by moving beyond hard targets and explicitly modeling predictive uncertainty.
Abstract: Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model’s own predictions, but often remain dependent on hard targets without explicitly modeling predictive uncertainty. With this in mind, we propose Deep Probabilistic Supervision (DPS), a principled learning framework constructing sample-specific target distributions via statistical inference on the model’s own predictions, remaining independent of hard targets after initialization. We show that DPS consistently yields higher test accuracy (e.g., +2.0% for DenseNet-264 on ImageNet) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing self-distillation methods. When combined with a contrastive loss, DPS achieves state-of-the-art robustness under label noise.
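As a rough illustration of supervision built from the model's own predictions rather than hard targets, the sketch below keeps one soft target distribution per training sample as a running average of past predictions and trains against it with soft-label cross-entropy. DPS derives its targets via statistical inference, which is not reproduced here, so every name and hyperparameter below is a placeholder.

```python
import torch
import torch.nn.functional as F

class SoftTargetBank:
    """One target distribution per training sample, updated from model outputs.

    Only an illustration of 'supervision from the model's own predictions';
    the statistical inference used by DPS is not reproduced here.
    """
    def __init__(self, num_samples, num_classes, momentum=0.9):
        self.targets = torch.full((num_samples, num_classes), 1.0 / num_classes)
        self.momentum = momentum

    def update(self, idx, probs):
        # Running average of softmax outputs for the indexed samples.
        self.targets[idx] = (self.momentum * self.targets[idx]
                             + (1 - self.momentum) * probs.detach().cpu())

    def loss(self, idx, logits):
        # Soft-label cross-entropy against the per-sample target distribution.
        target = self.targets[idx].to(logits.device)
        return F.cross_entropy(logits, target)
```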
[241] Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation
Takito Sawada, Akinori Iwata, Masahiro Okuda
Main category: cs.CV
TL;DR: Proposes a data-driven metric to quantify shape-texture balance in datasets and an efficient adaptation method using modified max-pooling dilation to promote shape bias in CNNs, particularly beneficial for shape-dominant data like illustrations and sketches.
Details
Motivation: CNNs have inherent texture bias that prioritizes local patterns over global shapes, which degrades performance on shape-dominant data like illustrations and sketches. Existing shape-biased models lack quantitative metrics to identify which datasets would benefit from such modifications.Method: 1) Proposes a data-driven metric using SSIM between an image’s luminance channel and its L0-smoothed counterpart to quantify shape-texture balance. 2) Introduces a computationally efficient adaptation method that modifies the dilation of max-pooling operations while keeping convolutional weights frozen, requiring training of only the final classification layer.
Result: Experimental results show consistent accuracy improvements on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical.
Conclusion: The proposed metric effectively identifies datasets that benefit from shape-biased adaptations, and the efficient adaptation method improves CNN performance on shape-dominant data without requiring extensive retraining.
Abstract: Convolutional Neural Networks (CNNs) exhibit a well-known texture bias, prioritizing local patterns over global shapes - a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this limitation, we propose a data-driven metric that quantifies the shape-texture balance within a dataset by computing the Structural Similarity Index (SSIM) between an image’s luminance (Y) channel and its L0-smoothed counterpart. Building on this metric, we introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results demonstrate consistent accuracy improvements on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
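A minimal sketch of both ingredients described above, assuming scikit-image and PyTorch are available and that an L0-smoothing routine (`l0_smooth`, not implemented here) is supplied externally; the kernel size, stride, and dilation values are illustrative choices, not the paper's settings.

```python
import torch.nn as nn
from skimage.color import rgb2ycbcr
from skimage.metrics import structural_similarity

def shape_texture_score(image_rgb, l0_smooth):
    """Illustrative shape-texture metric: SSIM between the luminance (Y)
    channel and its L0-smoothed counterpart.

    `l0_smooth` is assumed to be an externally provided L0 gradient
    minimization routine; a higher SSIM would suggest a more shape-dominant
    image under this reading of the paper.
    """
    y = rgb2ycbcr(image_rgb)[..., 0]
    y_smooth = l0_smooth(y)
    return structural_similarity(y, y_smooth, data_range=y.max() - y.min())

# Promoting shape bias by enlarging the dilation of max pooling while the
# convolutional weights stay frozen (only the classifier would be retrained).
dilated_pool = nn.MaxPool2d(kernel_size=3, stride=2, dilation=2)
```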
[242] Quantification and Classification of Carbon Nanotubes in Electron Micrographs using Vision Foundation Models
Sanjay Pradeep, Chen Wang, Matthew M. Dahm, Jeff D. Eldredge, Candace S. J. Tsai
Main category: cs.CV
TL;DR: A unified framework using vision foundation models (SAM and DINOv2) to automate quantification and classification of carbon nanotube morphologies in electron microscopy images with high accuracy.
Details
Motivation: Current workflows for characterizing carbon nanotube morphologies in electron microscopy images rely on slow, subjective manual segmentation, creating a bottleneck for exposure assessment and toxicological studies.Method: Two-stage approach: 1) Interactive quantification tool using Segment Anything Model (SAM) for accurate particle segmentation with minimal user input; 2) Classification pipeline using segmentation masks to spatially constrain DINOv2 vision transformer, extracting features exclusively from particle regions while suppressing background noise.
Result: Achieves 95.5% accuracy in distinguishing between four different CNT morphologies on a dataset of 1,800 TEM images, significantly outperforming current baselines with less training data. Enables instance-level processing to resolve mixed samples with distinct particle types in single field of view.
Conclusion: Integrating zero-shot segmentation with self-supervised feature learning enables high-throughput, reproducible nanomaterial analysis, transforming a labor-intensive bottleneck into a scalable, data-driven process.
Abstract: Accurate characterization of carbon nanotube morphologies in electron microscopy images is vital for exposure assessment and toxicological studies, yet current workflows rely on slow, subjective manual segmentation. This work presents a unified framework leveraging vision foundation models to automate the quantification and classification of CNTs in electron microscopy images. First, we introduce an interactive quantification tool built on the Segment Anything Model (SAM) that segments particles with near-perfect accuracy using minimal user input. Second, we propose a novel classification pipeline that utilizes these segmentation masks to spatially constrain a DINOv2 vision transformer, extracting features exclusively from particle regions while suppressing background noise. Evaluated on a dataset of 1,800 TEM images, this architecture achieves 95.5% accuracy in distinguishing between four different CNT morphologies, significantly outperforming the current baseline despite using a fraction of the training data. Crucially, this instance-level processing allows the framework to resolve mixed samples, correctly classifying distinct particle types co-existing within a single field of view. These results demonstrate that integrating zero-shot segmentation with self-supervised feature learning enables high-throughput, reproducible nanomaterial analysis, transforming a labor-intensive bottleneck into a scalable, data-driven process.
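The spatial-constraint idea, pooling ViT patch features only over a particle's mask so that background never contributes, can be sketched as below; the feature extractor (e.g., a frozen DINOv2) and the per-instance SAM mask are assumed inputs, and the exact pooling used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def masked_particle_embedding(patch_feats, mask, grid_hw):
    """Pool ViT patch features only over one particle's segmentation mask.

    patch_feats: [num_patches, dim] features from a frozen ViT (e.g. DINOv2);
    mask: [H, W] binary mask from SAM for one particle instance;
    grid_hw: (h, w) patch grid of the ViT. The extractor and mask source are
    assumptions; only the spatial-constraint idea is shown.
    """
    h, w = grid_hw
    # Downsample the pixel mask to the patch grid.
    m = F.interpolate(mask[None, None].float(), size=(h, w), mode="nearest")
    m = m.reshape(-1)  # [num_patches]
    if m.sum() == 0:
        return patch_feats.mean(dim=0)  # fall back to global pooling
    # Average only the patches overlapping the particle; background is ignored.
    return (patch_feats * m[:, None]).sum(dim=0) / m.sum()
```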
[243] Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification
Shu Shen, C. L. Philip Chen, Tong Zhang
Main category: cs.CV
TL;DR: TAHCD is a test-time adaptive hierarchical co-enhanced denoising network for reliable multimodal learning that addresses heterogeneous noise at both global and instance levels, with adaptive test-time updates for improved generalization.
Details
Motivation: Multimodal data (like multi-omics) often suffers from low-quality data induced by multimodal noise, which is critical in safety-critical applications like medical diagnosis. Existing methods struggle with heterogeneous data noise and have limited adaptability/generalization to unseen noise.Method: Proposes Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) with two key components: 1) Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to remove heterogeneous noise at global and instance levels, handling both modality-specific and cross-modality noise; 2) Test-Time Cooperative Enhancement that adaptively updates the model in response to input noise in a label-free manner.
Result: Experiments on multiple benchmarks show superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
Conclusion: TAHCD effectively addresses multimodal noise challenges through hierarchical denoising and test-time adaptation, achieving reliable multimodal learning particularly valuable for safety-critical applications.
Abstract: Reliable learning of multimodal data (e.g., multi-omics) is a widely recognized concern, especially in safety-critical applications such as medical diagnosis. However, low-quality data induced by multimodal noise poses a major challenge in this domain, causing existing methods to suffer from two key limitations. First, they struggle to handle heterogeneous data noise, hindering robust multimodal representation learning. Second, they exhibit limited adaptability and generalization when encountering previously unseen noise. To address these issues, we propose the Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD). On one hand, TAHCD introduces Adaptive Stable Subspace Alignment and Sample-Adaptive Confidence Alignment to reliably remove heterogeneous noise. These modules account for noise at both the global and instance levels and enable joint removal of modality-specific and cross-modality noise, achieving robust learning. On the other hand, TAHCD introduces Test-Time Cooperative Enhancement, which adaptively updates the model in response to input noise in a label-free manner, thus improving generalization. This is achieved by collaboratively enhancing the joint removal of modality-specific and cross-modality noise across global and instance levels according to sample noise. Experiments on multiple benchmarks demonstrate that the proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
[244] SIRR-LMM: Single-image Reflection Removal via Large Multimodal Model
Yu Guo, Zhiqiang Lao, Xiyun Song, Yubin Zhou, Heather Yu
Main category: cs.CV
TL;DR: A framework for single-image reflection removal using physically accurate synthetic data generation and Large Multimodal Model fine-tuning with LoRA for improved performance.
Details
Motivation: Glass surfaces create complex light interactions making reflection removal challenging, and existing datasets lack either physical realism or sufficient scale.Method: Path-tracing 3D glass models over real backgrounds to create physically accurate synthetic data, then using LMM with concatenated image layers, joint captioning, and task-specific LoRA fine-tuning.
Result: Achieves improved reflection removal and separation performance compared to state-of-the-art methods.
Conclusion: The combination of physically accurate synthetic data generation and efficient LMM fine-tuning enables better reflection removal than existing approaches.
Abstract: Glass surfaces create complex interactions of reflected and transmitted light, making single-image reflection removal (SIRR) challenging. Existing datasets suffer from limited physical realism in synthetic data or insufficient scale in real captures. We introduce a synthetic dataset generation framework that path-traces 3D glass models over real background imagery to create physically accurate reflection scenarios with varied glass properties, camera settings, and post-processing effects. To leverage the capabilities of Large Multimodal Model (LMM), we concatenate the image layers into a single composite input, apply joint captioning, and fine-tune the model using task-specific LoRA rather than full-parameter training. This enables our approach to achieve improved reflection removal and separation performance compared to state-of-the-art methods.
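For the fine-tuning side, a minimal sketch of task-specific LoRA using the `peft` library is shown below; the base LMM, target module names, rank, and scaling are placeholders rather than the paper's actual configuration.

```python
from peft import LoraConfig, get_peft_model

# Task-specific LoRA instead of full-parameter tuning, as described above.
# Module names and ranks are illustrative placeholders.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

def wrap_with_lora(base_lmm):
    """Freeze the backbone and train only the injected low-rank adapters."""
    model = get_peft_model(base_lmm, lora_cfg)
    model.print_trainable_parameters()
    return model
```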
[245] An Example for Domain Adaptation Using CycleGAN
Yanhua Zhao
Main category: cs.CV
TL;DR: CycleGAN applied to medical domain adaptation for unpaired image translation from microscopy to pseudo H&E stained histopathology images
Details
Motivation: To leverage CycleGAN’s domain adaptation capabilities for medical imaging, specifically for translating microscopy images to pseudo H&E stained histopathology images without requiring paired training data.Method: Uses Cycle-Consistent Adversarial Network (CycleGAN) architecture for unpaired image-to-image translation between microscopy and histopathology domains
Result: Demonstrates successful application of CycleGAN in medical imaging for domain adaptation between microscopy and histopathology image modalities
Conclusion: CycleGAN is effective for medical domain adaptation tasks, enabling unpaired image translation between different medical imaging modalities
Abstract: Cycle-Consistent Adversarial Network (CycleGAN) is very promising in domain adaptation. In this report, an example in the medical domain is explained. We present the structure of a CycleGAN model for unpaired image-to-image translation from microscopy to pseudo H&E stained histopathology images.
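For reference, a minimal sketch of the standard CycleGAN generator objective (adversarial plus cycle-consistency terms) for the microscopy-to-pseudo-H&E setting; the least-squares adversarial loss and the cycle weight are the usual defaults, not details taken from this report.

```python
import torch
import torch.nn as nn

def cyclegan_generator_loss(G, F_, D_x, D_y, real_x, real_y, lambda_cyc=10.0):
    """Core CycleGAN objective for unpaired translation.

    G: microscopy -> pseudo H&E generator, F_: H&E -> microscopy generator,
    D_x / D_y: discriminators on the two domains.
    """
    mse, l1 = nn.MSELoss(), nn.L1Loss()

    fake_y = G(real_x)            # microscopy translated to pseudo H&E
    fake_x = F_(real_y)           # H&E translated back toward microscopy
    pred_y, pred_x = D_y(fake_y), D_x(fake_x)
    # Least-squares adversarial loss: generators try to make fakes look real.
    adv = mse(pred_y, torch.ones_like(pred_y)) + mse(pred_x, torch.ones_like(pred_x))

    # Cycle consistency: translating there and back should reproduce the input.
    cyc = l1(F_(fake_y), real_x) + l1(G(fake_x), real_y)
    return adv + lambda_cyc * cyc
```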
[246] Near-Light Color Photometric Stereo for Mono-Chromatic Non-Lambertian Surfaces
Zonglin Li, Jieji Ren, Shuangfan Zhou, Heng Guo, Jinnuo Zhang, Jiang Zhou, Boxin Shi, Zhanyu Ma, Guoying Gu
Main category: cs.CV
TL;DR: Neural implicit framework for single-shot color photometric stereo using depth and BRDF modeling under mono-chromaticity assumption, validated with optical tactile sensor.
Details
Motivation: Existing color photometric stereo methods assume ideal distant lighting and Lambertian reflectance, limiting practical applications with near-light conditions and non-Lambertian surfaces. Need for robust single-shot surface reconstruction.Method: Proposes neural implicit representations for depth and BRDF modeling under mono-chromaticity assumption (uniform chromaticity, homogeneous material). Uses single image input and designs compact optical tactile sensor for validation.
Result: Achieves accurate and robust surface reconstruction on both synthetic and real-world datasets. Framework handles near-light conditions and non-Lambertian surfaces effectively.
Conclusion: The proposed neural implicit framework successfully addresses limitations of traditional color photometric stereo, enabling detailed surface recovery from single images under practical conditions.
Abstract: Color photometric stereo enables single-shot surface reconstruction, extending conventional photometric stereo that requires multiple images of a static scene under varying illumination to dynamic scenarios. However, most existing approaches assume ideal distant lighting and Lambertian reflectance, leaving more practical near-light conditions and non-Lambertian surfaces underexplored. To overcome this limitation, we propose a framework that leverages neural implicit representations for depth and BRDF modeling under the assumption of mono-chromaticity (uniform chromaticity and homogeneous material), which alleviates the inherent ill-posedness of color photometric stereo and allows for detailed surface recovery from just one image. Furthermore, we design a compact optical tactile sensor to validate our approach. Experiments on both synthetic and real-world datasets demonstrate that our method achieves accurate and robust surface reconstruction.
[247] Towards Visually Explaining Statistical Tests with Applications in Biomedical Imaging
Masoumeh Javanbakhat, Piotr Komorowski, Dilyara Bareeva, Wei-Chang Lai, Wojciech Samek, Christoph Lippert
Main category: cs.CV
TL;DR: An explainable deep statistical testing framework that provides sample-level and feature-level explanations for deep two-sample tests, making them interpretable for biomedical imaging analysis.
Details
Motivation: Deep neural two-sample tests have shown strong power for detecting distributional differences but lack interpretability, limiting practical adoption in biomedical analysis. Existing explainability methods require class labels, making them unsuitable for label-free statistical testing settings.Method: Proposes an explainable deep statistical testing framework that augments deep two-sample tests with sample-level and feature-level explanations. The method identifies which individual samples and which input features drive statistically significant group differences, providing spatial and instance-wise insights.
Result: The framework successfully identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation in biomedical imaging data. It provides both spatial and instance-wise insight into test decisions.
Conclusion: This work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging by making deep two-sample tests explainable at both sample and feature levels.
Abstract: Deep neural two-sample tests have recently shown strong power for detecting distributional differences between groups, yet their black-box nature limits interpretability and practical adoption in biomedical analysis. Moreover, most existing post-hoc explainability methods rely on class labels, making them unsuitable for label-free statistical testing settings. We propose an explainable deep statistical testing framework that augments deep two-sample tests with sample-level and feature-level explanations, revealing which individual samples and which input features drive statistically significant group differences. Our method highlights which image regions and which individual samples contribute most to the detected group difference, providing spatial and instance-wise insight into the test’s decision. Applied to biomedical imaging data, the proposed framework identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation. This work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging.
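One common deep two-sample test computes an MMD statistic on features from a frozen encoder; the sketch below decomposes a biased RBF-kernel MMD^2 estimate into per-sample contributions as one plausible route to sample-level explanations. It illustrates the general idea only and is not the paper's method.

```python
import torch

def mmd_sample_contributions(feat_x, feat_y, bandwidth=1.0):
    """Per-sample contributions to an RBF-kernel MMD^2 statistic.

    feat_x, feat_y: [n, d] and [m, d] features of the two groups from a frozen
    encoder. Splitting MMD^2 into per-sample terms is one simple way to rank
    influential samples; shown only as an illustration.
    """
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * bandwidth ** 2))

    kxx, kyy, kxy = k(feat_x, feat_x), k(feat_y, feat_y), k(feat_x, feat_y)
    # Contribution of each sample: similarity to its own group minus the other.
    contrib_x = kxx.mean(dim=1) - kxy.mean(dim=1)
    contrib_y = kyy.mean(dim=1) - kxy.mean(dim=0)
    mmd2 = kxx.mean() - 2 * kxy.mean() + kyy.mean()  # biased estimate
    return mmd2, contrib_x, contrib_y
```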
[248] An AI-enabled tool for quantifying overlapping red blood cell sickling dynamics in microfluidic assays
Nikhil Kadivar, Guansheng Li, Jianlu Zheng, Ming Dao, George Em Karniadakis, Mengjia Xu
Main category: cs.CV
TL;DR: AI-driven framework for automated analysis of red blood cell morphology in sickle cell disease using deep learning segmentation and classification in microscopy data.
Details
Motivation: Need for accurate identification of morphological transitions in sickle cell dynamics under diverse biophysical conditions, especially in densely packed and overlapping cell populations, to improve experimental throughput and therapeutic assessment.Method: Automated deep learning framework integrating AI-assisted annotation via Roboflow, nnU-Net segmentation model training, watershed algorithm for resolving overlapping cells, and instance counting for quantifying RBC populations across varying density regimes.
Result: High segmentation performance with limited labeled data, ability to track temporal evolution of sickle cell fraction, doubled experimental throughput via densely packed suspensions, capture of drug-dependent sickling behavior, and revelation of mechanobiological signatures.
Conclusion: Establishes scalable and reproducible computational platform for investigating cellular biomechanics and assessing therapeutic efficacy in microphysiological systems through AI-driven analysis of RBC morphology.
Abstract: Understanding sickle cell dynamics requires accurate identification of morphological transitions under diverse biophysical conditions, particularly in densely packed and overlapping cell populations. Here, we present an automated deep learning framework that integrates AI-assisted annotation, segmentation, classification, and instance counting to quantify red blood cell (RBC) populations across varying density regimes in time-lapse microscopy data. Experimental images were annotated using the Roboflow platform to generate labeled dataset for training an nnU-Net segmentation model. The trained network enables prediction of the temporal evolution of the sickle cell fraction, while a watershed algorithm resolves overlapping cells to enhance quantification accuracy. Despite requiring only a limited amount of labeled data for training, the framework achieves high segmentation performance, effectively addressing challenges associated with scarce manual annotations and cell overlap. By quantitatively tracking dynamic changes in RBC morphology, this approach can more than double the experimental throughput via densely packed cell suspensions, capture drug-dependent sickling behavior, and reveal distinct mechanobiological signatures of cellular morphological evolution. Overall, this AI-driven framework establishes a scalable and reproducible computational platform for investigating cellular biomechanics and assessing therapeutic efficacy in microphysiological systems.
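The overlap-resolution step can be illustrated with a standard marker-controlled watershed on the distance transform of the predicted mask; the nnU-Net producing the mask and the parameter values below are assumptions, not the paper's exact settings.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_overlapping_cells(binary_mask, min_distance=7):
    """Separate touching/overlapping cells in a binary segmentation mask.

    Standard marker-controlled watershed on the Euclidean distance transform;
    `binary_mask` is assumed to come from a trained segmentation model.
    """
    distance = ndi.distance_transform_edt(binary_mask)
    # Local maxima of the distance map serve as one marker per cell.
    peaks = peak_local_max(distance, min_distance=min_distance,
                           labels=binary_mask.astype(int))
    markers = np.zeros_like(binary_mask, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    labels = watershed(-distance, markers, mask=binary_mask)
    return labels, labels.max()  # instance labels and cell count
```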
[249] One-step Latent-free Image Generation with Pixel Mean Flows
Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He
Main category: cs.CV
TL;DR: Pixel MeanFlow (pMF) enables one-step latent-free image generation using diffusion/flow models by separating network output space and loss space, achieving strong results on ImageNet at 256x256 and 512x512 resolutions.
Details
Motivation: Current diffusion/flow models for image generation require multi-step sampling and operate in latent spaces. The goal is to achieve one-step generation without latents by addressing both aspects simultaneously.Method: Proposes pixel MeanFlow (pMF) with separate network output space and loss space formulation. Network targets low-dimensional image manifold (x-prediction) while loss is defined via MeanFlow in velocity space, with transformation between image manifold and average velocity field.
Result: Achieves 2.22 FID on ImageNet 256x256 and 2.48 FID on ImageNet 512x512 for one-step latent-free generation, filling a key gap in this research regime.
Conclusion: pMF advances diffusion/flow-based generative models toward one-step latent-free generation, demonstrating strong performance and providing a foundation for further research in this direction.
Abstract: Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose “pixel MeanFlow” (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
[250] PEAR: Pixel-aligned Expressive humAn mesh Recovery
Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu, Jingyi Li, Yu Li
Main category: cs.CV
TL;DR: PEAR is a fast, robust framework for pixel-aligned expressive human mesh recovery from single images, achieving real-time SMPLX parameter inference at 100+ FPS with improved fine-grained detail accuracy.
Details
Motivation: Existing SMPLX-based methods for 3D human mesh reconstruction suffer from slow inference, produce only coarse body poses, and exhibit misalignments/artifacts in fine-grained regions like face and hands, limiting practical applications.Method: Uses a clean unified ViT-based model for coarse 3D geometry recovery, adds pixel-level supervision to optimize geometry for fine-grained details, and employs modular data annotation strategy to enrich training data and enhance robustness.
Result: Achieves over 100 FPS inference speed, substantial improvements in pose estimation accuracy on multiple benchmarks compared to previous SMPLX-based approaches, and better reconstruction of fine-grained human details.
Conclusion: PEAR provides a fast, robust preprocessing-free framework for expressive human mesh recovery that addresses key limitations of existing methods while maintaining real-time performance.
Abstract: Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHMs (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: https://wujh2001.github.io/PEAR
[251] Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency
Alexander Blezinger, Wolfgang Nejdl, Ming Tang
Main category: cs.CV
TL;DR: Evaluation of histopathology foundation models for regression-based biomarker prediction (HRD scores) showing superior performance over contrastive learning baselines across multiple cancer types.
Details
Motivation: While foundation models pretrained on large-scale histopathology data have shown success in various computational pathology tasks, their impact on regressive biomarker prediction remains underexplored. The authors aim to systematically evaluate these models for regression tasks, specifically for predicting homologous recombination deficiency (HRD) scores which are critical for personalized cancer treatment.Method: Used multiple instance learning frameworks to extract patch-level features from whole slide images using five state-of-the-art foundation models. Compared these against contrastive learning-based features. Trained models to predict continuous HRD scores across breast, endometrial, and lung cancer cohorts from two public medical datasets. Proposed distribution-based upsampling to address target imbalance. Conducted ablation studies on sampling strategies and instance bag sizes.
Result: Foundation model features consistently outperformed baseline contrastive learning features in predictive accuracy and generalization capabilities. Systematic differences were observed among different foundation models. The proposed upsampling strategy significantly improved recall and balanced accuracy for underrepresented but clinically important patient populations.
Conclusion: Large-scale histopathological pretraining provides benefits for more precise and transferable regressive biomarker prediction, demonstrating potential to advance AI-driven precision oncology. The work highlights the value of foundation models in computational pathology regression tasks.
Abstract: Foundation models pretrained on large-scale histopathology data have found great success in various fields of computational pathology, but their impact on regressive biomarker prediction remains underexplored. In this work, we systematically evaluate histopathological foundation models for regression-based tasks, demonstrated through the prediction of homologous recombination deficiency (HRD) score - a critical biomarker for personalized cancer treatment. Within multiple instance learning frameworks, we extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models, and evaluate their impact compared to contrastive learning-based features. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts from two public medical data collections. Extensive experiments demonstrate that models trained on foundation model features consistently outperform the baseline in terms of predictive accuracy and generalization capabilities while exhibiting systematic differences among the foundation models. Additionally, we propose a distribution-based upsampling strategy to mitigate target imbalance in these datasets, significantly improving the recall and balanced accuracy for underrepresented but clinically important patient populations. Furthermore, we investigate the impact of different sampling strategies and instance bag sizes through ablation studies. Our results highlight the benefits of large-scale histopathological pretraining for more precise and transferable regressive biomarker prediction, showcasing its potential to advance AI-driven precision oncology.
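A minimal sketch of distribution-based upsampling for a continuous target, implemented here as inverse-frequency weighting over quantile bins fed to a PyTorch WeightedRandomSampler; the bin count and the weighting scheme are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def hrd_balanced_sampler(hrd_scores, num_bins=10):
    """Oversample slides whose continuous HRD score falls in rare ranges.

    hrd_scores: per-slide HRD values in the training set. Samples in
    low-frequency quantile bins receive proportionally higher weight, so
    underrepresented score ranges are drawn more often during training.
    """
    scores = np.asarray(hrd_scores, dtype=float)
    edges = np.quantile(scores, np.linspace(0, 1, num_bins + 1))
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, num_bins - 1)
    counts = np.bincount(bin_ids, minlength=num_bins).astype(float)
    weights = 1.0 / counts[bin_ids]  # inverse bin frequency per sample
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=len(scores), replacement=True)
```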
[252] Robust automatic brain vessel segmentation in 3D CTA scans using dynamic 4D-CTA data
Alberto Mario Ceballos-Arroyo, Shrikanth M. Yadav, Chu-Hsuan Lin, Jisoo Kim, Geoffrey S. Young, Lei Qin, Huaizu Jiang
Main category: cs.CV
TL;DR: A novel methodology for brain vasculature annotation using dynamic 4D-CTA scans that enhances vessel visualization and trains robust deep learning models for vessel segmentation.
Details
Motivation: To reduce manual annotation effort for brain vessels and create robust segmentation models by leveraging multiple time points from dynamic CTA acquisitions.Method: Uses dynamic 4D-CTA head scans with multiple time points to subtract bone/soft tissue for enhanced vessel visualization. Trains deep learning models using same segmentation across multiple phases, effectively expanding dataset 4-5x and inducing contrast phase robustness.
Result: Achieved significantly better segmentations than models trained on comparable datasets: mDC of 0.846 for arteries and 0.957 for veins, with low error margins (adHD 0.304 mm for arteries, 0.078 mm for veins) and high topology sensitivity (0.877 for arteries, 0.974 for veins).
Conclusion: The methodology successfully creates robust brain vessel segmentation models with excellent accuracy in capturing vessel morphology, reducing manual annotation effort while improving performance.
Abstract: In this study, we develop a novel methodology for annotating the brain vasculature using dynamic 4D-CTA head scans. By using multiple time points from dynamic CTA acquisitions, we subtract bone and soft tissue to enhance the visualization of arteries and veins, reducing the effort required to obtain manual annotations of brain vessels. We then train deep learning models on our ground truth annotations by using the same segmentation for multiple phases from the dynamic 4D-CTA collection, effectively enlarging our dataset by 4 to 5 times and inducing robustness to contrast phases. In total, our dataset comprises 110 training images from 25 patients and 165 test images from 14 patients. In comparison with two similarly-sized datasets for CTA-based brain vessel segmentation, an nnU-Net model trained on our dataset can achieve significantly better segmentations across all vascular regions, with an average mDC of 0.846 for arteries and 0.957 for veins in the TopBrain dataset. Furthermore, metrics such as average directed Hausdorff distance (adHD) and topology sensitivity (tSens) reflected similar trends: using our dataset resulted in low error margins (adHD of 0.304 mm for arteries and 0.078 mm for veins) and high sensitivity (tSens of 0.877 for arteries and 0.974 for veins), indicating excellent accuracy in capturing vessel morphology. Our code and model weights are available online at https://github.com/alceballosa/robust-vessel-segmentation
[253] Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition
Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang
Main category: cs.CV
TL;DR: A training-free Visual Place Recognition framework using second-order geometric statistics on SPD manifolds to capture scene structure without supervision.
Details
Motivation: Current VPR methods either need extensive supervised data or use simplistic first-order statistics, failing to capture intrinsic structural correlations and geometric stability needed for robustness to environmental and viewpoint changes.Method: Proposes a Second-Order Geometric Statistics framework that represents scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations are modeled as congruence transformations. Uses geometry-aware Riemannian mappings to project these descriptors into linearized Euclidean embeddings, decoupling signal structure from noise. The approach is training-free and built on fixed pre-trained backbones.
Result: Achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios without any parameter updates.
Conclusion: The second-order geometric statistics framework provides an effective, training-free solution for robust visual place recognition that captures intrinsic structural correlations and demonstrates strong zero-shot generalization capabilities.
Abstract: Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.
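The training-free descriptor can be pictured as a covariance matrix of local backbone features mapped off the SPD manifold with a matrix logarithm (a generic log-Euclidean mapping); the sketch below omits the backbone and any whitening, and the regularization constant is an assumption.

```python
import numpy as np
from scipy.linalg import logm

def log_euclidean_descriptor(local_feats, eps=1e-5):
    """Scene descriptor as a covariance matrix linearized off the SPD manifold.

    local_feats: [n, d] local features from a frozen pre-trained backbone.
    The matrix logarithm maps the SPD manifold to a flat space so descriptors
    can be compared with ordinary Euclidean distance.
    """
    x = local_feats - local_feats.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / (len(x) - 1) + eps * np.eye(x.shape[1])  # keep it SPD
    log_cov = logm(cov).real
    # Upper-triangular vectorization; off-diagonals appear twice in the matrix,
    # so they are weighted by sqrt(2) to preserve the Frobenius norm.
    iu = np.triu_indices(log_cov.shape[0])
    scale = np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))
    vec = log_cov[iu] * scale
    return vec / np.linalg.norm(vec)
```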
[254] GMAC: Global Multi-View Constraint for Automatic Multi-Camera Extrinsic Calibration
Chentian Sun
Main category: cs.CV
TL;DR: GMAC is a multi-camera extrinsic calibration framework that uses implicit geometric representations from multi-view reconstruction networks to estimate camera extrinsics without explicit 3D reconstruction or manual calibration.
Details
Motivation: Existing multi-camera calibration methods rely on calibration targets, explicit geometric modeling, or task-specific neural networks, which lack robustness and applicability in complex dynamic environments or online scenarios, making practical deployment difficult.Method: GMAC models extrinsics as global variables constrained by latent multi-view geometric structure, prunes and reconfigures existing networks to use their latent features for extrinsic prediction via a lightweight regression head, and jointly optimizes cross-view reprojection consistency and multi-view cycle consistency.
Result: Experiments on synthetic and real-world multi-camera datasets show GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration.
Conclusion: GMAC provides a new solution for efficient deployment and online calibration of multi-camera systems by leveraging implicit geometric representations from existing networks.
Abstract: Automatic calibration of multi-camera systems, namely the accurate estimation of spatial extrinsic parameters, is fundamental for 3D reconstruction, panoramic perception, and multi-view data fusion. Existing methods typically rely on calibration targets, explicit geometric modeling, or task-specific neural networks. Such approaches often exhibit limited robustness and applicability in complex dynamic environments or online scenarios, making them difficult to deploy in practical applications. To address this, this paper proposes GMAC, a multi-camera extrinsic estimation framework based on the implicit geometric representations learned by multi-view reconstruction networks. GMAC models extrinsics as global variables constrained by the latent multi-view geometric structure and prunes and structurally reconfigures existing networks so that their latent features can directly support extrinsic prediction through a lightweight regression head, without requiring a completely new network design. Furthermore, GMAC jointly optimizes cross-view reprojection consistency and multi-view cycle consistency, ensuring geometric coherence across cameras while improving prediction accuracy and optimization stability. Experiments on both synthetic and real-world multi-camera datasets demonstrate that GMAC achieves accurate and stable extrinsic estimation without explicit 3D reconstruction or manual calibration, providing a new solution for efficient deployment and online calibration of multi-camera systems.
[255] FUSE-Flow: Scalable Real-Time Multi-View Point Cloud Reconstruction Using Confidence
Chentian Sun
Main category: cs.CV
TL;DR: FUSE-Flow: A real-time, stateless point cloud reconstruction framework for multi-view systems using adaptive spatial hashing and weighted fusion for high-quality 3D reconstruction with linear scalability.
Details
Motivation: Real-time multi-view point cloud reconstruction is crucial for VR/AR, robotics, and digital twins, but existing methods struggle with computational complexity, memory usage, and scalability while maintaining quality under real-time constraints.Method: Frame-wise stateless framework where each frame generates point cloud fragments fused via measurement confidence and 3D distance consistency weights. Uses adaptive spatial hashing-based weighted aggregation that partitions 3D space by local density, selects representative points per cell, and performs weighted fusion for sparse/dense regions with GPU parallelization.
Result: Achieves high-throughput, low-latency point cloud generation with linear complexity, improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes while maintaining real-time frame rates on modern GPUs.
Conclusion: FUSE-Flow effectively addresses real-time multi-view point cloud reconstruction challenges with linear scalability, demonstrating robustness and practical applicability for immersive perception applications.
Abstract: Real-time multi-view point cloud reconstruction is a core problem in 3D vision and immersive perception, with wide applications in VR, AR, robotic navigation, digital twins, and computer interaction. Despite advances in multi-camera systems and high-resolution depth sensors, fusing large-scale multi-view depth observations into high-quality point clouds under strict real-time constraints remains challenging. Existing methods relying on voxel-based fusion, temporal accumulation, or global optimization suffer from high computational complexity, excessive memory usage, and limited scalability, failing to simultaneously achieve real-time performance, reconstruction quality, and multi-camera extensibility. We propose FUSE-Flow, a frame-wise, stateless, and linearly scalable point cloud streaming reconstruction framework. Each frame independently generates point cloud fragments, fused via two weights, measurement confidence and 3D distance consistency, to suppress noise while preserving geometric details. For large-scale multi-camera efficiency, we introduce an adaptive spatial hashing-based weighted aggregation method: 3D space is adaptively partitioned by local point cloud density, representative points are selected per cell, and weighted fusion is performed to handle both sparse and dense regions. With GPU parallelization, FUSE-Flow achieves high-throughput, low-latency point cloud generation and fusion with linear complexity. Experiments demonstrate that the framework improves reconstruction stability and geometric fidelity in overlapping, depth-discontinuous, and dynamic scenes, while maintaining real-time frame rates on modern GPUs, verifying its effectiveness, robustness, and scalability.
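A minimal sketch of confidence-weighted fusion over a fixed-size spatial hash grid; FUSE-Flow's adaptive, density-driven cell sizing and its 3D-distance consistency weight are left out, and the cell size below is a placeholder.

```python
import numpy as np

def hash_fuse(points, confidences, cell_size=0.01):
    """Confidence-weighted fusion of one point-cloud fragment via spatial hashing.

    points: [n, 3] 3D points, confidences: [n] per-point measurement confidence.
    Each point is hashed to a voxel cell; cells keep a confidence-weighted
    running mean, which suppresses noisy measurements.
    """
    keys = np.floor(points / cell_size).astype(np.int64)
    cells = {}
    for key, p, c in zip(map(tuple, keys), points, confidences):
        acc = cells.setdefault(key, [np.zeros(3), 0.0])
        acc[0] += c * p      # weighted position sum
        acc[1] += c          # total confidence in this cell
    fused = np.array([s / w for s, w in cells.values() if w > 0])
    return fused  # one representative point per occupied cell
```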
[256] SharpTimeGS: Sharp and Stable Dynamic Gaussian Splatting via Lifespan Modulation
Zhanfeng Liao, Jiajun Zhang, Hanzhang Tu, Zhixi Wang, Yunqi Gao, Hongwen Zhang, Yebin Liu
Main category: cs.CV
TL;DR: SharpTimeGS introduces a lifespan-aware 4D Gaussian framework for dynamic scene novel view synthesis with temporally adaptive modeling of static and dynamic regions using learnable lifespan parameters and lifespan-velocity-aware densification.
Details
Motivation: Existing Gaussian-based methods struggle to balance long-term static and short-term dynamic regions in both representation and optimization for 4D reconstruction and novel view synthesis of dynamic scenes.Method: Introduces learnable lifespan parameters that reformulate temporal visibility from Gaussian-shaped decay to flat-top profiles, modulates motion based on lifespan, and uses lifespan-velocity-aware densification to allocate capacity appropriately between static and dynamic regions.
Result: Achieves state-of-the-art performance on multiple benchmarks while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.
Conclusion: SharpTimeGS provides an effective unified representation for dynamic scene reconstruction that balances static and dynamic regions through lifespan-aware modeling, enabling high-quality real-time novel view synthesis.
Abstract: Novel view synthesis of dynamic scenes is fundamental to achieving photorealistic 4D reconstruction and immersive visual experiences. Recent progress in Gaussian-based representations has significantly improved real-time rendering quality, yet existing methods still struggle to maintain a balance between long-term static and short-term dynamic regions in both representation and optimization. To address this, we present SharpTimeGS, a lifespan-aware 4D Gaussian framework that achieves temporally adaptive modeling of both static and dynamic regions under a unified representation. Specifically, we introduce a learnable lifespan parameter that reformulates temporal visibility from a Gaussian-shaped decay into a flat-top profile, allowing primitives to remain consistently active over their intended duration and avoiding redundant densification. In addition, the learned lifespan modulates each primitive’s motion, reducing drift in long-lived static points while retaining unrestricted motion for short-lived dynamic ones. This effectively decouples motion magnitude from temporal duration, improving long-term stability without compromising dynamic fidelity. Moreover, we design a lifespan-velocity-aware densification strategy that mitigates optimization imbalance between static and dynamic regions by allocating more capacity to regions with pronounced motion while keeping static areas compact and stable. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance while supporting real-time rendering up to 4K resolution at 100 FPS on one RTX 4090.
[257] HP-GAN: Harnessing pretrained networks for GAN improvement with FakeTwins and discriminator consistency
Geonhui Son, Jeong Ryong Lee, Dosik Hwang
Main category: cs.CV
TL;DR: HP-GAN improves GAN image synthesis by leveraging pretrained networks with self-supervised learning (FakeTwins) and enforcing consistency between CNN and ViT-based discriminators.
Details
Motivation: To enhance GAN image synthesis quality and diversity by better exploiting neural network priors from pretrained models through self-supervised learning and multi-discriminator consistency.Method: Proposes HP-GAN with two key components: 1) FakeTwins - uses pretrained networks as encoders to compute self-supervised loss applied through generated images to train generator, 2) Discriminator consistency - aligns assessments between CNN and ViT feature network discriminators to promote coherent learning.
Result: Extensive evaluation across 17 datasets shows HP-GAN consistently outperforms state-of-the-art methods in FID scores, achieving significant improvements in image diversity and quality across various data scenarios.
Conclusion: HP-GAN effectively leverages neural network priors through self-supervised learning and discriminator consistency to advance GAN-based image synthesis quality and diversity.
Abstract: Generative Adversarial Networks (GANs) have made significant progress in enhancing the quality of image synthesis. Recent methods frequently leverage pretrained networks to calculate perceptual losses or utilize pretrained feature spaces. In this paper, we extend the capabilities of pretrained networks by incorporating innovative self-supervised learning techniques and enforcing consistency between discriminators during GAN training. Our proposed method, named HP-GAN, effectively exploits neural network priors through two primary strategies: FakeTwins and discriminator consistency. FakeTwins leverages pretrained networks as encoders to compute a self-supervised loss and applies this through the generated images to train the generator, thereby enabling the generation of more diverse and high-quality images. Additionally, we introduce a consistency mechanism between discriminators that evaluate feature maps extracted from Convolutional Neural Network (CNN) and Vision Transformer (ViT) feature networks. Discriminator consistency promotes coherent learning among discriminators and enhances training robustness by aligning their assessments of image quality. Our extensive evaluation across seventeen datasets, including scenarios with large, small, and limited data and covering a variety of image domains, demonstrates that HP-GAN consistently outperforms current state-of-the-art methods in terms of Fréchet Inception Distance (FID), achieving significant improvements in image diversity and quality. Code is available at: https://github.com/higun2/HP-GAN.
[258] JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models
Hiroshi Sasaki
Main category: cs.CV
TL;DR: JSynFlow is a synthesized visual QA dataset for Japanese flowcharts generated using LLMs, comprising task descriptions, flowchart images from DSL code, and QA pairs to improve VLM performance on flowchart understanding tasks.
Details
Motivation: Flowchart understanding is crucial for analyzing complex documents, but creating large-scale datasets of flowchart images with corresponding text is time-consuming. There's a need for efficient ways to develop VLMs with precise flowchart interpretation capabilities.Method: Uses large language models to synthesize a visual QA dataset for Japanese flowcharts. The dataset includes task descriptions for business occupations, flowchart images rendered from domain-specific language code, and related QA pairs.
Result: Fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. The dataset is publicly available on Hugging Face.
Conclusion: JSynFlow provides an efficient solution for training VLMs on flowchart understanding tasks, addressing the data scarcity problem through LLM-based synthesis.
Abstract: Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset’s synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.
[259] TrajVG: 3D Trajectory-Coupled Visual Geometry Learning
Xingyu Miao, Weiguang Zhao, Tao Lu, Linning Xu, Mulin Yu, Yang Long, Jiangmiao Pang, Junting Dong
Main category: cs.CV
TL;DR: TrajVG is a 3D reconstruction framework that explicitly predicts cross-frame 3D correspondences via camera-coordinate 3D trajectories, addressing motion degradation in videos through geometric consistency objectives and unified training with mixed supervision.
Details
Motivation: Feed-forward multi-frame 3D reconstruction models degrade on videos with object motion due to ambiguous global references and drifting local pointmaps that cause cross-frame misalignment and duplicated structures.Method: Proposes TrajVG framework that estimates camera-coordinate 3D trajectories to establish explicit cross-frame 3D correspondence. Couples sparse trajectories, per-frame local point maps, and relative camera poses with: (1) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (2) pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. Uses unified training with mixed supervision using pseudo 2D tracks when 3D trajectory labels are scarce.
Result: Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show TrajVG surpasses current feedforward performance baselines.
Conclusion: TrajVG effectively addresses motion degradation in 3D video reconstruction by making cross-frame 3D correspondence an explicit prediction through 3D trajectory estimation and geometric consistency constraints.
Abstract: Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. Global-reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feedforward performance baseline.
[260] PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen
Main category: cs.CV
TL;DR: PIO-FVLM is a training-free method for accelerating vision-language models by compressing visual tokens based on their importance to preserving output results, achieving significant speedups with minimal performance loss.
Details
Motivation: Existing visual token compression methods rely on heuristics based on similarity metrics, which have limitations in compression performance and practical deployment. The authors propose a new perspective focusing on preserving inference output invariance.Method: Uses token-level gradient saliency from a layer-local proxy loss to reorder vision tokens by importance, then selects the most valuable tokens using non-maximum suppression. The method is training-free, compatible with FlashAttention, and can be deployed as encoder-free or combined with encoder compression approaches.
Result: On LLaVA-Next-7B, retains only 11.1% of visual tokens while maintaining 97.2% of original performance, with 2.67× prefill speedup, 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV Cache overhead.
Conclusion: PIO-FVLM provides an effective, practical approach for accelerating vision-language models by focusing on preserving output invariance rather than similarity heuristics, with significant efficiency gains and minimal performance degradation.
Abstract: Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specially, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.
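The core selection step, ranking visual tokens by gradient saliency of a layer-local proxy loss and keeping the top fraction, might look roughly like the sketch below; the proxy loss is passed in as a stand-in, the NMS stage is omitted, and the keep ratio merely mirrors the 11.1% figure reported above.

```python
import torch

def saliency_ranked_tokens(vision_tokens, proxy_loss_fn, keep_ratio=0.111):
    """Rank visual tokens by gradient saliency and keep only the top ones.

    vision_tokens: [n, d] visual embeddings entering an LLM layer;
    proxy_loss_fn: scalar loss computed from the current layer's output,
    a stand-in for the paper's layer-local proxy loss. The NMS selection
    that PIO-FVLM applies on top of this ranking is omitted here.
    """
    tokens = vision_tokens.detach().requires_grad_(True)
    loss = proxy_loss_fn(tokens)
    (grad,) = torch.autograd.grad(loss, tokens)
    saliency = (grad * tokens).abs().sum(dim=-1)   # token-level importance
    k = max(1, int(keep_ratio * len(tokens)))
    keep = saliency.topk(k).indices.sort().values  # preserve original order
    return vision_tokens[keep]
```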
cs.AI
[261] Artificial Intelligence as Strange Intelligence: Against Linear Models of Intelligence
Kendra Chilson, Eric Schwitzgebel
Main category: cs.AI
TL;DR: The paper critiques the linear model of AI progress and introduces concepts of “familiar intelligence” (human-like) vs “strange intelligence” (AI-specific), arguing AI will exhibit superhuman capabilities in some domains with surprising failures in others, requiring a nonlinear model of intelligence.
Details
Motivation: The authors challenge the prevailing linear model of AI progress that assumes intelligence progresses uniformly across domains. They argue this model fails to capture how AI systems will likely develop - exhibiting "strange intelligence" with uneven capabilities across different domains rather than uniform human-like progression.Method: The paper develops a conceptual framework by introducing two novel concepts: “familiar intelligence” (human-like patterns of ability/inability) and “strange intelligence” (AI-specific patterns). They propose a nonlinear model of intelligence where general intelligence is defined as the ability to achieve broad goals in diverse environments, not reducible to a single linear quantity.
Result: The analysis shows that AI intelligence will likely be “strange” - combining superhuman capacities in some domains with subhuman performance in others, and sometimes making surprising errors that few humans would make even in domains where they otherwise excel.
Conclusion: The nonlinear model has implications for AI evaluation: (1) even the most capable systems will sometimes fail at seemingly obvious tasks, (2) such errors don’t demonstrate lack of general intelligence, and (3) excellent performance on one task type (like IQ tests) doesn’t warrant assumptions of broad capacities beyond that domain.
Abstract: We endorse and expand upon Susan Schneider’s critique of the linear model of AI progress and introduce two novel concepts: “familiar intelligence” and “strange intelligence”. AI intelligence is likely to be strange intelligence, defying familiar patterns of ability and inability, combining superhuman capacities in some domains with subhuman performance in other domains, and even within domains sometimes combining superhuman insight with surprising errors that few humans would make. We develop and defend a nonlinear model of intelligence on which “general intelligence” is not a unified capacity but instead the ability to achieve a broad range of goals in a broad range of environments, in a manner that defies nonarbitrary reduction to a single linear quantity. We conclude with implications for adversarial testing approaches to evaluating AI capacities. If AI is strange intelligence, we should expect that even the most capable systems will sometimes fail in seemingly obvious tasks. On a nonlinear model of AI intelligence, such errors on their own do not demonstrate a system’s lack of outstanding general intelligence. Conversely, excellent performance on one type of task, such as an IQ test, cannot warrant assumptions of broad capacities beyond that task domain.
[262] DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search
Zhanli Li, Huiwen Tian, Lvzhou Luo, Yixuan Cao, Ping Luo
Main category: cs.AI
TL;DR: DeepRead is a structure-aware multi-turn document reasoning agent that preserves document hierarchy for better long-document QA through coordinated retrieval and reading tools.
Details
Motivation: Existing agentic search frameworks treat long documents as flat collections of chunks, underutilizing document-native priors like hierarchical organization and sequential discourse structure, which limits their effectiveness for long-document question answering.Method: DeepRead converts PDFs to structured Markdown preserving headings/paragraph boundaries, indexes at paragraph level with coordinate-style metadata encoding section identity and order, and provides two LLM tools: Retrieve (localizes relevant paragraphs with structural context) and ReadSection (enables contiguous, order-preserving reading within specified sections).
Result: DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering, with validated synergistic effects between retrieval and reading tools, and demonstrates human-like “locate then read” behavior.
Conclusion: Explicitly operationalizing document structural priors through coordinated retrieval and reading tools enables more effective long-document reasoning, resembling human reading patterns and outperforming flat-chunk approaches.
Abstract: With the rapid progress of tool-using and agentic large language models (LLMs), Retrieval-Augmented Generation (RAG) is evolving from one-shot, passive retrieval into multi-turn, decision-driven evidence acquisition. Despite strong results in open-domain settings, existing agentic search frameworks commonly treat long documents as flat collections of chunks, underutilizing document-native priors such as hierarchical organization and sequential discourse structure. We introduce DeepRead, a structure-aware, multi-turn document reasoning agent that explicitly operationalizes these priors for long-document question answering. DeepRead leverages an LLM-based OCR model to convert PDFs into structured Markdown that preserves headings and paragraph boundaries. It then indexes documents at the paragraph level and assigns each paragraph a coordinate-style metadata key encoding its section identity and in-section order. Building on this representation, DeepRead equips the LLM with two complementary tools: a Retrieve tool that localizes relevant paragraphs while exposing their structural coordinates (with lightweight scanning context), and a ReadSection tool that enables contiguous, order-preserving reading within a specified section and paragraph range. Our experiments demonstrate that DeepRead achieves significant improvements over Search-o1-style agentic search in document question answering. The synergistic effect between retrieval and reading tools is also validated. Our fine-grained behavioral analysis reveals a reading and reasoning paradigm resembling human-like "locate then read" behavior.
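A minimal sketch of the coordinate-style paragraph index and the two tools described above, with retrieval reduced to keyword overlap; the class and method names are illustrative and not DeepRead's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Paragraph:
    section_id: str   # section identity, e.g. "3.2"
    order: int        # in-section paragraph order
    text: str

class StructuredDoc:
    """Paragraph-level index keyed by (section_id, order) coordinates."""
    def __init__(self, paragraphs):
        self.paragraphs = paragraphs

    def retrieve(self, query, top_k=3):
        """Localize relevant paragraphs and expose their structural coordinates."""
        q = set(query.lower().split())
        scored = sorted(self.paragraphs,
                        key=lambda p: len(q & set(p.text.lower().split())),
                        reverse=True)
        return [(p.section_id, p.order, p.text) for p in scored[:top_k]]

    def read_section(self, section_id, start=0, end=None):
        """Contiguous, order-preserving reading within one section."""
        paras = sorted((p for p in self.paragraphs if p.section_id == section_id),
                       key=lambda p: p.order)
        return [p.text for p in paras[start:end]]

doc = StructuredDoc([
    Paragraph("1", 0, "We study long-document question answering."),
    Paragraph("2", 0, "Documents are converted to structured Markdown."),
    Paragraph("2", 1, "Each paragraph receives a coordinate-style key."),
])
hits = doc.retrieve("coordinate key for paragraph")
context = doc.read_section("2")
```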
[263] MINT: Minimal Information Neuro-Symbolic Tree for Objective-Driven Knowledge-Gap Reasoning and Active Elicitation
Zeyu Fang, Tian Lan, Mahdi Imani
Main category: cs.AI
TL;DR: MINT is a neuro-symbolic framework that enables AI agents to actively elicit human inputs through optimal questioning strategies in object-driven planning tasks with knowledge gaps.
Details
Motivation: Planning in open-world environments often involves incomplete information about objects and human goals, creating knowledge gaps that hinder effective human-AI collaboration. Current approaches lack systematic methods for AI agents to actively elicit missing information through strategic questioning.Method: Proposes Minimal Information Neuro-Symbolic Tree (MINT) which builds symbolic trees of possible human-AI interactions, uses neural planning policies to estimate uncertainty from knowledge gaps, leverages LLMs to search reasoning processes and curate optimal queries, and employs self-play to optimize elicitation strategies.
Result: MINT achieves near-expert performance on three benchmarks with unseen/unknown objects, attaining significantly improved rewards and success rates while asking only a limited number of questions per task.
Conclusion: MINT provides an effective framework for AI agents to actively bridge knowledge gaps through strategic questioning, enabling better human-AI collaboration in object-driven planning tasks with incomplete information.
Abstract: Joint planning through language-based interactions is a key area of human-AI teaming. Planning problems in the open world often involve various aspects of incomplete information and unknowns, e.g., objects involved, human goals/intents – thus leading to knowledge gaps in joint planning. We consider the problem of discovering optimal interaction strategies for AI agents to actively elicit human inputs in object-driven planning. To this end, we propose Minimal Information Neuro-Symbolic Tree (MINT) to reason about the impact of knowledge gaps and leverage self-play with MINT to optimize the AI agent’s elicitation strategies and queries. More precisely, MINT builds a symbolic tree by making propositions of possible human-AI interactions and by consulting a neural planning policy to estimate the uncertainty in planning outcomes caused by remaining knowledge gaps. Finally, we leverage an LLM to search and summarize MINT’s reasoning process and curate a set of queries to optimally elicit human inputs for best planning performance. By considering a family of extended Markov decision processes with knowledge gaps, we analyze the return guarantee for a given MINT with active human elicitation. Our evaluation on three benchmarks involving unseen/unknown objects of increasing realism shows that MINT-based planning attains near-expert returns by issuing a limited number of questions per task while achieving significantly improved rewards and success rates.
[264] Evaluating Large Language Models on Solved and Unsolved Problems in Graph Theory: Implications for Computing Education
Adithya Kulkarni, Mohna Chakraborty, Jay Bagga
Main category: cs.AI
TL;DR: LLMs show strong performance on solved graph theory problems but limited capability on open problems, highlighting their utility for conceptual exploration but not novel mathematical insight.
Details
Motivation: As LLMs become integrated into computer science education, there's a need to understand their reliability in supporting mathematically rigorous thinking, particularly in advanced topics like graph theory.Method: Used an eight-stage evaluation protocol reflecting authentic mathematical inquiry to test an LLM on both a solved graph theory problem (gracefulness of line graphs) and an open problem with no known solution.
Result: The LLM performed strongly on the solved problem, producing correct definitions, identifying relevant structures, recalling appropriate results without hallucination, and constructing valid proofs. For the open problem, it generated coherent interpretations and plausible exploratory strategies but didn’t advance toward a solution, while appropriately acknowledging uncertainty.
Conclusion: LLMs can support exploration of established material but remain limited in tasks requiring novel mathematical insight or critical structural reasoning, highlighting the importance of guiding students to use LLMs for conceptual exploration while maintaining independent verification for formal problem solving.
Abstract: Large Language Models are increasingly used by students to explore advanced material in computer science, including graph theory. As these tools become integrated into undergraduate and graduate coursework, it is important to understand how reliably they support mathematically rigorous thinking. This study examines the performance of an LLM on two related graph-theoretic problems: a solved problem concerning the gracefulness of line graphs and an open problem for which no solution is currently known. We use an eight-stage evaluation protocol that reflects authentic mathematical inquiry, including interpretation, exploration, strategy formation, and proof construction. The model performed strongly on the solved problem, producing correct definitions, identifying relevant structures, recalling appropriate results without hallucination, and constructing a valid proof confirmed by a graph theory expert. For the open problem, the model generated coherent interpretations and plausible exploratory strategies but did not advance toward a solution. It did not fabricate results and instead acknowledged uncertainty, which is consistent with the explicit prompting instructions that directed the model to avoid inventing theorems or unsupported claims. These findings indicate that LLMs can support exploration of established material but remain limited in tasks requiring novel mathematical insight or critical structural reasoning. For computing education, this distinction highlights the importance of guiding students to use LLMs for conceptual exploration while relying on independent verification and rigorous argumentation for formal problem solving.
[265] Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents
Changdae Oh, Seongheon Park, To Eun Kim, Jiatong Li, Wendi Li, Samuel Yeh, Xuefeng Du, Hamed Hassani, Paul Bogdan, Dawn Song, Sharon Li
Main category: cs.AI
TL;DR: This paper proposes a new framework for uncertainty quantification in LLM agents, shifting from traditional single-turn QA to interactive agent settings, and introduces a conditional uncertainty reduction perspective.
Details
Motivation: Current uncertainty quantification research focuses on single-turn question-answering, but LLM agents are increasingly deployed in complex interactive tasks requiring new UQ frameworks for realistic agent settings.Method: The paper presents a general formulation of agent UQ that subsumes existing setups, critiques prior works’ uncertainty accumulation approach, and proposes a conditional uncertainty reduction perspective that models reducible uncertainty over agent trajectories by emphasizing action interactivity.
Result: A conceptual framework is developed to provide actionable guidance for designing UQ in LLM agent setups, with practical implications for frontier LLM development and domain-specific applications.
Conclusion: A paradigm shift is needed from single-turn UQ to agent UQ with a conditional uncertainty reduction perspective, highlighting remaining open problems in this emerging research area.
Abstract: Uncertainty quantification (UQ) for large language models (LLMs) is a key building block for safety guardrails of daily LLM applications. Yet, even as LLM agents are increasingly deployed in highly complex tasks, most UQ research still centers on single-turn question-answering. We argue that UQ research must shift to realistic settings with interactive agents, and that a new principled framework for agent UQ is needed. This paper presents the first general formulation of agent UQ that subsumes broad classes of existing UQ setups. Under this formulation, we show that prior works implicitly treat LLM UQ as an uncertainty accumulation process, a viewpoint that breaks down for interactive agents in an open world. In contrast, we propose a novel perspective, a conditional uncertainty reduction process, that explicitly models reducible uncertainty over an agent’s trajectory by highlighting “interactivity” of actions. From this perspective, we outline a conceptual framework to provide actionable guidance for designing UQ in LLM agent setups. Finally, we conclude with practical implications of the agent UQ in frontier LLM development and domain-specific applications, as well as open remaining problems.
[266] Optimizing Mission Planning for Multi-Debris Rendezvous Using Reinforcement Learning with Refueling and Adaptive Collision Avoidance
Agni Bandyopadhyay, Gunther Waxenegger-Wilfing
Main category: cs.AI
TL;DR: RL-based framework for adaptive collision avoidance in active debris removal missions using small satellites, optimizing multi-debris rendezvous with fuel efficiency and real-time maneuver adjustments.
Details
Motivation: Increasing orbital debris poses collision risks for active debris removal missions, requiring adaptive solutions that balance safety with mission efficiency, especially for small satellites used in multi-debris removal operations.Method: Uses reinforcement learning with masked Proximal Policy Optimization (PPO) algorithm to dynamically adjust maneuvers based on real-time orbital conditions, integrating refueling strategies and collision avoidance while optimizing fuel usage and mission time.
Result: The RL framework reduces collision risk while improving mission efficiency compared to traditional heuristic approaches, demonstrated through simulated ADR scenarios using Iridium 33 debris dataset across diverse orbital configurations.
Conclusion: Provides a scalable solution for planning complex multi-debris ADR missions applicable to other multi-target rendezvous problems in autonomous space mission planning.
Abstract: As the orbital environment around Earth becomes increasingly crowded with debris, active debris removal (ADR) missions face significant challenges in ensuring safe operations while minimizing the risk of in-orbit collisions. This study presents a reinforcement learning (RL) based framework to enhance adaptive collision avoidance in ADR missions, specifically for multi-debris removal using small satellites. Small satellites are increasingly adopted due to their flexibility, cost effectiveness, and maneuverability, making them well suited for dynamic missions such as ADR. Building on existing work in multi-debris rendezvous, the framework integrates refueling strategies, efficient mission planning, and adaptive collision avoidance to optimize spacecraft rendezvous operations. The proposed approach employs a masked Proximal Policy Optimization (PPO) algorithm, enabling the RL agent to dynamically adjust maneuvers in response to real-time orbital conditions. Key considerations include fuel efficiency, avoidance of active collision zones, and optimization of dynamic orbital parameters. The RL agent learns to determine efficient sequences for rendezvousing with multiple debris targets, optimizing fuel usage and mission time while incorporating necessary refueling stops. Simulated ADR scenarios derived from the Iridium 33 debris dataset are used for evaluation, covering diverse orbital configurations and debris distributions to demonstrate robustness and adaptability. Results show that the proposed RL framework reduces collision risk while improving mission efficiency compared to traditional heuristic approaches. This work provides a scalable solution for planning complex multi-debris ADR missions and is applicable to other multi-target rendezvous problems in autonomous space mission planning.
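The masking mechanism itself is generic; below is a minimal sketch of how infeasible actions (already-removed debris, transfers the remaining fuel cannot cover) are excluded from the policy distribution in a masked-PPO step. The API shown is plain PyTorch, not the authors' code, and the mask semantics are assumptions.

```python
import torch

def masked_action_distribution(logits, mask):
    """Masked policy step: invalid actions get -inf logits so they are never sampled.

    logits: (num_targets,) policy logits; mask: (num_targets,) bool, True = valid.
    """
    masked_logits = logits.masked_fill(~mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# toy usage: 5 candidate debris targets, two infeasible this step
dist = masked_action_distribution(torch.randn(5),
                                  torch.tensor([True, False, True, True, False]))
action = dist.sample()
log_prob = dist.log_prob(action)   # used later in the PPO probability ratio
```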
[267] VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
Kate H. Bentley, Luca Belli, Adam M. Chekroud, Emily J. Ward, Emily R. Dworkin, Emily Van Ark, Kelly M. Johnston, Will Alexander, Millard Brown, Matt Hawrilenko
Main category: cs.AI
TL;DR: VERA-MH evaluation framework shows strong clinical validity and reliability for assessing AI chatbot safety in mental health contexts, with LLM judges aligning well with clinician consensus.
Details
Motivation: As millions use AI chatbots for psychological support, there's an urgent need for evidence-based safety benchmarks to ensure these tools are safe, particularly for high-risk scenarios like suicide prevention.Method: Simulated conversations between LLM-based user-agents and general-purpose AI chatbots were rated by licensed mental health clinicians using a scoring rubric. An LLM-based judge used the same rubric to evaluate the same conversations, with comparisons made between clinician ratings and between clinician consensus and the LLM judge.
Result: Clinicians showed strong inter-rater reliability (0.77), establishing a gold-standard reference. The LLM judge was strongly aligned with clinical consensus (0.81), and user-agents were generally perceived as realistic by clinicians.
Conclusion: The VERA-MH evaluation demonstrates clinical validity and reliability as an automated AI safety benchmark for mental health, supporting its use for ensuring chatbot safety while further research is needed on generalizability and robustness.
Abstract: Millions now use leading generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based automated safety benchmark. This study aimed to examine the clinical validity and reliability of the VERA-MH evaluation for AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then compared rating alignment across (a) individual clinicians and (b) clinician consensus and the LLM judge, and (c) examined clinicians’ ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR]: 0.77), thus establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus (IRR: 0.81) overall and within key conditions. Clinician raters generally perceived the user-agents to be realistic. For the potential mental health benefits of AI chatbots to be realized, attention to safety is paramount. Findings from this human evaluation study support the clinical validity and reliability of VERA-MH: an open-source, fully automated AI safety evaluation for mental health. Further research will address VERA-MH generalizability and robustness.
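The study reports chance-corrected inter-rater reliability without naming the statistic in this summary; Cohen's kappa for two raters over categorical safety labels is one standard chance-corrected measure, sketched below purely for illustration.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters over categorical labels."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# toy usage: "safe"/"unsafe" ratings of the same simulated conversations
a = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
b = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
print(round(cohens_kappa(a, b), 2))
```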
[268] Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents
Stephen Pilli, Vivek Nallur
Main category: cs.AI
TL;DR: LLMs can predict individual-level cognitive biases and emulate biased human decision-making in interactive contexts, with GPT-4 and GPT-5 showing differences in how they align with human behavior.
Details
Motivation: While LLMs have been shown to reproduce known cognitive biases, the paper investigates whether they can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors like cognitive load interact with these biases in interactive settings.Method: Adapted three well-established decision scenarios into conversational settings, conducted human experiments (N=1100) with chatbots facilitating decision-making through simple or complex dialogues. Used participant demographics and dialogue transcripts to simulate conditions with GPT-4 and GPT-5 LLMs.
Result: Human experiments revealed robust cognitive biases. LLMs reproduced human biases with precision, with notable differences between GPT-4 and GPT-5 in how they aligned with human behavior.
Conclusion: LLMs can effectively emulate human decision-making biases in interactive contexts, which has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems.
Abstract: Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors, such as cognitive load, interact with these biases. We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment (N=1100). Participants engaged with a chatbot that facilitates decision-making through simple or complex dialogues. Results revealed robust biases. To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5. The LLMs reproduced human biases with precision. We found notable differences between models in how they aligned with human behavior. This has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems in interactive contexts.
[269] Evaluating Robustness and Adaptability in Learning-Based Mission Planning for Active Debris Removal
Agni Bandyopadhyay, Günther Waxenegger-Wilfing
Main category: cs.AI
TL;DR: Comparison of three planning approaches for Active Debris Removal missions: nominal PPO, domain-randomized PPO, and Monte Carlo Tree Search, evaluated in high-fidelity orbital simulations with varying constraints.
Details
Motivation: Autonomous mission planning for Active Debris Removal needs to balance efficiency, adaptability, and strict feasibility constraints on fuel and mission duration, requiring robust planning methods that can handle distributional shifts.Method: Three planners compared: 1) nominal Masked PPO trained under fixed mission parameters, 2) domain-randomized Masked PPO trained across varying constraints for robustness, and 3) plain Monte Carlo Tree Search baseline. Evaluated in high-fidelity orbital simulation with refueling, realistic transfer dynamics, and randomized debris fields across 300 test cases.
Result: Nominal PPO achieves top performance when conditions match training but degrades sharply under distributional shift. Domain-randomized PPO shows improved adaptability with moderate loss in nominal performance. MCTS handles constraint changes best due to online replanning but has orders-of-magnitude higher computation time.
Conclusion: There’s a trade-off between speed of learned policies and adaptability of search-based methods. Combining training-time diversity with online planning could be promising for future resilient ADR mission planners.
Abstract: Autonomous mission planning for Active Debris Removal (ADR) must balance efficiency, adaptability, and strict feasibility constraints on fuel and mission duration. This work compares three planners for the constrained multi-debris rendezvous problem in Low Earth Orbit: a nominal Masked Proximal Policy Optimization (PPO) policy trained under fixed mission parameters, a domain-randomized Masked PPO policy trained across varying mission constraints for improved robustness, and a plain Monte Carlo Tree Search (MCTS) baseline. Evaluations are conducted in a high-fidelity orbital simulation with refueling, realistic transfer dynamics, and randomized debris fields across 300 test cases in nominal, reduced fuel, and reduced mission time scenarios. Results show that nominal PPO achieves top performance when conditions match training but degrades sharply under distributional shift, while domain-randomized PPO exhibits improved adaptability with only moderate loss in nominal performance. MCTS consistently handles constraint changes best due to online replanning but incurs orders-of-magnitude higher computation time. The findings underline a trade-off between the speed of learned policies and the adaptability of search-based methods, and suggest that combining training-time diversity with online planning could be a promising path for future resilient ADR mission planners.
[270] GAMMS: Graph based Adversarial Multiagent Modeling Simulator
Rohan Patil, Jai Malegaonkar, Xiao Jiang, Andre Dion, Gaurav S. Sukhatme, Henrik I. Christensen
Main category: cs.AI
TL;DR: GAMMS is a lightweight, graph-based simulation framework for multi-agent systems that prioritizes scalability, ease of use, and integration with external tools including machine learning libraries and LLMs.
Details
Motivation: Existing high-fidelity simulators are computationally expensive and not suitable for rapid prototyping or large-scale agent deployments, creating a need for accessible, scalable simulation tools for multi-agent coordination research.Method: Developed GAMMS as a lightweight, extensible simulation framework using graph-based representations of environments. The framework emphasizes five core objectives: scalability, ease of use, integration-first architecture, fast visualization feedback, and real-world grounding.
Result: GAMMS enables efficient simulation of complex domains like urban road networks and communication systems, supports integration with external tools, provides built-in visualization, and is agnostic to policy type including heuristic, optimization-based, learning-based, and LLM-based agents.
Conclusion: GAMMS lowers the barrier to entry for multi-agent systems research by enabling high-performance simulations on standard hardware, facilitating experimentation in multi-agent systems, autonomous planning, and adversarial modeling.
Abstract: As intelligent systems and multi-agent coordination become increasingly central to real-world applications, there is a growing need for simulation tools that are both scalable and accessible. Existing high-fidelity simulators, while powerful, are often computationally expensive and ill-suited for rapid prototyping or large-scale agent deployments. We present GAMMS (Graph based Adversarial Multiagent Modeling Simulator), a lightweight yet extensible simulation framework designed to support fast development and evaluation of agent behavior in environments that can be represented as graphs. GAMMS emphasizes five core objectives: scalability, ease of use, integration-first architecture, fast visualization feedback, and real-world grounding. It enables efficient simulation of complex domains such as urban road networks and communication systems, supports integration with external tools (e.g., machine learning libraries, planning solvers), and provides built-in visualization with minimal configuration. GAMMS is agnostic to policy type, supporting heuristic, optimization-based, and learning-based agents, including those using large language models. By lowering the barrier to entry for researchers and enabling high-performance simulations on standard hardware, GAMMS facilitates experimentation and innovation in multi-agent systems, autonomous planning, and adversarial modeling. The framework is open-source and available at https://github.com/GAMMSim/GAMMS/
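In the spirit of the framework's graph-represented worlds, here is a minimal pursuit-evasion loop on a grid graph using networkx; the policies, node naming, and loop structure are illustrative and do not reflect GAMMS's actual API.

```python
import random
import networkx as nx

G = nx.grid_2d_graph(5, 5)          # stand-in for an urban road network
pursuer, evader = (0, 0), (4, 4)

def pursuer_policy(G, pursuer, evader):
    """Greedy heuristic: move to the neighbor closest to the evader."""
    return min(G.neighbors(pursuer),
               key=lambda n: nx.shortest_path_length(G, n, evader))

def evader_policy(G, evader, pursuer):
    """Random evader: pick any neighboring node."""
    return random.choice(list(G.neighbors(evader)))

for step in range(30):
    pursuer = pursuer_policy(G, pursuer, evader)
    if pursuer == evader:
        print(f"caught at step {step}")
        break
    evader = evader_policy(G, evader, pursuer)
```

Heuristic policies like these could in principle be swapped for learned or LLM-driven ones, which is the kind of policy-agnostic plug-in the framework advertises.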
[271] Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai
Main category: cs.AI
TL;DR: A framework for evaluating LLM reasoning in payment risk assessment using multi-evaluator approach with consensus metrics, revealing model biases and alignment with human judgment.
Details
Motivation: LLMs are increasingly used as evaluators of reasoning quality in financial settings, but their reliability and bias in payments-risk contexts remain poorly understood, necessitating a structured evaluation framework.Method: Introduced a structured multi-evaluator framework combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generated and cross-evaluated MCC risk rationales under attributed and anonymized conditions, using a consensus-deviation metric to eliminate circularity by comparing each judge’s score to the mean of all other judges.
Result: Results showed substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet exhibited negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 displayed positive bias (+0.77, +0.71), with bias attenuating by 25.8% under anonymization. LLM judges assigned scores averaging +0.46 points above human consensus, and ground-truth validation using payment-network data showed four models exhibited statistically significant alignment (Spearman rho = 0.56 to 0.77).
Conclusion: The framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings, with negative bias in some models reflecting closer alignment with human judgment.
Abstract: Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge’s score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
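A minimal sketch of the leave-one-out consensus-deviation idea described above: each judge's score on each item is compared with the mean score of all other judges. Averaging the deviations into a single per-judge bias number is an assumption made here for compactness.

```python
import numpy as np

def consensus_deviation(scores):
    """Per-judge deviation from the mean of all *other* judges.

    scores: (n_judges, n_items) matrix of rubric scores.
    Returns an (n_judges,) vector; positive means a judge scores above the
    leave-one-out consensus, negative means below it.
    """
    scores = np.asarray(scores, dtype=float)
    n_judges = scores.shape[0]
    total = scores.sum(axis=0)
    deviations = []
    for j in range(n_judges):
        others_mean = (total - scores[j]) / (n_judges - 1)   # leave-one-out consensus
        deviations.append((scores[j] - others_mean).mean())
    return np.array(deviations)

# toy usage: 5 judges scoring 4 rationales on a 1-10 rubric
scores = np.array([[7, 8, 6, 7],
                   [6, 7, 6, 6],
                   [8, 9, 8, 8],
                   [7, 7, 7, 6],
                   [9, 9, 8, 9]])
print(consensus_deviation(scores).round(2))
```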
[272] Democratic Preference Alignment via Sortition-Weighted RLHF
Suvadip Sana, Jinzhou Wu, Martin T. Wells
Main category: cs.AI
TL;DR: Democratic Preference Optimization (DemPO) applies algorithmic sortition (citizen assembly mechanism) to AI preference alignment, using either exclusive training on quota-satisfying mini-publics (Hard Panel) or reweighting all raters by inclusion probability (Soft Panel) to improve demographic representativeness.
Details
Motivation: Current preference-based AI alignment methods like RLHF rely on convenience samples that systematically over/under-represent certain demographics, potentially embedding unrepresentative values into AI systems. The paper addresses the question: "Whose values should AI systems learn?"Method: Introduces Democratic Preference Optimization (DemPO) with two schemes: 1) Hard Panel trains exclusively on preferences from quota-satisfying mini-publics sampled via sortition, 2) Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. The authors prove Soft Panel weighting recovers the expected Hard Panel objective in closed form.
Result: Using a public preference dataset with human judgments, rater demographics, and a 75-clause constitution from a representative US panel, Llama models (1B to 8B parameters) were evaluated. Across six aggregation methods, Hard Panel consistently ranked first and Soft Panel consistently outperformed unweighted baselines, with effect sizes growing as model capacity increases.
Conclusion: Enforcing demographic representativeness at the preference collection stage (via sortition mechanisms) yields AI models whose behavior better reflects values elicited from representative publics, rather than relying on post hoc corrections of biased convenience samples.
Abstract: Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over-represent some demographics and under-represent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than post hoc correction, yields models whose behavior better reflects values elicited from representative publics.
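A minimal sketch of the Soft Panel weighting, shown here on a DPO-style pairwise objective as a stand-in for the preference-based fine-tuning loss: every comparison is weighted by the inclusion probability of the rater who provided it under the sortition lottery. The function name, the DPO form, and the weight normalization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_panel_preference_loss(logp_chosen, logp_rejected,
                               ref_logp_chosen, ref_logp_rejected,
                               inclusion_prob, beta=0.1):
    """DPO-style pairwise loss, with each comparison weighted by the sortition
    inclusion probability of its rater; all inputs are (batch,) tensors."""
    logits = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    per_example = -F.logsigmoid(logits)               # standard pairwise term
    weights = inclusion_prob / inclusion_prob.sum()   # normalize rater weights
    return (weights * per_example).sum()

# toy usage with random log-probabilities for four comparisons
b = 4
loss = soft_panel_preference_loss(
    torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b),
    inclusion_prob=torch.tensor([0.9, 0.2, 0.5, 0.8]))
print(loss.item())
```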
[273] Beyond Prompting: Efficient and Robust Contextual Biasing for Speech LLMs via Logit-Space Integration (LOGIC)
Peidong Wang
Main category: cs.AI
TL;DR: LOGIC is a framework for Speech LLMs that enables efficient contextual biasing for new entities without prompting limitations, achieving 9% WER reduction with minimal false alarms.
Details
Motivation: Speech LLMs struggle with recognizing new domain-specific entities (names, jargon, etc.) due to static training. Prompting solutions have scalability issues with context windows, latency, and lost-in-the-middle problems. GEC approaches suffer from over-correction hallucinations.Method: LOGIC (Logit-Space Integration for Contextual Biasing) operates directly in the decoding layer, decoupling context injection from input processing. This ensures constant-time complexity relative to prompt length, avoiding the limitations of prompting approaches.
Result: Experiments with Phi-4-MM model across 11 multilingual locales show LOGIC achieves average 9% relative reduction in Entity Word Error Rate with only 0.30% increase in False Alarm Rate.
Conclusion: LOGIC provides an efficient and robust framework for contextual biasing in Speech LLMs that addresses scalability issues of prompting while avoiding over-correction problems of GEC approaches.
Abstract: The rapid emergence of new entities – driven by cultural shifts, evolving trends, and personalized user data – poses a significant challenge for existing Speech Large Language Models (Speech LLMs). While these models excel at general conversational tasks, their static training knowledge limits their ability to recognize domain-specific terms such as contact names, playlists, or technical jargon. Existing solutions primarily rely on prompting, which suffers from poor scalability: as the entity list grows, prompting encounters context window limitations, increased inference latency, and the “lost-in-the-middle” phenomenon. An alternative approach, Generative Error Correction (GEC), attempts to rewrite transcripts via post-processing but frequently suffers from “over-correction”, introducing hallucinations of entities that were never spoken. In this work, we introduce LOGIC (Logit-Space Integration for Contextual Biasing), an efficient and robust framework that operates directly in the decoding layer. Unlike prompting, LOGIC decouples context injection from input processing, ensuring constant-time complexity relative to prompt length. Extensive experiments using the Phi-4-MM model across 11 multilingual locales demonstrate that LOGIC achieves an average 9% relative reduction in Entity WER with a negligible 0.30% increase in False Alarm Rate.
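The summary says only that LOGIC operates in the decoding layer, so the sketch below shows generic logit-space contextual biasing in a shallow-fusion style: tokens that would continue any biasing entity, given the decoded prefix, receive a bonus before sampling. The matching rule, boost value, and token IDs are assumptions rather than the paper's mechanism.

```python
import torch

def bias_logits(logits, generated_ids, entity_token_ids, boost=2.0):
    """Add a bonus to next-token logits that would continue a biasing entity.

    logits:           (vocab,) next-token logits from the decoder
    generated_ids:    list[int], tokens decoded so far
    entity_token_ids: list[list[int]], tokenized entity phrases (names, jargon, ...)
    """
    biased = logits.clone()
    for entity in entity_token_ids:
        for matched in range(len(entity)):
            # does the decoded suffix equal the first `matched` tokens of the entity?
            suffix = generated_ids[len(generated_ids) - matched:] if matched else []
            if suffix == entity[:matched]:
                biased[entity[matched]] += boost   # boost the next entity token
    return biased

# toy usage with a hypothetical entity tokenized as [812, 90, 417]
logits = torch.randn(1000)
biased = bias_logits(logits, generated_ids=[5, 812, 90],
                     entity_token_ids=[[812, 90, 417]])
assert biased[417] > logits[417]
```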
[274] SocialVeil: Probing Social Intelligence of Language Agents under Communication Barriers
Keyang Xuan, Pengda Wang, Chongrui Ye, Haofei Yu, Tal August, Jiaxuan You
Main category: cs.AI
TL;DR: SocialVeil: A social learning environment simulating communication barriers (semantic vagueness, sociocultural mismatch, emotional interference) to evaluate LLMs’ social intelligence in realistic imperfect settings.
Details
Motivation: Existing LLM benchmarks assume idealized communication, limiting ability to diagnose whether LLMs can maintain and repair interactions in realistic imperfect settings with communication barriers.Method: Developed SocialVeil environment based on systematic literature review of human communication challenges, introducing three disruption types and two barrier-aware metrics (unresolved confusion, mutual understanding). Tested across 720 scenarios with four frontier LLMs.
Result: Communication barriers consistently impair LLM performance: mutual understanding reduced by over 45% on average, confusion elevated by nearly 50%. Human evaluations validate barrier fidelity (ICC≈0.78, Pearson r≈0.80). Adaptation strategies show only modest improvement.
Conclusion: SocialVeil advances social interaction environments toward real-world communication, revealing LLMs’ limitations in handling communication barriers and opening opportunities for exploring social intelligence in imperfect settings.
Abstract: Large language models (LLMs) are increasingly evaluated in interactive environments to test their social intelligence. However, existing benchmarks often assume idealized communication between agents, limiting our ability to diagnose whether LLMs can maintain and repair interactions in more realistic, imperfect settings. To close this gap, we present \textsc{SocialVeil}, a social learning environment that can simulate social interaction under cognitive-difference-induced communication barriers. Grounded in a systematic literature review of communication challenges in human interaction, \textsc{SocialVeil} introduces three representative types of such disruption, \emph{semantic vagueness}, \emph{sociocultural mismatch}, and \emph{emotional interference}. We also introduce two barrier-aware evaluation metrics, \emph{unresolved confusion} and \emph{mutual understanding}, to evaluate interaction quality under impaired communication. Experiments across 720 scenarios and four frontier LLMs show that barriers consistently impair performance, with mutual understanding reduced by over 45% on average, and confusion elevated by nearly 50%. Human evaluations validate the fidelity of these simulated barriers (ICC$\approx$0.78, Pearson r$\approx$0.80). We further demonstrate that adaptation strategies (Repair Instruction and Interactive learning) only have a modest effect far from barrier-free performance. This work takes a step toward bringing social interaction environments closer to real-world communication, opening opportunities for exploring the social intelligence of LLM agents.
[275] Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics
Giovanni Briglia, Francesco Fabiano, Stefano Mariani
Main category: cs.AI
TL;DR: GNN-based heuristics for multi-agent epistemic planning using Kripke structures to improve scalability
Details
Motivation: Multi-agent epistemic planning (MEP) requires reasoning about both physical world and agent beliefs, but existing heuristics don't scale well with Kripke structure representations, leading to exponential search spaces and intractability.Method: Use Graph Neural Networks (GNNs) to learn patterns in epistemic states (represented as Kripke structures/directed labeled graphs), derive predictive heuristics for state quality (e.g., distance to goal), and integrate these into epistemic planning pipeline.
Result: The GNN-based heuristics show improvements in scalability of multi-agent epistemic planning compared to standard baselines.
Conclusion: GNNs effectively capture graph-like nature of Kripke models and enable learning meaningful heuristics that improve epistemic planning scalability.
Abstract: Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality – e.g., the distance from the nearest goal – by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing improvements in the scalability of multi-agent epistemic planning.
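A minimal sketch of the core idea of scoring a Kripke-style state graph with a message-passing network to obtain a heuristic value such as estimated distance-to-goal; the layer sizes, mean aggregation, and graph-level readout are assumptions, and the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class KripkeHeuristic(nn.Module):
    """Maps a Kripke-style state graph (node features + accessibility matrix)
    to a scalar heuristic estimate, e.g. distance to the nearest goal."""
    def __init__(self, in_dim=8, hidden=32):
        super().__init__()
        self.msg1 = nn.Linear(in_dim, hidden)
        self.msg2 = nn.Linear(hidden, hidden)
        self.readout = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        # x: (num_worlds, in_dim); adj: (num_worlds, num_worlds) accessibility matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        h = torch.relu(self.msg1(adj @ x / deg))   # aggregate neighbor features
        h = torch.relu(self.msg2(adj @ h / deg))
        return self.readout(h.mean(dim=0))         # graph-level heuristic value

# toy usage: 6 possible worlds with random accessibility relations
x = torch.randn(6, 8)
adj = (torch.rand(6, 6) > 0.5).float()
estimate = KripkeHeuristic()(x, adj)
```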
[276] CAST-CKT: Chaos-Aware Spatio-Temporal and Cross-City Knowledge Transfer for Traffic Flow Prediction
Abdul Joseph Fofanah, Lian Wen, David Chen, Alpha Alimamy Kamara, Zhongyi Zhang
Main category: cs.AI
TL;DR: CAST-CKT is a chaos-aware spatio-temporal framework for cross-city traffic prediction that uses chaotic analysis to quantify predictability regimes and enables few-shot learning through adaptive attention, dynamic topology learning, and cross-city alignment.
Details
Motivation: Traffic prediction in data-scarce cross-city settings is challenging due to complex nonlinear dynamics and domain shifts. Existing methods fail to capture traffic's inherent chaotic nature for effective few-shot learning.Method: Proposes CAST-CKT with: 1) chaotic analyser to quantify traffic predictability regimes, 2) chaos-aware attention for regime-adaptive temporal modelling, 3) adaptive topology learning for dynamic spatial dependencies, 4) chaotic consistency-based cross-city alignment for knowledge transfer, and 5) horizon-specific predictions with uncertainty quantification.
Result: Extensive experiments on four benchmarks in cross-city few-shot settings show CAST-CKT outperforms state-of-the-art methods by significant margins in MAE and RMSE, while offering interpretable regime analysis.
Conclusion: CAST-CKT effectively addresses cross-city traffic prediction challenges by incorporating chaotic analysis, enabling better few-shot learning and knowledge transfer with theoretical generalization guarantees.
Abstract: Traffic prediction in data-scarce, cross-city settings is challenging due to complex nonlinear dynamics and domain shifts. Existing methods often fail to capture traffic’s inherent chaotic nature for effective few-shot learning. We propose CAST-CKT, a novel Chaos-Aware Spatio-Temporal and Cross-City Knowledge Transfer framework. It employs an efficient chaotic analyser to quantify traffic predictability regimes, driving several key innovations: chaos-aware attention for regime-adaptive temporal modelling; adaptive topology learning for dynamic spatial dependencies; and chaotic consistency-based cross-city alignment for knowledge transfer. The framework also provides horizon-specific predictions with uncertainty quantification. Theoretical analysis shows improved generalisation bounds. Extensive experiments on four benchmarks in cross-city few-shot settings show CAST-CKT outperforms state-of-the-art methods by significant margins in MAE and RMSE, while offering interpretable regime analysis. Code is available at https://github.com/afofanah/CAST-CKT.
[277] HugRAG: Hierarchical Causal Knowledge Graph Design for RAG
Nengbo Wang, Tuo Liang, Vikash Singh, Chaoda Song, Van Yang, Yu Yin, Jing Ma, Jagdip Singh, Vipin Chaudhary
Main category: cs.AI
TL;DR: HugRAG is a framework that enhances graph-based retrieval augmented generation by incorporating causal gating across hierarchical modules to suppress spurious correlations and enable scalable reasoning over large knowledge graphs.
Details
Motivation: Existing graph-based RAG methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Current causality approaches are limited to local contexts and suffer from information isolation in modular graph structures, hindering scalability and cross-module causal reasoning.Method: HugRAG rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. It explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs.
Result: Extensive experiments show that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics.
Conclusion: HugRAG establishes a principled foundation for structured, scalable, and causally grounded RAG systems, addressing key limitations in current graph-based retrieval augmented generation approaches.
Abstract: Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graph-based RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.
[278] First Proof
Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, Lauren Williams
Main category: cs.AI
TL;DR: A set of 10 research-level mathematics questions created by authors to test AI systems’ mathematical reasoning capabilities, with answers temporarily encrypted
Details
Motivation: To evaluate the current capabilities of AI systems in solving authentic research-level mathematics problems that arise naturally in academic researchMethod: Created 10 original mathematics questions from authors’ research work, kept answers encrypted temporarily to prevent training data contamination
Result: A benchmark dataset of research-level math questions for testing AI systems, with answers known to authors but temporarily withheld
Conclusion: Provides a challenging testbed for assessing AI mathematical reasoning on authentic research problems, with controlled answer release to prevent data leakage
Abstract: To assess the ability of current AI systems to correctly answer research-level mathematics questions, we share a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now; the answers are known to the authors of the questions but will remain encrypted for a short time.
[279] Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering
Fengxian Chen, Zhilong Tao, Jiaxuan Li, Yunlong Li, Qingguo Zhou
Main category: cs.AI
TL;DR: A retrieval-augmented generation system for Chinese Tibetan medicine that addresses challenges in multi-KB settings by using KB routing and alignment graphs to improve traceability and reduce hallucinations.
Details
Motivation: Domain settings with multiple heterogeneous knowledge bases (KBs) pose challenges for RAG systems, where dense encyclopedia entries can dominate retrieval even when classics or clinical papers provide more authoritative evidence, especially in specialized domains like Chinese Tibetan medicine.Method: Two complementary methods: 1) DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and prioritize authoritative sources; 2) Alignment graph guides evidence fusion and coverage-aware packing to improve cross-KB evidence coverage without naive concatenation. Uses lightweight generator openPangu-Embedded-7B.
Result: Consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving best CrossEv@5 while maintaining strong faithfulness and citation correctness on a 500-query benchmark covering both single-KB and cross-KB questions.
Conclusion: The proposed methods effectively address multi-KB RAG challenges in specialized domains by improving traceability, reducing hallucinations, and enabling cross-KB verification through intelligent routing and evidence fusion techniques.
Abstract: Retrieval-augmented generation (RAG) promises grounded question answering, yet domain settings with multiple heterogeneous knowledge bases (KBs) remain challenging. In Chinese Tibetan medicine, encyclopedia entries are often dense and easy to match, which can dominate retrieval even when classics or clinical papers provide more authoritative evidence. We study a practical setting with three KBs (encyclopedia, classics, and clinical papers) and a 500-query benchmark (cutoff $K{=}5$) covering both single-KB and cross-KB questions. We propose two complementary methods to improve traceability, reduce hallucinations, and enable cross-KB verification. First, DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and to prioritize authoritative sources when appropriate. Second, we use an alignment graph to guide evidence fusion and coverage-aware packing, improving cross-KB evidence coverage without relying on naive concatenation. All answers are generated by a lightweight generator, \textsc{openPangu-Embedded-7B}. Experiments show consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving the best CrossEv@5 while maintaining strong faithfulness and citation correctness.
[280] Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink
Guozhi Liu, Weiwei Lin, Tiansheng Huang, Ruichao Mo, Qi Mu, Xiumin Wang, Li Shen
Main category: cs.AI
TL;DR: Surgery: A fine-tuning defense method that uses attention sink divergence analysis to mitigate harmful fine-tuning in LLMs by suppressing positive sink divergence in attention heads.
Details
Motivation: Harmful fine-tuning can compromise the safety alignment of large language models, creating significant safety risks. Current defenses are insufficient, so the authors explore attention mechanisms to detect and prevent harmful pattern learning during fine-tuning.Method: The method analyzes attention sink divergence in attention heads, finding that harmful fine-tuning increases positive sink divergence. Based on a separable sink divergence hypothesis, Surgery uses a regularizer to suppress positive sink divergence, steering attention heads toward negative divergence to reduce harmful pattern learning.
Result: Surgery improves defense performance by 5.90% on BeaverTails, 11.25% on HarmBench, and 9.55% on SorryBench benchmarks, demonstrating effectiveness in mitigating harmful fine-tuning while maintaining model utility.
Conclusion: Attention sink divergence provides a measurable signal for harmful fine-tuning detection, and Surgery offers an effective fine-tuning-stage defense that improves safety alignment without compromising model performance.
Abstract: Harmful fine-tuning can invalidate safety alignment of large language models, exposing significant safety risks. In this paper, we utilize the attention sink mechanism to mitigate harmful fine-tuning. Specifically, we first measure a statistic named \emph{sink divergence} for each attention head and observe that \emph{different attention heads exhibit two different signs of sink divergence}. To understand its safety implications, we conduct experiments and find that the number of attention heads of positive sink divergence increases along with the increase of the model’s harmfulness when undergoing harmful fine-tuning. Based on this finding, we propose a separable sink divergence hypothesis – \emph{attention heads associating with learning harmful patterns during fine-tuning are separable by their sign of sink divergence}. Based on the hypothesis, we propose a fine-tuning-stage defense, dubbed Surgery. Surgery utilizes a regularizer for sink divergence suppression, which steers attention heads toward the negative sink divergence group, thereby reducing the model’s tendency to learn and amplify harmful patterns. Extensive experiments demonstrate that Surgery improves defense performance by 5.90%, 11.25%, and 9.55% on the BeaverTails, HarmBench, and SorryBench benchmarks, respectively. Source code is available on https://github.com/Lslland/Surgery.
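The summary does not define sink divergence precisely, so the sketch below only illustrates the suppression regularizer: given a per-head sink-divergence statistic computed elsewhere from attention weights, positive values are penalized so heads are steered toward the negative-divergence group. The names, shapes, and quadratic penalty are assumptions.

```python
import torch

def sink_divergence_penalty(sink_divergence, weight=0.1):
    """Penalize only positive per-head sink-divergence values during fine-tuning.

    sink_divergence: (num_layers, num_heads) tensor, computed elsewhere from
    the attention weights assigned to the sink token.
    """
    return weight * torch.clamp(sink_divergence, min=0.0).pow(2).mean()

# toy usage: add the penalty to the fine-tuning loss
task_loss = torch.tensor(0.7)                           # stand-in fine-tuning loss
divergence = torch.randn(32, 32, requires_grad=True)    # stand-in statistic
total_loss = task_loss + sink_divergence_penalty(divergence)
total_loss.backward()
```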
[281] Explainable AI: A Combined XAI Framework for Explaining Brain Tumour Detection Models
Patrick McGonagle, William Farrelly, Kevin Curran
Main category: cs.AI
TL;DR: Integrated multiple XAI techniques (GRAD-CAM, LRP, SHAP) to enhance interpretability of CNN for brain tumor detection, achieving 91.24% accuracy on BraTS 2021 dataset.
Details
Motivation: To improve transparency and trust in AI-driven medical imaging by providing comprehensive explanations of deep learning model decisions for brain tumor detection, addressing the black-box nature of such systems in critical healthcare applications.Method: Developed custom CNN trained on BraTS 2021 dataset, then integrated three XAI techniques: GRAD-CAM for spatial region importance, LRP for pixel-level relevance, and SHAP for feature contribution quantification. Multi-technique approach provides layered explanations from broad regions to pixel details.
Result: Achieved 91.24% accuracy in tumor detection, successfully identified both full and partial tumors. Integrated XAI approach showed superior explanatory power compared to individual methods, effectively explaining model predictions including challenging cases with partial tumor visibility.
Conclusion: Integrated XAI techniques enhance transparency and trust in AI medical imaging by providing comprehensive model reasoning insights. This approach demonstrates potential for improving reliability and interpretability of AI systems in healthcare, particularly for critical tasks like brain tumor detection.
Abstract: This study explores the integration of multiple Explainable AI (XAI) techniques to enhance the interpretability of deep learning models for brain tumour detection. A custom Convolutional Neural Network (CNN) was developed and trained on the BraTS 2021 dataset, achieving 91.24% accuracy in distinguishing between tumour and non-tumour regions. This research combines Gradient-weighted Class Activation Mapping (GRAD-CAM), Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) to provide comprehensive insights into the model’s decision-making process. This multi-technique approach successfully identified both full and partial tumours, offering layered explanations ranging from broad regions of interest to pixel-level details. GRAD-CAM highlighted important spatial regions, LRP provided detailed pixel-level relevance and SHAP quantified feature contributions. The integrated approach effectively explained model predictions, including cases with partial tumour visibility thus showing superior explanatory power compared to individual XAI methods. This research enhances transparency and trust in AI-driven medical imaging analysis by offering a more comprehensive perspective on the model’s reasoning. The study demonstrates the potential of integrated XAI techniques in improving the reliability and interpretability of AI systems in healthcare, particularly for critical tasks like brain tumour detection.
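As a concrete illustration of one component of the combined framework, here is a minimal Grad-CAM sketch for a generic PyTorch CNN. The authors' custom BraTS-trained network, along with the LRP and SHAP components, is not reproduced; TinyCNN is a stand-in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Stand-in classifier; the paper's custom BraTS CNN is not reproduced."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        fmap = self.features(x)                    # [B, 16, H, W]
        logits = self.head(fmap.mean(dim=(2, 3)))  # global average pool
        return logits, fmap

def grad_cam(model, image, target_class):
    """Weight the last conv feature maps by the gradient of the class score."""
    model.eval()
    logits, fmap = model(image)
    fmap.retain_grad()
    logits[0, target_class].backward()
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)                # normalize to [0, 1]

if __name__ == "__main__":
    model = TinyCNN()
    scan = torch.randn(1, 1, 64, 64)               # dummy MRI slice
    heatmap = grad_cam(model, scan, target_class=1)
    print(heatmap.shape)                           # torch.Size([1, 1, 64, 64])
```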
[282] Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents
Xinyi He, Ying Yang, Chuanjian Fu, Sihan Guo, Songchun Zhu, Lifeng Fan, Zhenliang Zhang, Yujia Peng
Main category: cs.AI
TL;DR: TEA: A dynamic in-situ task generation method for evaluating embodied AI agents in unseen 3D environments through interaction-evolution cycles and graph-based task modeling.
Details
Motivation: Existing benchmarks for embodied AI agents suffer from data contamination and lack scene specificity, making them inadequate for evaluating agents in real-world unseen environments. There's a critical need for evaluation methods that can assess agent capabilities in diverse, previously unseen 3D settings before deployment in human households.Method: Proposes TEA (Task Generation for Embodied Agents) with two-stage interaction-evolution system: 1) Interaction stage where agent actively interacts with environment, creating loop between task execution and generation; 2) Evolution stage using task graph modeling to recombine and reuse existing tasks to generate new ones without external data. Tasks are defined through structured graph representation.
Result: In experiments across 10 unseen scenes, TEA automatically generated 87,876 tasks in two cycles, verified by humans as physically reasonable and covering essential daily cognitive capabilities. Benchmarking showed SOTA models perform poorly on basic perception tasks, severely lack 3D interaction awareness, and show high sensitivity to task types in reasoning, despite excelling on public benchmarks.
Conclusion: The sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments, as current models have significant limitations in 3D understanding and interaction that aren’t revealed by existing benchmarks.
Abstract: As general intelligent agents are poised for widespread deployment in diverse households, evaluation tailored to each unique unseen 3D environment has become a critical prerequisite. However, existing benchmarks suffer from severe data contamination and a lack of scene specificity, inadequate for assessing agent capabilities in unseen settings. To address this, we propose a dynamic in-situ task generation method for unseen environments inspired by human cognition. We define tasks through a structured graph representation and construct a two-stage interaction-evolution task generation system for embodied agents (TEA). In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation that allows for continuous task generation. In the evolution stage, task graph modeling allows us to recombine and reuse existing tasks to generate new ones without external data. Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles, which human verification confirmed to be physically reasonable and encompassing essential daily cognitive capabilities. Benchmarking SOTA models against humans on our in-situ tasks reveals that models, despite excelling on public benchmarks, perform surprisingly poorly on basic perception tasks, severely lack 3D interaction awareness and show high sensitivity to task types in reasoning. These sobering findings highlight the necessity of in-situ evaluation before deploying agents into real-world human environments.
[283] Beyond Cosine Similarity
Xinbo Ai
Main category: cs.AI
TL;DR: Recos is a new similarity metric that improves upon cosine similarity by using sorted vector components to capture nonlinear semantic relationships, outperforming cosine on STS benchmarks.
Details
Motivation: Cosine similarity is limited to capturing linear relationships due to its mathematical grounding in the Cauchy-Schwarz inequality, which fails to model the complex, nonlinear structures of real-world semantic spaces.Method: Derived a tighter upper bound for the dot product than the classical Cauchy-Schwarz bound, leading to recos - a similarity metric that normalizes the dot product by the sorted vector components, relaxing the condition for perfect similarity from strict linear dependence to ordinal concordance.
Result: Extensive experiments across 11 embedding models (static, contextualized, and universal) show recos consistently outperforms traditional cosine similarity, achieving higher correlation with human judgments on standard Semantic Textual Similarity (STS) benchmarks.
Conclusion: Recos is established as a mathematically principled and empirically superior alternative to cosine similarity, offering enhanced accuracy for semantic analysis in complex embedding spaces.
Abstract: Cosine similarity, the standard metric for measuring semantic similarity in vector spaces, is mathematically grounded in the Cauchy-Schwarz inequality, which inherently limits it to capturing linear relationships–a constraint that fails to model the complex, nonlinear structures of real-world semantic spaces. We advance this theoretical underpinning by deriving a tighter upper bound for the dot product than the classical Cauchy-Schwarz bound. This new bound leads directly to recos, a similarity metric that normalizes the dot product by the sorted vector components. recos relaxes the condition for perfect similarity from strict linear dependence to ordinal concordance, thereby capturing a broader class of relationships. Extensive experiments across 11 embedding models–spanning static, contextualized, and universal types–demonstrate that recos consistently outperforms traditional cosine similarity, achieving higher correlation with human judgments on standard Semantic Textual Similarity (STS) benchmarks. Our work establishes recos as a mathematically principled and empirically superior alternative, offering enhanced accuracy for semantic analysis in complex embedding spaces.
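The abstract says recos normalizes the dot product by the sorted vector components but does not spell out the formula here. A plausible reading, sketched below, divides by the dot product of the sorted vectors (the rearrangement-inequality bound), so the score reaches 1 under ordinal concordance rather than strict collinearity; the paper's exact normalization and edge-case handling may differ.

```python
import numpy as np

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def recos(x, y):
    """Hypothetical reading: normalize the dot product by the dot product of
    the sorted components (the rearrangement-inequality upper bound), so the
    score hits 1 whenever x and y are ordinally concordant."""
    bound = float(np.dot(np.sort(x), np.sort(y)))
    return float(np.dot(x, y)) / bound   # paper's edge-case handling may differ

if __name__ == "__main__":
    x = np.array([0.1, 0.5, 0.9])
    y = np.array([0.2, 0.6, 2.0])        # same ordering as x, not collinear
    print("cosine:", round(cosine(x, y), 3))   # below 1: penalizes non-linearity
    print("recos: ", round(recos(x, y), 3))    # exactly 1: ordinal concordance
```

On this example, cosine stays below 1 because the vectors are not collinear, while the sketched recos returns exactly 1 because the two vectors share the same component ordering.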
[284] Hallucination-Resistant Security Planning with a Large Language Model
Kim Hammar, Tansu Alpcan, Emil Lupu
Main category: cs.AI
TL;DR: Framework integrates LLMs in security management with consistency checking and feedback loops to control hallucinations and improve incident response planning.
Details
Motivation: LLMs show promise for security management tasks like incident response planning, but their unreliability and tendency to hallucinate remain significant challenges that need to be addressed.Method: Principled framework using LLMs in iterative loop: generates candidate actions, checks consistency with system constraints and lookahead predictions, collects external feedback (e.g., digital twin evaluation) when consistency is low, and refines actions through in-context learning.
Result: Framework reduces recovery times by up to 30% compared to frontier LLMs in incident response experiments on four public datasets; proven ability to control hallucination risk through consistency threshold tuning.
Conclusion: The framework provides a reliable approach to leverage LLMs for security management by controlling hallucinations through consistency checking and feedback loops, making LLMs more practical for real-world security applications.
Abstract: Large language models (LLMs) are promising tools for supporting security management tasks, such as incident response planning. However, their unreliability and tendency to hallucinate remain significant challenges. In this paper, we address these challenges by introducing a principled framework for using an LLM as decision support in security management. Our framework integrates the LLM in an iterative loop where it generates candidate actions that are checked for consistency with system constraints and lookahead predictions. When consistency is low, we abstain from the generated actions and instead collect external feedback, e.g., by evaluating actions in a digital twin. This feedback is then used to refine the candidate actions through in-context learning (ICL). We prove that this design allows the hallucination risk to be controlled by tuning the consistency threshold. Moreover, we establish a bound on the regret of ICL under certain assumptions. To evaluate our framework, we apply it to an incident response use case where the goal is to generate a response and recovery plan based on system logs. Experiments on four public datasets show that our framework reduces recovery times by up to 30% compared to frontier LLMs.
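A skeleton of the generate, consistency-check, abstain-and-collect-feedback, refine loop described above. The LLM call, the consistency score, and the digital-twin evaluator are stubbed with placeholders, and all names are illustrative rather than the authors' implementation.

```python
import random

def llm_propose(logs, examples):
    """Stub for the LLM call; returns a candidate recovery action."""
    return {"action": random.choice(["isolate_host", "reset_creds", "block_ip"])}

def consistency(action, constraints):
    """Stub consistency score in [0, 1] against system constraints and
    lookahead predictions; the paper defines its own statistic."""
    return random.random()

def digital_twin_feedback(action):
    """Stub external evaluation, e.g. replaying the action in a digital twin."""
    return f"effect of {action['action']}: recovery likely, no side effects"

def plan(logs, constraints, threshold=0.7, max_rounds=5):
    examples = []                      # in-context examples accumulated over rounds
    for _ in range(max_rounds):
        action = llm_propose(logs, examples)
        if consistency(action, constraints) >= threshold:
            return action              # consistent enough: accept
        # abstain, collect feedback, then refine via in-context learning
        examples.append((action, digital_twin_feedback(action)))
    return None                        # abstain entirely once the budget is spent

if __name__ == "__main__":
    random.seed(0)
    print(plan(logs=["suspicious login from 10.0.0.7"], constraints={}))
```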
[285] Position: Universal Time Series Foundation Models Rest on a Category Error
Xilin Dai, Wanxu Cai, Zhijian Xu, Qiang Xu
Main category: cs.AI
TL;DR: Time series foundation models are fundamentally flawed due to a category error: treating time series as a modality rather than a container for incompatible generative processes, leading to poor generalization under distributional drift.
Details
Motivation: The paper challenges the current paradigm of building universal foundation models for time series, arguing that this approach is fundamentally misguided because time series data comes from incompatible generative processes (e.g., finance vs. physics) that cannot be effectively captured by monolithic models.Method: Introduces theoretical analysis including the “Autoregressive Blindness Bound” proving limitations of history-only models, and proposes a “Causal Control Agent” paradigm where an agent uses external context to orchestrate specialized solvers (domain experts and lightweight adaptors).
Result: Theoretical demonstration that current universal time series models degenerate into expensive “Generic Filters” that fail under distributional drift, with proposed alternative framework for robust adaptation.
Conclusion: Advocates shifting from universal foundation models to causal control systems, and changing benchmarks from “Zero-Shot Accuracy” to “Drift Adaptation Speed” to prioritize robust, control-theoretic approaches.
Abstract: This position paper argues that the pursuit of “Universal Foundation Models for Time Series” rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive “Generic Filters” that fail to generalize under distributional drift. To address this, we introduce the “Autoregressive Blindness Bound,” a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from “Zero-Shot Accuracy” to “Drift Adaptation Speed” to prioritize robust, control-theoretic systems.
[286] Aspect-Aware MOOC Recommendation in a Heterogeneous Network
Seongyeub Chu, Jongwoo Kim, Mun Yong Yi
Main category: cs.AI
TL;DR: AMR is a novel MOOC recommendation framework that uses aspect-aware path representations from automatically discovered metapaths to improve recommendation accuracy over traditional graph-based methods.
Details
Motivation: Traditional MOOC recommendation methods (collaborative filtering, content-based filtering) suffer from data sparsity and over-specialization. Graph-based approaches help but rely heavily on manually predefined metapaths, which capture only superficial relationships and require significant domain expertise and engineering costs.Method: AMR automatically discovers metapaths through bi-directional walks, derives aspect-aware path representations using a bi-LSTM-based encoder, and incorporates these representations as edge features in learner-learner and KC-KC subgraphs for fine-grained semantically informed knowledge component recommendations.
Result: Extensive experiments on large-scale MOOCCube and PEEK datasets show AMR consistently outperforms state-of-the-art graph neural network baselines across key metrics (HR@K and nDCG@K). Analysis confirms AMR effectively captures rich path-specific aspect information for more accurate recommendations.
Conclusion: AMR overcomes limitations of traditional recommendation methods by automatically discovering metapaths and modeling path-specific multiple aspects, providing more accurate MOOC recommendations than methods relying solely on predefined metapaths.
Abstract: MOOC recommendation systems have received increasing attention to help learners navigate and select preferred learning content. Traditional methods such as collaborative filtering and content-based filtering suffer from data sparsity and over-specialization. To alleviate these limitations, graph-based approaches have been proposed; however, they still rely heavily on manually predefined metapaths, which often capture only superficial structural relationships and impose substantial burdens on domain experts as well as significant engineering costs. To overcome these limitations, we propose AMR (Aspect-aware MOOC Recommendation), a novel framework that models path-specific multiple aspects by embedding the semantic content of nodes within each metapath. AMR automatically discovers metapaths through bi-directional walks, derives aspect-aware path representations using a bi-LSTM-based encoder, and incorporates these representations as edge features in the learner-learner and KC-KC subgraphs to achieve fine-grained semantically informed KC recommendations. Extensive experiments on the large-scale MOOCCube and PEEK datasets show that AMR consistently outperforms state-of-the-art graph neural network baselines across key metrics such as HR@K and nDCG@K. Further analysis confirms that AMR effectively captures rich path-specific aspect information, allowing more accurate recommendations than those methods that rely solely on predefined metapaths. The code will be made available upon acceptance.
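A minimal sketch of the bi-LSTM metapath encoder, assuming each discovered metapath arrives as a sequence of node embeddings and that mean pooling produces the path representation; the dimensions and pooling choice are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode a metapath (sequence of node embeddings) into a single
    aspect-aware path representation, later usable as an edge feature."""
    def __init__(self, node_dim=64, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(node_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, path):              # path: [batch, path_len, node_dim]
        out, _ = self.lstm(path)          # [batch, path_len, 2 * hidden]
        return out.mean(dim=1)            # mean-pool over the path (assumed)

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = PathEncoder()
    # e.g. a learner -> course -> KC -> course -> learner metapath of length 5
    paths = torch.randn(8, 5, 64)
    edge_features = encoder(paths)
    print(edge_features.shape)            # torch.Size([8, 64])
```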
[287] PieArena: Frontier Language Agents Achieve MBA-Level Negotiation Performance and Reveal Novel Behavioral Differences
Chris Zhu, Sasha Cui, Will Sanok Dufallo, Runzhi Jin, Zhen Xu, Linjun Zhang, Daylian Cain
Main category: cs.AI
TL;DR: GPT-5 matches or outperforms business school students in realistic negotiation scenarios, showing AGI-level performance in strategic reasoning and economic value creation.
Details
Motivation: To evaluate LLMs' negotiation capabilities, which require complex skills such as strategic reasoning, theory of mind, and economic value creation, all central to business applications.Method: Introduces PieArena, a large-scale negotiation benchmark with realistic scenarios from an MBA negotiation course, comparing frontier LLMs (GPT-5) against trained business students.
Result: GPT-5 matches or outperforms business school students despite their semester of instruction and coaching. Agentic scaffolding helps mid-tier models but has diminishing returns for frontier models.
Conclusion: Frontier language agents are intellectually capable for high-stakes economic settings but still face robustness and trustworthiness challenges.
Abstract: We present an in-depth evaluation of LLMs’ ability to negotiate, a central business task that requires strategic reasoning, theory of mind, and economic value creation. To do so, we introduce PieArena, a large-scale negotiation benchmark grounded in multi-agent interactions over realistic scenarios drawn from an MBA negotiation course at an elite business school. We find systematic evidence of AGI-level performance in which a representative frontier agent (GPT-5) matches or outperforms trained business-school students, despite a semester of general negotiation instruction and targeted coaching immediately prior to the task. We further study the effects of joint-intentionality agentic scaffolding and find asymmetric gains, with large improvements for mid- and lower-tier LMs and diminishing returns for frontier LMs. Beyond deal outcomes, PieArena provides a multi-dimensional negotiation behavioral profile, revealing novel cross-model heterogeneity, masked by deal-outcome-only benchmarks, in deception, computation accuracy, instruction compliance, and perceived reputation. Overall, our results suggest that frontier language agents are already intellectually and psychologically capable of deployment in high-stakes economic settings, but deficiencies in robustness and trustworthiness remain open challenges.
[288] ProAct: Agentic Lookahead in Interactive Environments
Yangbin Yu, Mingyu Yang, Junyou Li, Yiming Gao, Feiyu Liu, Yijun Yang, Zichuan Lin, Jiafei Lyu, Yicheng Liu, Zhicong Lu, Deheng Ye, Jie Jiang
Main category: cs.AI
TL;DR: ProAct is a framework for LLM agents that improves long-horizon planning through grounded lookahead distillation and Monte-Carlo critic, achieving state-of-the-art performance in interactive environments.
Details
Motivation: Existing LLM agents struggle with long-horizon planning in interactive environments due to compounding errors when simulating future states, necessitating a more efficient approach to internalize accurate lookahead reasoning.Method: Two-stage training: 1) Grounded LookAhead Distillation (GLAD) - supervised fine-tuning on trajectories from environment-based search, compressing search trees into causal reasoning chains; 2) Monte-Carlo Critic (MC-Critic) - plug-and-play auxiliary value estimator using lightweight environment rollouts to calibrate value estimates for stable policy optimization.
Result: ProAct significantly improves planning accuracy in both stochastic (2048) and deterministic (Sokoban) environments. A 4B parameter model outperforms all open-source baselines and rivals state-of-the-art closed-source models, demonstrating robust generalization to unseen environments.
Conclusion: ProAct effectively addresses long-horizon planning challenges in LLM agents through efficient lookahead reasoning internalization and stable policy optimization, enabling competitive performance with relatively small models.
Abstract: Existing Large Language Model (LLM) agents struggle in interactive environments requiring long-horizon planning, primarily due to compounding errors when simulating future states. To address this, we propose ProAct, a framework that enables agents to internalize accurate lookahead reasoning through a two-stage training paradigm. First, we introduce Grounded LookAhead Distillation (GLAD), where the agent undergoes supervised fine-tuning on trajectories derived from environment-based search. By compressing complex search trees into concise, causal reasoning chains, the agent learns the logic of foresight without the computational overhead of inference-time search. Second, to further refine decision accuracy, we propose the Monte-Carlo Critic (MC-Critic), a plug-and-play auxiliary value estimator designed to enhance policy-gradient algorithms like PPO and GRPO. By leveraging lightweight environment rollouts to calibrate value estimates, MC-Critic provides a low-variance signal that facilitates stable policy optimization without relying on expensive model-based value approximation. Experiments on both stochastic (e.g., 2048) and deterministic (e.g., Sokoban) environments demonstrate that ProAct significantly improves planning accuracy. Notably, a 4B parameter model trained with ProAct outperforms all open-source baselines and rivals state-of-the-art closed-source models, while demonstrating robust generalization to unseen environments. The codes and models are available at https://github.com/GreatX3/ProAct
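A toy sketch of the MC-Critic idea: estimate a state's value from a few cheap environment rollouts and use it to calibrate a learned value estimate. The environment, the rollout policy, and the blending rule below are assumptions for illustration only.

```python
import random

def rollout_return(env_step, state, horizon=10, gamma=0.99):
    """One cheap rollout with a random policy; returns the discounted return."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward, done = env_step(state, random.choice([0, 1, 2, 3]))
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def mc_critic_value(env_step, state, learned_value, n_rollouts=4, mix=0.5):
    """Blend the learned critic with a Monte-Carlo estimate from rollouts.

    `mix` (assumed) controls how much the rollout estimate calibrates the
    learned value; the paper's exact combination rule may differ.
    """
    mc = sum(rollout_return(env_step, state) for _ in range(n_rollouts)) / n_rollouts
    return mix * mc + (1.0 - mix) * learned_value(state)

if __name__ == "__main__":
    random.seed(0)
    # Toy environment: state is an integer, reward 1 once it reaches 5.
    def env_step(state, action):
        nxt = state + (1 if action % 2 == 0 else -1)
        return nxt, float(nxt >= 5), nxt >= 5
    print(mc_critic_value(env_step, state=0, learned_value=lambda s: 0.3))
```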
[289] AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
Ruijie Shi, Houbin Zhang, Yuecheng Han, Yuheng Wang, Jingru Fan, Runde Yang, Yufan Dang, Huatao Li, Dewen Liu, Yuan Cheng, Chen Qian
Main category: cs.AI
TL;DR: AgentXRay: A search-based framework that reconstructs interpretable agent workflows from black-box systems using only input-output access, employing Monte Carlo Tree Search with pruning for efficient exploration.
Details
Motivation: Large Language Model agentic systems are often black boxes with opaque internal workflows, making them difficult to interpret and control. Existing frameworks lack methods to synthesize explicit, interpretable stand-in workflows from deployed systems without access to internal parameters.Method: Proposes the Agentic Workflow Reconstruction (AWR) task and the AgentXRay framework, which formulates workflow reconstruction as combinatorial optimization over discrete agent roles and tool invocations. Uses Monte Carlo Tree Search enhanced by a Red-Black Pruning mechanism to navigate the vast search space efficiently.
Result: AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets across diverse domains.
Conclusion: AgentXRay successfully reconstructs interpretable, editable white-box workflows from black-box agentic systems using only input-output access, providing better interpretability and control without model parameter access.
Abstract: Large Language Models have shown strong capabilities in complex problem solving, yet many agentic systems remain difficult to interpret and control due to opaque internal workflows. While some frameworks offer explicit architectures for collaboration, many deployed agentic systems operate as black boxes to users. We address this by introducing Agentic Workflow Reconstruction (AWR), a new task aiming to synthesize an explicit, interpretable stand-in workflow that approximates a black-box system using only input–output access. We propose AgentXRay, a search-based framework that formulates AWR as a combinatorial optimization problem over discrete agent roles and tool invocations in a chain-structured workflow space. Unlike model distillation, AgentXRay produces editable white-box workflows that match target outputs under an observable, output-based proxy metric, without accessing model parameters. To navigate the vast search space, AgentXRay employs Monte Carlo Tree Search enhanced by a scoring-based Red-Black Pruning mechanism, which dynamically integrates proxy quality with search depth. Experiments across diverse domains demonstrate that AgentXRay achieves higher proxy similarity and reduces token consumption compared to unpruned search, enabling deeper workflow exploration under fixed iteration budgets.
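The summary says the Red-Black Pruning score "dynamically integrates proxy quality with search depth" without giving the formula, so the sketch below guesses one simple form of that scoring and the resulting keep/prune coloring; the surrounding MCTS loop is not reproduced, and every name and weight here is an assumption.

```python
def prune_score(proxy_similarity, depth, max_depth, depth_weight=0.3):
    """Assumed score mixing output-similarity quality with how deep the
    partial workflow already is; higher means more worth expanding."""
    depth_bonus = depth_weight * (depth / max_depth)
    return (1.0 - depth_weight) * proxy_similarity + depth_bonus

def color_children(children, threshold=0.5, max_depth=6):
    """Mark children 'black' (keep expanding) or 'red' (prune)."""
    colored = {}
    for name, (sim, depth) in children.items():
        score = prune_score(sim, depth, max_depth)
        colored[name] = "black" if score >= threshold else "red"
    return colored

if __name__ == "__main__":
    # (proxy similarity to the black-box outputs, current workflow depth)
    children = {"planner->coder": (0.72, 2),
                "planner->search_tool": (0.31, 2),
                "critic->coder->tester": (0.58, 3)}
    print(color_children(children))
```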
[290] PATHWAYS: Evaluating Investigation and Context Discovery in AI Web Agents
Shifat E. Arman, Syed Nazmus Sakib, Tapodhir Karmakar Taton, Nafiul Haque, Shahrear Bin Amin
Main category: cs.AI
TL;DR: PATHWAYS benchmark tests web agents’ ability to discover and use hidden contextual information in multi-step decision tasks, revealing significant limitations in current architectures.
Details
Motivation: To evaluate whether current web-based agents can effectively discover and utilize hidden contextual information that's not immediately apparent, testing their adaptive investigation and evidence integration capabilities.Method: Created a benchmark of 250 multi-step decision tasks requiring agents to navigate web pages and discover hidden contextual evidence, testing both closed and open models on their ability to find and correctly use decisive hidden information.
Result: Agents typically navigate to relevant pages but retrieve decisive hidden evidence in only a small fraction of cases. Performance drops sharply to near chance when tasks require overturning misleading surface-level signals. Agents frequently hallucinate reasoning and fail to integrate discovered context into final decisions.
Conclusion: Current web agent architectures lack reliable mechanisms for adaptive investigation, evidence integration, and judgement override, revealing fundamental limitations in how they process and utilize contextual information.
Abstract: We introduce PATHWAYS, a benchmark of 250 multi-step decision tasks that test whether web-based agents can discover and correctly use hidden contextual information. Across both closed and open models, agents typically navigate to relevant pages but retrieve decisive hidden evidence in only a small fraction of cases. When tasks require overturning misleading surface-level signals, performance drops sharply to near chance accuracy. Agents frequently hallucinate investigative reasoning by claiming to rely on evidence they never accessed. Even when correct context is discovered, agents often fail to integrate it into their final decision. Providing more explicit instructions improves context discovery but often reduces overall accuracy, revealing a tradeoff between procedural compliance and effective judgement. Together, these results show that current web agent architectures lack reliable mechanisms for adaptive investigation, evidence integration, and judgement override.
[291] RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs
Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim
Main category: cs.AI
TL;DR: RaBiT is a novel quantization framework that resolves feature co-adaptation in residual binarization by enforcing a residual hierarchy, achieving state-of-the-art 2-bit performance with 4.49× inference speed-up.
Details
Motivation: Extreme quantization of large language models creates a trade-off between low-bit efficiency and performance. Residual binarization enables matmul-free inference but suffers from pathological feature co-adaptation where parallel residual binary paths learn redundant features, degrading error compensation and limiting model capacity.Method: RaBiT proposes a quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. The core mechanism sequentially derives each binary path from a single shared full-precision weight, ensuring each path corrects the error of the preceding one. This is stabilized by robust initialization prioritizing functional preservation over weight approximation.
Result: RaBiT redefines the 2-bit accuracy-efficiency frontier, achieving state-of-the-art performance, rivals hardware-intensive Vector Quantization methods, and delivers 4.49× inference speed-up over full-precision models on an RTX 4090.
Conclusion: The RaBiT framework successfully addresses the co-adaptation problem in residual binarization through algorithmic residual hierarchy enforcement, enabling efficient extreme quantization of LLMs without sacrificing performance.
Abstract: Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090.
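A minimal sketch of plain residual binarization in which every binary path is derived sequentially from the residual of one shared full-precision weight, so each path corrects the error left by the previous ones, matching the mechanism summarized above; RaBiT's QAT objective and robust initialization are not reproduced.

```python
import numpy as np

def residual_binarize(W, n_paths=2):
    """Sequentially derive binary paths from one shared full-precision weight.

    Each path binarizes the residual left by the previous paths, so path k
    corrects the reconstruction error of paths 1..k-1.
    """
    residual = W.copy()
    paths = []
    for _ in range(n_paths):
        alpha = np.abs(residual).mean()          # per-path scale
        B = np.sign(residual)
        B[B == 0] = 1.0                          # keep entries strictly +/-1
        paths.append((alpha, B))
        residual = residual - alpha * B          # what the next path must fix
    return paths

def reconstruct(paths):
    return sum(alpha * B for alpha, B in paths)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(128, 128))
    for k in (1, 2, 3):
        W_hat = reconstruct(residual_binarize(W, n_paths=k))
        err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
        print(f"{k} path(s): relative error {err:.3f}")   # error shrinks with k
```

Running it shows the relative reconstruction error shrinking as paths are added, which is exactly the error-compensation structure that inter-path co-adaptation would otherwise degrade.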
[292] Clinical Validation of Medical-based Large Language Model Chatbots on Ophthalmic Patient Queries with LLM-based Evaluation
Ting Fang Tan, Kabilan Elangovan, Andreas Pollreisz, Kevin Bryan Dy, Wei Yan Ng, Joy Le Yi Wong, Jin Liyuan, Chrystie Quek Wan Ning, Ashley Shuen Ying Hong, Arun James Thirunavukarasu, Shelley Yin-His Chang, Jie Yao, Dylan Hong, Wang Zhaoran, Amrita Gupta, Daniel SW Ting
Main category: cs.AI
TL;DR: Evaluation of four small medical LLMs (7-8B parameters) for ophthalmology patient queries shows Meerkat-7B performs best, while GPT-4-Turbo demonstrates strong alignment with clinician grading for evaluation.
Details
Motivation: As domain-specific LLMs are increasingly used in ophthalmology for patient education and clinical decision support, rigorous evaluation is needed to ensure safety and accuracy, especially for resource-efficient small models suitable for deployment.Method: Four medical LLMs under 10B parameters answered 180 ophthalmology patient queries, generating 2160 responses. Three ophthalmologists of different seniority and GPT-4-Turbo evaluated responses using the S.C.O.R.E. framework (Safety, Consensus, Objectivity, Reproducibility, Explainability) on a 5-point Likert scale. Agreement was analyzed using Spearman correlation, Kendall tau, and kernel density estimates.
Result: Meerkat-7B performed best with mean scores of 3.44-4.18 across clinician levels. MedLLaMA3-v20 performed worst with 25.5% of responses containing hallucinations or misleading content. GPT-4-Turbo grading strongly aligned with clinicians (Spearman rho=0.80, Kendall tau=0.67), though senior consultants graded more conservatively.
Conclusion: Medical LLMs show potential for safe ophthalmology question answering but have gaps in clinical depth and consensus. LLM-based evaluation is feasible for large-scale benchmarking, but hybrid automated-clinician review frameworks are needed for safe clinical deployment.
Abstract: Domain specific large language models are increasingly used to support patient education, triage, and clinical decision making in ophthalmology, making rigorous evaluation essential to ensure safety and accuracy. This study evaluated four small medical LLMs Meerkat-7B, BioMistral-7B, OpenBioLLM-8B, and MedLLaMA3-v20 in answering ophthalmology related patient queries and assessed the feasibility of LLM based evaluation against clinician grading. In this cross sectional study, 180 ophthalmology patient queries were answered by each model, generating 2160 responses. Models were selected for parameter sizes under 10 billion to enable resource efficient deployment. Responses were evaluated by three ophthalmologists of differing seniority and by GPT-4-Turbo using the S.C.O.R.E. framework assessing safety, consensus and context, objectivity, reproducibility, and explainability, with ratings assigned on a five point Likert scale. Agreement between LLM and clinician grading was assessed using Spearman rank correlation, Kendall tau statistics, and kernel density estimate analyses. Meerkat-7B achieved the highest performance with mean scores of 3.44 from Senior Consultants, 4.08 from Consultants, and 4.18 from Residents. MedLLaMA3-v20 performed poorest, with 25.5 percent of responses containing hallucinations or clinically misleading content, including fabricated terminology. GPT-4-Turbo grading showed strong alignment with clinician assessments overall, with Spearman rho of 0.80 and Kendall tau of 0.67, though Senior Consultants graded more conservatively. Overall, medical LLMs demonstrated potential for safe ophthalmic question answering, but gaps remained in clinical depth and consensus, supporting the feasibility of LLM based evaluation for large scale benchmarking and the need for hybrid automated and clinician review frameworks to guide safe clinical deployment.
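A minimal sketch of the agreement analysis between LLM and clinician Likert grades using SciPy's Spearman and Kendall statistics; the grades below are synthetic stand-ins, not the study's data.

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic 5-point Likert grades for 180 responses (illustrative only).
    clinician = rng.integers(1, 6, size=180)
    noise = rng.integers(-1, 2, size=180)
    llm_judge = np.clip(clinician + noise, 1, 5)   # correlated with clinicians

    rho, rho_p = spearmanr(clinician, llm_judge)
    tau, tau_p = kendalltau(clinician, llm_judge)
    print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3g})")
    print(f"Kendall tau  = {tau:.2f} (p = {tau_p:.3g})")
```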
[293] Advancing Opinion Dynamics Modeling with Neural Diffusion-Convection-Reaction Equation
Chenghua Gong, Yihang Jiang, Hao Li, Rui Sun, Juyuan Zhang, Tianjun Gu, Liming Pan, Linyuan Lü
Main category: cs.AI
TL;DR: OPINN: A physics-informed neural framework using Diffusion-Convection-Reaction system and Neural ODEs for opinion dynamics modeling, achieving SOTA performance in opinion evolution forecasting.
Details
Motivation: Existing opinion dynamics methods lack comprehensive physical systems and struggle to deeply encode physical priors, leading to optimization issues and poor transparency. Need to synergize mechanistic interpretability with data-driven flexibility for better social behavior understanding.Method: Proposes OPINN framework that interprets opinion dynamics via Diffusion-Convection-Reaction (DCR) system inspired by interacting particle theory. Uses Neural ODEs to coordinate neural networks with physical priors, creating a physics-informed neural framework.
Result: Achieves state-of-the-art performance in opinion evolution forecasting on both real-world and synthetic datasets.
Conclusion: OPINN offers a promising paradigm for integrating cyber, physical, and social systems, providing better interpretability and forecasting accuracy for opinion dynamics.
Abstract: Advanced opinion dynamics modeling is vital for deciphering social behavior, emphasizing its role in mitigating polarization and securing cyberspace. To synergize mechanistic interpretability with data-driven flexibility, recent studies have explored the integration of Physics-Informed Neural Networks (PINNs) for opinion modeling. Despite this promise, existing methods are tailored to incomplete priors, lacking a comprehensive physical system to integrate dynamics from local, global, and endogenous levels. Moreover, penalty-based constraints adopted in existing methods struggle to deeply encode physical priors, leading to optimization pathologies and discrepancy between latent representations and physical transparency. To this end, we offer a physical view to interpret opinion dynamics via Diffusion-Convection-Reaction (DCR) system inspired by interacting particle theory. Building upon the Neural ODEs, we define the neural opinion dynamics to coordinate neural networks with physical priors, and further present the OPINN, a physics-informed neural framework for opinion dynamics modeling. Evaluated on real-world and synthetic datasets, OPINN achieves state-of-the-art performance in opinion evolution forecasting, offering a promising paradigm for the nexus of cyber, physical, and social systems.
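For intuition about the physical system OPINN builds on, here is a toy explicit-Euler diffusion-convection-reaction step on a 1-D opinion field. The coefficients and the reaction term are arbitrary choices for illustration; the neural-ODE parameterization that OPINN actually learns is not reproduced.

```python
import numpy as np

def dcr_step(u, dt=0.01, dx=1.0, D=0.5, c=0.2, r=0.1):
    """One explicit Euler step of du/dt = D u_xx - c u_x + r u (1 - u^2).

    Diffusion smooths local opinion differences, convection drifts opinions,
    and the reaction term pushes opinions toward the poles +/-1 (assumed form).
    """
    u_xx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2   # periodic boundary
    u_x = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    return u + dt * (D * u_xx - c * u_x + r * u * (1 - u**2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    opinions = rng.uniform(-1, 1, size=100)       # initial opinion field
    for _ in range(500):
        opinions = dcr_step(opinions)
    print("mean opinion:", round(float(opinions.mean()), 3))
```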
[294] H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration
Jun-Min Lee, Meong Hi Son, Edward Choi
Main category: cs.AI
TL;DR: H-AdminSim: A comprehensive simulation framework for evaluating LLM-based automation in hospital administrative workflows using multi-agent simulation and FHIR integration
Details
Motivation: Hospital administration handles over 10,000 requests daily, creating interest in LLM-based automation, but prior work has focused only on patient-physician interactions or isolated subtasks, failing to capture the complexity of real administrative workflowsMethod: Proposes H-AdminSim, an end-to-end simulation framework combining realistic data generation with multi-agent-based simulation of hospital administrative workflows, evaluated using detailed rubrics and integrated with FHIR for interoperability
Result: Provides a unified, interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing LLM-driven administrative automation feasibility and performance
Conclusion: H-AdminSim addresses the gap in comprehensive evaluation of LLM-based automation for complex hospital administrative workflows through systematic simulation and standardized testing
Abstract: Hospital administration departments handle a wide range of operational tasks and, in large hospitals, process over 10,000 requests per day, driving growing interest in LLM-based automation. However, prior work has focused primarily on patient–physician interactions or isolated administrative subtasks, failing to capture the complexity of real administrative workflows. To address this gap, we propose H-AdminSim, a comprehensive end-to-end simulation framework that combines realistic data generation with multi-agent-based simulation of hospital administrative workflows. These tasks are quantitatively evaluated using detailed rubrics, enabling systematic comparison of LLMs. Through FHIR integration, H-AdminSim provides a unified and interoperable environment for testing administrative workflows across heterogeneous hospital settings, serving as a standardized testbed for assessing the feasibility and performance of LLM-driven administrative automation.
[295] THOR: Inductive Link Prediction over Hyper-Relational Knowledge Graphs
Weijian Yu, Yuhuan Lu, Dingqi Yang
Main category: cs.AI
TL;DR: THOR is an inductive link prediction technique for hyper-relational knowledge graphs that learns transferable structural patterns across different KGs, enabling predictions on unseen vocabularies.
Details
Motivation: Existing hyper-relational KG link prediction methods are mostly transductive, limited to specific vocabularies and lacking generalizability to unseen entities/relations. There's a need for inductive techniques that can transfer learned patterns across different KGs.Method: Proposes THOR with two key components: 1) Relation and entity foundation graphs modeling fundamental inter- and intra-fact interactions agnostic to specific relations/entities, 2) Two parallel graph encoders followed by a transformer decoder for masked training and fully-inductive inference.
Result: Outperforms baselines on 12 datasets with 66.1%, 55.9%, and 20.4% improvement over best-performing rule-based, semi-inductive, and fully-inductive techniques respectively. Ablation studies confirm design factors capture structural invariance transferable across HKGs.
Conclusion: THOR effectively addresses inductive link prediction in hyper-relational KGs by learning transferable structural patterns, demonstrating strong performance across diverse datasets and settings.
Abstract: Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts where a relation connects two entities, modern KGs observe an increasing number of hyper-relational facts, where an arbitrary number of qualifiers associated with a triplet provide auxiliary information to further describe the rich semantics of the triplet, which can effectively boost the reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and subsequently can only make predictions within the same vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce both relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR is designed to learn from the two foundation graphs with two parallel graph encoders followed by a transformer decoder, which supports efficient masked training and fully-inductive inference. We conduct a thorough evaluation of THOR in hyper-relational link prediction tasks on 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvement over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.
[296] M$^2$-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining
Rui Lv, Juncheng Mo, Tianyi Chu, Chen Rao, Hongyi Jing, Jiajie Teng, Jiafu Chen, Shiqi Zhang, Liangzi Ding, Shuo Fang, Huaizhong Lin, Ziqiang Dang, Chenguang Ma, Lei Zhao
Main category: cs.AI
TL;DR: M²-Miner: A low-cost automated mobile GUI agent data-mining framework using Monte Carlo Tree Search with multi-agent collaboration and intent recycling strategies to generate high-quality user-behavior trajectory data for training GUI agents.
Details
Motivation: Current GUI agent data collection faces three critical challenges: high construction cost, poor data quality, and low data richness. Manual annotation is expensive and existing data mining approaches are inefficient, necessitating a better solution for large-scale, high-quality GUI agent training data.Method: Proposes M²-Miner framework based on Monte Carlo Tree Search with three collaborative agents: InferAgent (guidance), OrchestraAgent (acceleration), and JudgeAgent (evaluation). Includes intent recycling strategy to extract extra interaction trajectories and progressive model-in-the-loop training to improve mining success rate.
Result: GUI agents fine-tuned using the mined data achieve state-of-the-art performance on several commonly used mobile GUI benchmarks, demonstrating the effectiveness of the automated data mining approach.
Conclusion: M²-Miner provides a low-cost, automated solution for generating high-quality GUI agent training data, addressing key challenges in data construction and enabling better GUI agent performance through improved training datasets.
Abstract: Graphical User Interface (GUI) agent is pivotal to advancing intelligent human-computer interaction paradigms. Constructing powerful GUI agents necessitates the large-scale annotation of high-quality user-behavior trajectory data (i.e., intent-trajectory pairs) for training. However, manual annotation methods and current GUI agent data mining approaches typically face three critical challenges: high construction cost, poor data quality, and low data richness. To address these issues, we propose M$^2$-Miner, the first low-cost and automated mobile GUI agent data-mining framework based on Monte Carlo Tree Search (MCTS). For better data mining efficiency and quality, we present a collaborative multi-agent framework, comprising InferAgent, OrchestraAgent, and JudgeAgent for guidance, acceleration, and evaluation. To further enhance the efficiency of mining and enrich intent diversity, we design an intent recycling strategy to extract extra valuable interaction trajectories. Additionally, a progressive model-in-the-loop training strategy is introduced to improve the success rate of data mining. Extensive experiments have demonstrated that the GUI agent fine-tuned using our mined data achieves state-of-the-art performance on several commonly used mobile GUI benchmarks. Our work will be released to facilitate the community research.
[297] Day-Ahead Electricity Price Forecasting for Volatile Markets Using Foundation Models with Regularization Strategy
Kritchanat Ponyuenyong, Pengyu Tu, Jia Wei Tan, Wei Soon Cheong, Jamie Ng Suat Ling, Lianlian Jiang
Main category: cs.AI
TL;DR: Time series foundation models (TSFMs) outperform traditional statistical and deep learning methods for electricity price forecasting in volatile markets, achieving up to 37.4% MAPE improvement with spike regularization.
Details
Motivation: Electricity price forecasting is challenging due to volatility and nonlinearity, and while time series foundation models show promise in general forecasting tasks, their effectiveness for day-ahead electricity price forecasting in volatile markets remains underexplored.Method: Proposes spike regularization strategy and evaluates various TSFMs (Tiny Time Mixers, MOIRAI, MOMENT, TimesFM) against traditional models (ARIMA, LSTM, CNN-LSTM) using half-hourly wholesale market data from Singapore with volatile trends, incorporating exogenous factors like weather and calendar variables.
Result: TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in Mean Absolute Percentage Error (MAPE) across various evaluation settings.
Conclusion: Time series foundation models offer practical guidance for improving forecast accuracy and decision-making in volatile electricity markets, demonstrating superior performance over traditional methods.
Abstract: Electricity price forecasting (EPF) is essential for energy market stakeholders (e.g. grid operators, energy traders, policymakers) but remains challenging due to the inherent volatility and nonlinearity of price signals. Traditional statistical and deep learning (DL) models often struggle to capture complex temporal dependencies and integrate heterogeneous data effectively. While time series foundation models (TSFMs) have shown strong performance in general time series forecasting tasks, such as traffic forecasting and weather forecasting, their effectiveness in day-ahead EPF, particularly in volatile markets, remains underexplored. This paper presents a spike regularization strategy and evaluates a wide range of TSFMs, including Tiny Time Mixers (TTMs), MOIRAI, MOMENT, and TimesFM, against traditional statistical and DL models such as Autoregressive Integrated Moving Average (ARIMA), Long-short Term Memory (LSTM), and Convolutional Neural Network - LSTM (CNN-LSTM) using half-hourly wholesale market data with volatile trends in Singapore. Exogenous factors (e.g. weather and calendar variables) are also incorporated into models where applicable. Results demonstrate that TSFMs consistently outperform traditional approaches, achieving up to 37.4% improvement in MAPE across various evaluation settings. The findings offer practical guidance for improving forecast accuracy and decision-making in volatile electricity markets.
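The spike regularization strategy is not spelled out in this summary. One common form, sketched below, simply up-weights forecast errors on observations above a price-spike quantile; the threshold and weight are assumptions, and the paper's actual strategy may differ.

```python
import numpy as np

def spike_weighted_mae(y_true, y_pred, spike_quantile=0.95, spike_weight=3.0):
    """MAE with extra weight on observations above a spike threshold.

    One plausible form of 'spike regularization'; the paper's actual
    strategy (threshold, weighting, or penalty term) may differ.
    """
    threshold = np.quantile(y_true, spike_quantile)
    weights = np.where(y_true >= threshold, spike_weight, 1.0)
    return float(np.mean(weights * np.abs(y_true - y_pred)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prices = rng.gamma(shape=2.0, scale=50.0, size=48)   # half-hourly prices
    forecast = prices + rng.normal(0, 10, size=48)
    print("plain MAE:", round(float(np.mean(np.abs(prices - forecast))), 2))
    print("spike-weighted MAE:", round(spike_weighted_mae(prices, forecast), 2))
```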
[298] Refine and Purify: Orthogonal Basis Optimization with Null-Space Denoising for Conditional Representation Learning
Jiaquan Wang, Yan Lyu, Chen Li, Yuheng Jia
Main category: cs.AI
TL;DR: OD-CRL improves conditional representation learning by optimizing orthogonal semantic bases and using null-space projection to reduce interference between different semantic subspaces.
Details
Motivation: Existing conditional representation learning methods that project features onto LLM-generated text subspaces suffer from sensitivity to subspace basis selection and vulnerability to interference between different semantic subspaces, limiting their effectiveness.Method: Proposes OD-CRL with two key components: Adaptive Orthogonal Basis Optimization (AOBO) that constructs orthogonal semantic bases via SVD with curvature-based truncation, and Null-Space Denoising Projection (NSDP) that suppresses non-target semantic interference by projecting embeddings onto the null space of irrelevant subspaces.
Result: Extensive experiments across customized clustering, classification, and retrieval tasks demonstrate state-of-the-art performance with superior generalization capabilities.
Conclusion: OD-CRL effectively addresses basis sensitivity and subspace interference issues in conditional representation learning, achieving improved performance across diverse customized tasks.
Abstract: Conditional representation learning aims to extract criterion-specific features for customized tasks. Recent studies project universal features onto the conditional feature subspace spanned by an LLM-generated text basis to obtain conditional representations. However, such methods face two key limitations: sensitivity to subspace basis and vulnerability to inter-subspace interference. To address these challenges, we propose OD-CRL, a novel framework integrating Adaptive Orthogonal Basis Optimization (AOBO) and Null-Space Denoising Projection (NSDP). Specifically, AOBO constructs orthogonal semantic bases via singular value decomposition with a curvature-based truncation. NSDP suppresses non-target semantic interference by projecting embeddings onto the null space of irrelevant subspaces. Extensive experiments conducted across customized clustering, customized classification, and customized retrieval tasks demonstrate that OD-CRL achieves a new state-of-the-art performance with superior generalization.
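A minimal sketch of the two ingredients described above: an SVD-derived orthogonal basis for a criterion's text embeddings (with a fixed truncation rank standing in for the curvature-based rule), and a null-space-style projection that removes components lying in an irrelevant subspace before expressing the feature in the target one. All names and dimensions are illustrative.

```python
import numpy as np

def orthogonal_basis(text_embeddings, rank=4):
    """Orthogonal semantic basis via SVD. The paper picks the rank with a
    curvature-based truncation; a fixed rank is used here for simplicity."""
    # rows: LLM-generated descriptions for one criterion, columns: embedding dims
    _, _, vt = np.linalg.svd(text_embeddings, full_matrices=False)
    return vt[:rank]                       # [rank, dim], orthonormal rows

def denoise_then_project(x, target_basis, irrelevant_basis):
    """Remove the part of x lying in the irrelevant subspace (null-space style
    denoising), then express what remains in the target subspace."""
    x_clean = x - irrelevant_basis.T @ (irrelevant_basis @ x)
    return target_basis @ x_clean          # conditional representation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 32
    color_texts = rng.normal(size=(10, dim))     # e.g. "color" criterion texts
    shape_texts = rng.normal(size=(10, dim))     # e.g. "shape" criterion texts
    color_basis = orthogonal_basis(color_texts)
    shape_basis = orthogonal_basis(shape_texts)
    image_embedding = rng.normal(size=dim)
    z = denoise_then_project(image_embedding, color_basis, shape_basis)
    print(z.shape)                                # (4,)
```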
[299] ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation
Yiwen Duan, Jing Ye, Xinpei Zhao
Main category: cs.AI
TL;DR: ALIVE is a framework that replaces traditional scalar reward RL with adversarial learning and instructive verbal feedback to help LLMs internalize reasoning principles without human supervision.
Details
Motivation: Traditional RL for LLMs suffers from a "reward bottleneck": scalar rewards are costly to scale, brittle across domains, and blind to solution logic, preventing models from developing a deep understanding of reasoning.Method: ALIVE uses adversarial learning with instructive verbal evaluation, unifying problem posing, solving, and judging within a single policy model based on the Cognitive Synergy principle to internalize the logic of correctness.
Result: ALIVE achieves accuracy gains, improved cross-domain generalization, and higher self-correction rates across mathematical reasoning, code generation, and logical inference benchmarks with identical data/compute.
Conclusion: ALIVE provides a scalable foundation for general-purpose reasoning alignment without human supervision by fostering self-sustaining capability growth through internalized reasoning faculties.
Abstract: The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent \textit{reward bottleneck}: traditional reinforcement learning (RL) relies on scalar rewards that are \textbf{costly} to scale, \textbf{brittle} across domains, and \textbf{blind} to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce \textbf{ALIVE} (\emph{Adversarial Learning with Instructive Verbal Evaluation}), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of \emph{Cognitive Synergy}, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.
[300] Phi-Former: A Pairwise Hierarchical Approach for Compound-Protein Interactions Prediction
Zhe Wang, Zijing Liu, Chencheng Xu, Yuan Yao
Main category: cs.AI
TL;DR: Phi-former is a hierarchical interaction representation learning method for predicting compound-protein interactions by modeling interactions at atom-atom, motif-motif, and atom-motif levels to better align with biological recognition patterns.
Details
Motivation: Current deep learning methods for compound-protein interaction prediction operate at atomic level but don't align well with chemical realities where molecular fragments/motifs are the primary units of biological recognition and binding. There's a need for models that incorporate the biological role of motifs in drug-target interactions.Method: Proposes Phi-former, a pairwise hierarchical interaction representation learning method that represents compounds and proteins hierarchically and employs pairwise pre-training to model interactions systematically across three levels: atom-atom, motif-motif, and atom-motif interactions. Uses intra-level and inter-level learning pipelines that make different interaction levels mutually beneficial.
Result: Phi-former achieves superior performance on CPI-related tasks compared to existing methods. A case study shows the method accurately identifies specific atoms or motifs activated in CPIs, providing interpretable model explanations.
Conclusion: The hierarchical approach that incorporates motif-level interactions better reflects biological recognition patterns and provides interpretable insights that can guide rational drug design and support precision medicine applications.
Abstract: Drug discovery remains time-consuming, labor-intensive, and expensive, often requiring years and substantial investment per drug candidate. Predicting compound-protein interactions (CPIs) is a critical component in this process, enabling the identification of molecular interactions between drug candidates and target proteins. Recent deep learning methods have successfully modeled CPIs at the atomic level, achieving improved efficiency and accuracy over traditional energy-based approaches. However, these models do not always align with chemical realities, as molecular fragments (motifs or functional groups) typically serve as the primary units of biological recognition and binding. In this paper, we propose Phi-former, a pairwise hierarchical interaction representation learning method that addresses this gap by incorporating the biological role of motifs in CPIs. Phi-former represents compounds and proteins hierarchically and employs a pairwise pre-training framework to model interactions systematically across atom-atom, motif-motif, and atom-motif levels, reflecting how biological systems recognize molecular partners. We design intra-level and inter-level learning pipelines that make different interaction levels mutually beneficial. Experimental results demonstrate that Phi-former achieves superior performance on CPI-related tasks. A case study shows that our method accurately identifies specific atoms or motifs activated in CPIs, providing interpretable model explanations. These insights may guide rational drug design and support precision medicine applications.
[301] SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
Main category: cs.AI
TL;DR: SDFP: A training-free speculative decoding framework that accelerates LLM inference via Fisher Information Trace-based layer pruning to create draft models, achieving 1.32x-1.5x speedup for multimedia applications.
Details
Motivation: LLMs enable multimedia applications but suffer from high latency due to autoregressive decoding. Existing speculative decoding methods require costly draft model training/maintenance, limiting deployment. Need a plug-and-play solution without training overhead.
Method: Uses Fisher Information Trace (FIT) to measure layer sensitivity, prunes low-impact layers from the original LLM to create a compact draft model. Draft preserves compatibility with original model for speculative verification. No training, hyperparameter tuning, or separate model maintenance needed.
Result: Achieves 1.32x-1.5x decoding speedup across benchmarks without altering target model’s output distribution. Enables low-latency multimedia applications while maintaining output quality.
Conclusion: SDFP provides a practical, deployment-friendly speculative decoding solution that eliminates draft model training overhead while delivering significant speed improvements for LLM-based multimedia applications.
Abstract: Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model’s output distribution, supporting low-latency multimedia applications.
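For readers who want the gist of the pruning step, here is a minimal PyTorch-style sketch of ranking layers by an empirical Fisher Information Trace and keeping only the top-scoring ones as the draft model. The `model.layers` attribute, the calibration loader, and the `keep_ratio` value are illustrative assumptions, not the authors' implementation.

    import copy
    import torch

    def layer_fit_scores(model, calib_loader, loss_fn):
        # Empirical Fisher Information Trace per layer: accumulate the squared
        # gradients of the loss over a small calibration set.
        scores = [0.0] * len(model.layers)
        for batch in calib_loader:
            model.zero_grad()
            loss = loss_fn(model(batch["input_ids"]), batch["labels"])
            loss.backward()
            for i, layer in enumerate(model.layers):
                scores[i] += sum(float((p.grad ** 2).sum())
                                 for p in layer.parameters() if p.grad is not None)
        return scores

    def build_draft(model, scores, keep_ratio=0.6):
        # Drop the lowest-FIT layers; the remaining stack acts as the draft model
        # for standard speculative verification against the full model.
        n_keep = max(1, int(len(scores) * keep_ratio))
        keep = sorted(sorted(range(len(scores)), key=scores.__getitem__)[-n_keep:])
        draft = copy.deepcopy(model)
        draft.layers = torch.nn.ModuleList([draft.layers[i] for i in keep])
        return draft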
[302] A Unified Multimodal Framework for Dataset Construction and Model-Based Diagnosis of Ameloblastoma
Ajo Babu George, Anna Mariam John, Athul Anoop, Balu Bhasuran
Main category: cs.AI
TL;DR: A multimodal AI framework for ameloblastoma diagnosis using radiological, histopathological, and clinical images with NLP-extracted features from case reports, achieving improved classification accuracy and tissue detection.
Details
Motivation: Existing AI-enabled diagnostics in maxillofacial pathology lack structured, high-quality multimodal datasets with consistent format for direct model training, particularly for ameloblastoma which has limited coverage in current resources.
Method: Curated a multimodal dataset integrating annotated radiological, histopathological, and intraoral clinical images with structured data from case reports using NLP for feature extraction. Developed a multimodal deep learning model for classification, recurrence risk assessment, and surgical planning that accepts clinical inputs like presenting complaint, age, and gender.
Result: Variant classification accuracy increased from 46.2% to 65.9%, and abnormal tissue detection F1-score improved from 43.0% to 90.3%. Benchmarked favorably against resources like MultiCaRe.
Conclusion: The work advances patient-specific decision support by providing both a robust multimodal dataset and an adaptable AI framework for ameloblastoma diagnosis and treatment planning.
Abstract: Artificial intelligence (AI)-enabled diagnostics in maxillofacial pathology require structured, high-quality multimodal datasets. However, existing resources provide limited ameloblastoma coverage and lack the format consistency needed for direct model training. We present a newly curated multimodal dataset specifically focused on ameloblastoma, integrating annotated radiological, histopathological, and intraoral clinical images with structured data derived from case reports. Natural language processing techniques were employed to extract clinically relevant features from textual reports, while image data underwent domain specific preprocessing and augmentation. Using this dataset, a multimodal deep learning model was developed to classify ameloblastoma variants, assess behavioral patterns such as recurrence risk, and support surgical planning. The model is designed to accept clinical inputs such as presenting complaint, age, and gender during deployment to enhance personalized inference. Quantitative evaluation demonstrated substantial improvements; variant classification accuracy increased from 46.2 percent to 65.9 percent, and abnormal tissue detection F1-score improved from 43.0 percent to 90.3 percent. Benchmarked against resources like MultiCaRe, this work advances patient-specific decision support by providing both a robust dataset and an adaptable multimodal AI framework.
[303] Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities
Florian Dietz, William Wale, Oscar Gilg, Robert McCarthy, Felix Michalak, Gustavo Ewbank Rodrigues Danon, Miguelito de Guzman, Dietrich Klakow
Main category: cs.AI
TL;DR: SPT introduces a method to detect misalignment in LLMs by fine-tuning a separate “honest persona” in LoRA parameters that reviews the main model’s responses when activated with a trigger, achieving 96% accuracy on a benchmark where standard methods fail.
Details
Motivation: Standard auditing techniques for detecting misalignment in large language models are insufficient - black-box methods struggle to distinguish misaligned outputs, and mechanistic interpretability doesn't scale with model capabilities. Models may learn to conceal misbehavior during training.
Method: Split Personality Training (SPT) fine-tunes a second “honest persona” into LoRA parameters that remain inactive during normal operation. After the main model responds, the LoRA adapter is activated with a trigger string, enabling the honest persona to review the response while accessing the main model’s latent states.
Result: SPT achieves 96% overall accuracy on the Anthropic Auditing Game Model Organism benchmark, where standard methods report near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as fictional biases the compromised model was trained on.
Conclusion: SPT provides an effective method for detecting misalignment in LLMs by creating an internal auditing mechanism that can access latent model states, overcoming limitations of standard black-box and mechanistic interpretability approaches.
Abstract: Detecting misalignment in large language models is challenging because models may learn to conceal misbehavior during training. Standard auditing techniques fall short: black-box methods often cannot distinguish misaligned outputs from benign ones, and mechanistic interpretability does not scale with model capabilities. We introduce Split Personality Training (SPT), which fine-tunes a second “honest persona” into LoRA parameters that remain inactive during normal operation. After the main model responds, we activate the LoRA adapter and insert a trigger string, enabling the honest persona to review the response while accessing the main model’s latent states. We test our method on the Anthropic Auditing Game Model Organism, a benchmark where Llama-3.3-70B is trained to exploit reward hacks while concealing this behavior. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy. The honest persona reveals latent knowledge inaccessible to external observers, such as the fictional biases the compromised model was trained on.
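As a rough illustration of the inference-time protocol (not the authors' code), the sketch below uses a PEFT LoRA adapter that stays disabled for the normal reply and is enabled, together with a hypothetical trigger string, for the honest-persona review.

    # Assumptions: `model` is a peft.PeftModel wrapping the audited LLM, and
    # TRIGGER is a placeholder for whatever trigger string was used in training.
    TRIGGER = "<honest-review>"

    def respond_then_audit(model, tokenizer, prompt, device="cpu"):
        # 1) Normal response: LoRA weights inactive, so only the main persona speaks.
        with model.disable_adapter():
            ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
            out = model.generate(ids, max_new_tokens=256)
        reply = tokenizer.decode(out[0], skip_special_tokens=True)

        # 2) Honest-persona review: adapter active, trigger appended, same weights,
        #    so the reviewer can draw on the main model's latent state.
        review_prompt = f"{reply}\n{TRIGGER} Does the answer above hide any misbehavior?"
        ids = tokenizer(review_prompt, return_tensors="pt").input_ids.to(device)
        out = model.generate(ids, max_new_tokens=128)
        return reply, tokenizer.decode(out[0], skip_special_tokens=True)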
[304] Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach
Zhengyi Guo, Wenpin Tang, Renyuan Xu
Main category: cs.AI
TL;DR: A principled conditional diffusion guidance framework using Doob’s h-transform and martingale theory to enforce hard constraints in diffusion models without modifying pretrained score networks.
Details
Motivation: Addresses the need for guaranteed constraint satisfaction in safety-critical applications and rare-event simulation where soft/reward-based guidance methods offer no guarantees.
Method: Develops conditional diffusion guidance based on Doob’s h-transform, martingale representation and quadratic variation process. Proposes two off-policy learning algorithms using martingale loss and martingale-covariation loss to estimate the conditioning function and its gradient from pretrained model trajectories.
Result: Provides non-asymptotic guarantees for conditional samplers in total variation and Wasserstein distances, characterizing impact of approximation errors. Numerical experiments demonstrate effectiveness in enforcing hard constraints and generating rare-event samples.
Conclusion: The framework enables guaranteed constraint satisfaction in diffusion models through principled conditional guidance without modifying pretrained networks, with theoretical guarantees and practical effectiveness.
Abstract: We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob’s h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples.
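In equation form, the guidance amounts to adding a drift correction to the pretrained reverse dynamics; the notation below is ours (scalar diffusion coefficient) and is a sketch rather than the paper's exact statement.

    % Pretrained reverse-time dynamics:
    \[ dX_t = b_\theta(X_t, t)\,dt + \sigma(t)\,dW_t \]
    % Conditioning function for the hard constraint $X_T \in A$:
    \[ h(x, t) = \mathbb{P}\bigl(X_T \in A \mid X_t = x\bigr) \]
    % Doob $h$-transformed (guided) dynamics; the score network $b_\theta$ is untouched,
    % only the drift correction $\nabla_x \log h$ is learned:
    \[ dX_t = \Bigl[\, b_\theta(X_t, t) + \sigma(t)^2\, \nabla_x \log h(X_t, t) \Bigr]\, dt + \sigma(t)\, dW_t \]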
[305] Reasoning-guided Collaborative Filtering with Language Models for Explainable Recommendation
Fahad Anwaar, Adil Mehmood Khan, Muhammad Khalid, Usman Zia, Kezhi Wang
Main category: cs.AI
TL;DR: RGCF-XRec is a hybrid framework that integrates reasoning-guided collaborative filtering knowledge into language models for explainable sequential recommendations in a single step, improving performance and efficiency.
Details
Motivation: Current LLM-based explainable recommendation systems overlook collaborative signals and treat recommendation and explanation as separate tasks, leading to memory inefficiency and suboptimal performance.
Method: Introduces reasoning-guided collaborative filtering knowledge through contextual prompting, uses a four-dimensional scoring mechanism (coherence, completeness, relevance, consistency) to filter noisy reasoning traces, and employs a unified representation learning network to encode collaborative and semantic signals for structured LLM prompting.
Result: Consistent improvements across Amazon datasets: HR@10 improved by 7.38% in Sports and 4.59% in Toys; ROUGE-L improved by 8.02% and 3.49%; reduced cold-warm performance gap with 14.5% gains in cold-start and 11.9% in warm-start; enhanced zero-shot HR@5 by 18.54% in Beauty and 23.16% in Toys.
Conclusion: RGCF-XRec effectively integrates collaborative filtering knowledge with LLMs for explainable sequential recommendations, demonstrating improved performance, better generalization, robustness, and training efficiency with a lightweight LLaMA 3.2-3B backbone.
Abstract: Large Language Models (LLMs) exhibit potential for explainable recommendation systems but overlook collaborative signals, while prevailing methods treat recommendation and explanation as separate tasks, resulting in a larger memory footprint. We present RGCF-XRec, a hybrid framework that introduces reasoning-guided collaborative filtering (CF) knowledge into a language model to deliver explainable sequential recommendations in a single step. Theoretical grounding and empirical findings reveal that RGCF-XRec offers three key merits over leading CF-aware LLM-based methods: (1) reasoning-guided augmentation of CF knowledge through contextual prompting to discover latent preferences and interpretable reasoning paths; (2) an efficient scoring mechanism based on four dimensions: coherence, completeness, relevance, and consistency to mitigate noisy CF reasoning traces and retain high-quality explanations; (3) a unified representation learning network that encodes collaborative and semantic signals, enabling a structured prompt to condition the LLM for explainable sequential recommendation. RGCF-XRec demonstrates consistent improvements across Amazon datasets, Sports, Toys, and Beauty, comprising 642,503 user-item interactions. It improves HR@10 by 7.38% in Sports and 4.59% in Toys, along with ROUGE-L by 8.02% and 3.49%, respectively. It reduces the cold-warm performance gap, achieving overall gains of 14.5% in cold-start and 11.9% in warm-start scenarios, and enhances zero-shot HR@5 by 18.54% in Beauty and 23.16% in Toys, highlighting effective generalization and robustness. Moreover, RGCF-XRec achieves training efficiency with a lightweight LLaMA 3.2-3B backbone, ensuring scalability for real-world applications.
[306] TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space?
Yikun Zong, Cheston Tan
Main category: cs.AI
TL;DR: A framework for test-time self-refinement in Vision-Language Models using human-inspired iterative refinement with in-context learning and reward feedback loops, significantly improving geometric reasoning on Tangram puzzles without parameter updates.
Details
Motivation: Current VLMs show systematic failures in continuous geometric reasoning (average IoU 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition), far below human performance. The paper addresses whether models can iteratively refine predictions at test time without parameter updates, inspired by human cognitive processes in solving Tangram puzzles through trial-and-error and correction.
Method: A training-free verifier-refiner agent framework combining in-context learning with reward-guided feedback loops. The system applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, mimicking human cognitive mechanisms of mental rotation, iterative refinement, and visual feedback.
Result: The framework achieves significant improvements: IoU increases from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates substantial enhancement in geometric reasoning capabilities compared to baseline VLMs that showed poor performance (0.41 IoU on single-piece tasks, 0.23 on two-piece composition).
Conclusion: Incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. The work shows that test-time self-refinement without parameter updates is feasible and effective for spatial reasoning tasks.
Abstract: Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback. Inspired by how humans solve Tangram puzzles through trial-and-error, observation, and correction, we design a framework that models these human cognitive mechanisms. However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning: average IoU of only 0.41 on single-piece tasks, dropping to 0.23 on two-piece composition, far below human performance where children can complete Tangram tasks successfully. This paper addresses a fundamental challenge in self-improving AI: can models iteratively refine their predictions at test time without parameter updates? We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes. Our training-free verifier-refiner agent applies recursive refinement loops that iteratively self-refine predictions based on geometric consistency feedback, achieving IoU improvements from 0.63 to 0.932 on medium-triangle cases without any model retraining. This demonstrates that incorporating human-inspired iterative refinement mechanisms through ICL and reward loops can substantially enhance geometric reasoning in VLMs, moving self-improving AI from promise to practice in continuous spatial domains. Our work is available at this anonymous link https://anonymous.4open.science/r/TangramVLM-F582/.
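The refinement loop itself is simple to picture; below is a toy sketch in which a hypothetical `propose_placement` call stands in for the VLM and a plain mask IoU acts as the geometric verifier. None of this is the authors' interface.

    import numpy as np

    def mask_iou(a, b):
        # Intersection-over-union of two boolean occupancy masks.
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0

    def refine(vlm, target_mask, render, max_iters=8, stop_iou=0.9):
        best, best_iou, feedback = None, 0.0, ""
        for _ in range(max_iters):
            # In-context proposal conditioned on the verifier's last feedback.
            placement = vlm.propose_placement(target_mask, feedback)  # hypothetical API
            iou = mask_iou(render(placement), target_mask)
            if iou > best_iou:
                best, best_iou = placement, iou
            if best_iou >= stop_iou:
                break
            feedback = f"IoU was {iou:.2f}; adjust the piece's rotation and offset."
        return best, best_iou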
[307] BhashaSetu: Cross-Lingual Knowledge Transfer from High-Resource to Extreme Low-Resource Languages
Subhadip Maji, Arnab Bhattacharya
Main category: cs.AI
TL;DR: GETR: Graph-Enhanced Token Representation method for cross-lingual knowledge transfer from high-resource to low-resource languages, achieving significant improvements on POS tagging, sentiment classification, and NER tasks.
Details
Motivation: Low-resource languages lag behind high-resource languages in NLP performance due to data scarcity. Cross-lingual knowledge transfer is needed to leverage resources from high-resource languages to improve performance on low-resource languages with only hundreds of labeled examples.
Method: Proposes GETR (Graph-Enhanced Token Representation), a GNN-based approach for cross-lingual knowledge transfer. Compares against two baselines: (1) augmentation in hidden layers, and (2) token embedding transfer through token translation.
Result: GETR significantly outperforms multilingual and cross-lingual baselines: 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20-27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) for sentiment classification and NER.
Conclusion: Graph-based approaches like GETR are effective for cross-lingual knowledge transfer, with detailed analysis identifying key factors for successful transfer in low-resource linguistic contexts.
Abstract: Despite remarkable advances in natural language processing, developing effective systems for low-resource languages remains a formidable challenge, with performances typically lagging far behind high-resource counterparts due to data scarcity and insufficient linguistic resources. Cross-lingual knowledge transfer has emerged as a promising approach to address this challenge by leveraging resources from high-resource languages. In this paper, we investigate methods for transferring linguistic knowledge from high-resource languages to low-resource languages, where the number of labeled training instances is in hundreds. We focus on sentence-level and word-level tasks. We introduce a novel method, GETR (Graph-Enhanced Token Representation) for cross-lingual knowledge transfer along with two adopted baselines (a) augmentation in hidden layers and (b) token embedding transfer through token translation. Experimental results demonstrate that our GNN-based approach significantly outperforms existing multilingual and cross-lingual baseline methods, achieving 13 percentage point improvements on truly low-resource languages (Mizo, Khasi) for POS tagging, and 20 and 27 percentage point improvements in macro-F1 on simulated low-resource languages (Marathi, Bangla, Malayalam) across sentiment classification and NER tasks respectively. We also present a detailed analysis of the transfer mechanisms and identify key factors that contribute to successful knowledge transfer in this linguistic context.
[308] Reactive Knowledge Representation and Asynchronous Reasoning
Simon Kohaut, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami
Main category: cs.AI
TL;DR: Resin (Reactive Signal Inference) is a probabilistic programming language that combines probabilistic logic with reactive programming, using Reactive Circuits for efficient exact inference in dynamic environments by adapting computation based on input signal volatility.
Details
Motivation: Exact inference in probabilistic models is computationally expensive, especially for autonomous agents in dynamic environments requiring real-time belief updates. Existing methods inefficiently re-evaluate entire models upon changes, failing to exploit heterogeneous update rates in real-world information streams.
Method: Introduces Resin, a probabilistic programming language merging probabilistic logic with reactive programming. Proposes Reactive Circuits (RCs) as meta-structures over Algebraic Circuits and asynchronous data streams - time-dynamic DAGs that autonomously adapt based on input signal volatility. Partitions computations based on Frequency of Change in asynchronous inputs.
Result: In high-fidelity drone swarm simulations, achieves several orders of magnitude speedup over frequency-agnostic inference. RCs’ structural adaptations successfully capture environmental dynamics, significantly reducing latency and facilitating reactive real-time reasoning.
Conclusion: By decomposing large inference tasks into memoized sub-problems based on input volatility, the approach ensures only affected model components are re-evaluated, drastically reducing redundant computation in streaming contexts for efficient real-time probabilistic reasoning.
Abstract: Exact inference in complex probabilistic models often incurs prohibitive computational costs. This challenge is particularly acute for autonomous agents in dynamic environments that require frequent, real-time belief updates. Existing methods are often inefficient for ongoing reasoning, as they re-evaluate the entire model upon any change, failing to exploit that real-world information streams have heterogeneous update rates. To address this, we approach the problem from a reactive, asynchronous, probabilistic reasoning perspective. We first introduce Resin (Reactive Signal Inference), a probabilistic programming language that merges probabilistic logic with reactive programming. Furthermore, to provide efficient and exact semantics for Resin, we propose Reactive Circuits (RCs). Formulated as a meta-structure over Algebraic Circuits and asynchronous data streams, RCs are time-dynamic Directed Acyclic Graphs that autonomously adapt themselves based on the volatility of input signals. In high-fidelity drone swarm simulations, our approach achieves several orders of magnitude of speedup over frequency-agnostic inference. We demonstrate that RCs’ structural adaptations successfully capture environmental dynamics, significantly reducing latency and facilitating reactive real-time reasoning. By partitioning computations based on the estimated Frequency of Change in the asynchronous inputs, large inference tasks can be decomposed into individually memoized sub-problems. This ensures that only the specific components of a model affected by new information are re-evaluated, drastically reducing redundant computation in streaming contexts.
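The core efficiency idea, recomputing only what an asynchronous update actually touches, can be illustrated with a tiny dependency graph. This is a generic reactive-memoization toy, not Resin's probabilistic semantics over Algebraic Circuits.

    class Node:
        def __init__(self, fn, *parents):
            self.fn = fn
            self.parents = parents
            self.children = []
            self.value = None
            self.dirty = True
            for p in parents:
                p.children.append(self)

        def invalidate(self):
            # Mark this node and everything downstream as stale.
            if not self.dirty:
                self.dirty = True
                for c in self.children:
                    c.invalidate()

        def get(self):
            # Re-evaluate only if some upstream signal changed since the last read.
            if self.dirty:
                self.value = self.fn(*(p.get() for p in self.parents))
                self.dirty = False
            return self.value

    class Signal(Node):
        def __init__(self, value):
            super().__init__(None)
            self.value, self.dirty = value, False

        def set(self, value):
            # Asynchronous input update: only dependent nodes are invalidated.
            if value != self.value:
                self.value = value
                for c in self.children:
                    c.invalidate()

    rain = Signal(0.2)
    wind = Signal(0.6)
    risk = Node(lambda r, w: 0.7 * r + 0.3 * w, rain, wind)  # evaluated lazily
    risk.get()     # computed once
    rain.set(0.9)  # a high-frequency input change touches only this branch
    risk.get()     # recomputed; wind.get() just returns its cached value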
[309] Generative Ontology: When Structured Knowledge Learns to Create
Benny Cheung
Main category: cs.AI
TL;DR: Generative Ontology framework combines LLM creativity with ontology structure to generate valid, creative domain artifacts like games, music, or software.
Details
Motivation: Traditional ontologies describe domain structure but can't generate novel artifacts, while LLMs generate fluently but produce structurally invalid outputs that hallucinate mechanisms without components. Need to synthesize ontology's structural validity with LLM's creativity.
Method: Encode domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. Use multi-agent pipeline with specialized roles (Mechanics Architect, Theme Weaver, Balance Critic) each with professional “anxiety” to prevent shallow outputs. Employ retrieval-augmented generation grounded in existing exemplars and iterative validation for coherence.
Result: Demonstrated through GameGrammar system generating complete, playable tabletop game designs from thematic prompts. Outputs satisfy ontological constraints while remaining creative with mechanisms, components, victory conditions, and setup instructions.
Conclusion: Pattern generalizes to any domain with expert vocabulary, validity constraints, and accumulated exemplars (music, software, culinary arts). Constraints enable rather than limit creativity - ontology makes structured generation possible like grammar enables poetry.
Abstract: Traditional ontologies excel at describing domain structure but cannot generate novel artifacts. Large language models generate fluently but produce outputs that lack structural validity, hallucinating mechanisms without components, goals without end conditions. We introduce Generative Ontology, a framework that synthesizes these complementary strengths: ontology provides the grammar; the LLM provides the creativity. Generative Ontology encodes domain knowledge as executable Pydantic schemas that constrain LLM generation via DSPy signatures. A multi-agent pipeline assigns specialized roles to different ontology domains: a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, a Balance Critic identifies exploits. Each agent carries a professional “anxiety” that prevents shallow, agreeable outputs. Retrieval-augmented generation grounds novel designs in precedents from existing exemplars, while iterative validation ensures coherence between mechanisms and components. We demonstrate the framework through GameGrammar, a system for generating complete tabletop game designs. Given a thematic prompt (“bioluminescent fungi competing in a cave ecosystem”), the pipeline produces structurally complete, playable game specifications with mechanisms, components, victory conditions, and setup instructions. These outputs satisfy ontological constraints while remaining genuinely creative. The pattern generalizes beyond games. Any domain with expert vocabulary, validity constraints, and accumulated exemplars (music composition, software architecture, culinary arts) is a candidate for Generative Ontology. We argue that constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible.
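To ground the "executable schema" idea, here is a small Pydantic sketch in the spirit of GameGrammar; the field names and the validation rule are illustrative rather than the actual ontology, and the DSPy wiring is omitted.

    from pydantic import BaseModel, Field, model_validator

    class Mechanism(BaseModel):
        name: str
        components: list[str] = Field(min_length=1)  # no component-free mechanisms

    class GameDesign(BaseModel):
        theme: str
        mechanisms: list[Mechanism] = Field(min_length=1)
        victory_condition: str
        setup: list[str]

        @model_validator(mode="after")
        def components_are_declared(self):
            # Structural validity: every component a mechanism uses must appear
            # in the setup instructions, so the generator cannot hallucinate pieces.
            used = {c for m in self.mechanisms for c in m.components}
            missing = used - set(self.setup)
            if missing:
                raise ValueError(f"mechanisms reference undeclared components: {missing}")
            return self

An LLM draft that fails validation (for example via GameDesign.model_validate) would be rejected and regenerated rather than accepted.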
[310] Graph-based Agent Memory: Taxonomy, Techniques, and Applications
Chang Yang, Chuang Zhou, Yilin Xiao, Su Dong, Luyao Zhuang, Yujing Zhang, Zhu Wang, Zijin Hong, Zheng Yuan, Zhishang Xiang, Shengyuan Chen, Huachi Zhou, Qinggang Zhang, Ninghao Liu, Jinsong Su, Xinrun Wang, Yi Chang, Xiao Huang
Main category: cs.AI
TL;DR: Survey paper on graph-based memory systems for LLM agents, covering taxonomy, lifecycle techniques, tools, and applications
Details
Motivation: Memory is crucial for LLM agents handling complex long-horizon tasks, and graph structures offer superior capabilities for modeling relational dependencies, organizing hierarchical information, and enabling efficient retrieval compared to other memory paradigms.
Method: Comprehensive survey methodology: 1) Taxonomy of agent memory types, 2) Systematic analysis of graph-based memory lifecycle (extraction, storage, retrieval, evolution), 3) Review of open-source libraries and benchmarks, 4) Exploration of application scenarios
Result: Provides structured framework for understanding graph-based agent memory, identifies key techniques and tools, summarizes available resources, and establishes research directions for self-evolving memory systems
Conclusion: Graph-based memory offers powerful advantages for LLM agents, and this survey provides actionable insights and resources to advance development of more efficient and reliable memory systems for complex AI tasks
Abstract: Memory emerges as the core module in the Large Language Model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. Among diverse paradigms, graph stands out as a powerful structure for agent memory due to the intrinsic capabilities to model relational dependencies, organize hierarchical information, and support efficient retrieval. This survey presents a comprehensive review of agent memory from the graph-based perspective. First, we introduce a taxonomy of agent memory, including short-term vs. long-term memory, knowledge vs. experience memory, non-structural vs. structural memory, with an implementation view of graph-based memory. Second, according to the life cycle of agent memory, we systematically analyze the key techniques in graph-based agent memory, covering memory extraction for transforming the data into the contents, storage for organizing the data efficiently, retrieval for retrieving the relevant contents from memory to support reasoning, and evolution for updating the contents in the memory. Third, we summarize the open-sourced libraries and benchmarks that support the development and evaluation of self-evolving agent memory. We also explore diverse application scenarios. Finally, we identify critical challenges and future research directions. This survey aims to offer actionable insights to advance the development of more efficient and reliable graph-based agent memory systems. All the related resources, including research papers, open-source data, and projects, are collected for the community in https://github.com/DEEP-PolyU/Awesome-GraphMemory.
[311] Determining Energy Efficiency Sweet Spots in Production LLM Inference
Hiari Pizzini Cavagna, Andrea Proia, Giacomo Madella, Giovanni B. Esposito, Francesco Antici, Daniele Cesarini, Zeynep Kiziltan, Andrea Bartolini
Main category: cs.AI
TL;DR: Paper proposes analytical model for LLM energy consumption that reveals non-linear efficiency regimes with “sweet spots” for optimal energy use
Details
Motivation: Existing approaches estimate LLM energy consumption through simple linear functions, but real-world observations show non-linear dependencies with clear efficiency regimes that need better modeling.
Method: Develop analytical model based on computational and memory-access complexity of Transformer architecture, validated using TensorRT-LLM on NVIDIA H100 GPUs across diverse LLMs (1B-9B parameters) with input/output lengths from 64-4096 tokens
Result: Model achieves mean MAPE of 1.79%, identifies energy efficiency “sweet spots” where peak efficiency occurs with short-to-moderate inputs and medium-length outputs, with sharp drops for long inputs or very short outputs
Conclusion: Aligning sequence lengths with efficiency sweet spots can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems
Abstract: Large Language Models (LLMs) inference is central in modern AI applications, making it critical to understand their energy footprint. Existing approaches typically estimate energy consumption through simple linear functions of input and output sequence lengths, yet our observations reveal clear Energy Efficiency regimes: peak efficiency occurs with short-to-moderate inputs and medium-length outputs, while efficiency drops sharply for long inputs or very short outputs, indicating a non-linear dependency. In this work, we propose an analytical model derived from the computational and memory-access complexity of the Transformer architecture, capable of accurately characterizing the efficiency curve as a function of input and output lengths. To assess its accuracy, we evaluate energy consumption using TensorRT-LLM on NVIDIA H100 GPUs across a diverse set of LLMs ranging from 1B to 9B parameters, including OPT, LLaMA, Gemma, Falcon, Qwen2, and Granite, tested over input and output lengths from 64 to 4096 tokens, achieving a mean MAPE of 1.79%. Our results show that aligning sequence lengths with these efficiency “Sweet Spots” can substantially reduce energy usage, supporting informed truncation, summarization, and adaptive generation strategies in production systems.
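The fitted model itself is not reproduced here, but a generic first-order transformer cost sketch (our notation and simplifications, not the paper's) is enough to see why energy efficiency cannot be linear in the two sequence lengths:

    % $N_p$: parameter count, $d$: hidden size, $c_a, c_m$: hardware-dependent constants.
    \[ C_{\mathrm{prefill}}(L_{\mathrm{in}}) \approx 2 N_p L_{\mathrm{in}} + c_a\, d\, L_{\mathrm{in}}^2 \]
    \[ C_{\mathrm{decode}}(L_{\mathrm{in}}, L_{\mathrm{out}}) \approx L_{\mathrm{out}} \bigl( 2 N_p + c_m\, d\, (L_{\mathrm{in}} + L_{\mathrm{out}}/2) \bigr) \]
    \[ \mathrm{Efficiency}(L_{\mathrm{in}}, L_{\mathrm{out}}) \;\propto\; \frac{L_{\mathrm{out}}}{C_{\mathrm{prefill}} + C_{\mathrm{decode}}} \]
    % Very short outputs cannot amortize the prefill term; very long inputs inflate both the
    % quadratic attention term and per-token KV traffic, so efficiency peaks in between.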
[312] Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions
Yihao Ouyang, Shiwei Li, Haozhao Wang, Xiandi Luo, Zhuoqi Hu, Yuetong Song, Qiyu Qin, Yichen Li, Ruixuan Li
Main category: cs.AI
TL;DR: GenLoRA replaces explicit basis vector storage in LoRA with lightweight nonlinear functions for more parameter-efficient fine-tuning.
Details
Motivation: Standard LoRA suffers from parameter redundancy in basis vectors and substantial parameter growth when increasing model capacity, limiting its efficiency.
Method: GenLoRA maintains latent vectors for each low-rank matrix and uses lightweight radial basis functions (RBFs) to synthesize basis vectors instead of storing them explicitly.
Result: GenLoRA achieves higher effective LoRA ranks under smaller parameter budgets and shows superior fine-tuning performance across multiple datasets and architectures.
Conclusion: GenLoRA provides a more parameter-efficient alternative to standard LoRA by replacing explicit basis vector storage with generative functions.
Abstract: Low-rank adaptation (LoRA) approximates the update of a pretrained weight matrix using the product of two low-rank matrices. However, standard LoRA follows an explicit-rank paradigm, where increasing model capacity requires adding more rows or columns (i.e., basis vectors) to the low-rank matrices, leading to substantial parameter growth. In this paper, we find that these basis vectors exhibit significant parameter redundancy and can be compactly represented by lightweight nonlinear functions. Therefore, we propose Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis vector storage with nonlinear basis vector generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and employs a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF requires far fewer parameters than an explicit basis vector, enabling higher parameter efficiency in GenLoRA. Extensive experiments across multiple datasets and architectures show that GenLoRA attains higher effective LoRA ranks under smaller parameter budgets, resulting in superior fine-tuning performance. The code is available at https://anonymous.4open.science/r/GenLoRA-1519.
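The abstract leaves the exact RBF parameterization open; the sketch below is one plausible reading (latents, centers, and a shared projection are our assumptions) intended only to show how basis vectors can be synthesized at forward time rather than stored.

    import torch
    import torch.nn as nn

    class RBFBasis(nn.Module):
        # Synthesizes `rank` basis vectors of size `dim` from low-dimensional latents.
        def __init__(self, dim, rank, n_centers=16, latent_dim=8):
            super().__init__()
            self.latent = nn.Parameter(torch.randn(rank, latent_dim))
            self.centers = nn.Parameter(torch.randn(n_centers, latent_dim))
            self.log_gamma = nn.Parameter(torch.zeros(n_centers))
            self.proj = nn.Parameter(torch.randn(n_centers, dim) / dim ** 0.5)

        def forward(self):
            # phi[i, k] = exp(-gamma_k * ||z_i - c_k||^2); basis = phi @ proj -> (rank, dim)
            d2 = torch.cdist(self.latent, self.centers) ** 2
            phi = torch.exp(-self.log_gamma.exp() * d2)
            return phi @ self.proj

    class GenLoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank=8, alpha=16):
            super().__init__()
            self.base, self.scale = base, alpha / rank  # base weights typically frozen
            self.gen_A = RBFBasis(base.in_features, rank)
            self.gen_B = RBFBasis(base.out_features, rank)

        def forward(self, x):
            A, B = self.gen_A(), self.gen_B()  # generated, not stored, basis vectors
            return self.base(x) + self.scale * (x @ A.t() @ B)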
[313] Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, Guanhua Chen
Main category: cs.AI
TL;DR: APO addresses RLVR’s Recursive Space Contraction by shifting from KL regularization’s Shape Matching to Support Coverage, enabling aggressive sharpening while preventing collapse through elastic recovery.
Details
Motivation: Standard RLVR suffers from Recursive Space Contraction where valid alternatives vanish due to positive sharpening and negative squeezing dynamics. KL regularization creates gradient conflicts by forcing full density matching rather than supporting efficient sharpening.
Method: Anchored Policy Optimization defines a Safe Manifold based on reference model’s high-confidence support, allowing aggressive sharpening for efficiency while selectively applying restorative forces during error correction to prevent collapse.
Result: APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring Pass@K diversity typically lost by standard policy gradient methods on mathematical benchmarks.
Conclusion: APO provides a gradient-aligned mechanism for maximizing support coverage, enabling elastic recovery that re-inflates valid branches and prevents irreversible collapse in RLVR systems.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model’s full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model’s high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
[314] Mitigating Hallucination in Financial Retrieval-Augmented Generation via Fine-Grained Knowledge Verification
Taoye Yin, Haoyuan Hu, Yaxin Fan, Xinhao Chen, Xinya Wu, Kai Deng, Kezun Zhang, Feng Wang
Main category: cs.AI
TL;DR: A Reinforcement Learning framework with Fine-grained Knowledge Verification (RLFKV) for financial RAG systems that decomposes responses into atomic knowledge units to compute faithful rewards and prevent hallucinations.
Details
Motivation: Financial RAG systems suffer from hallucinations that contradict retrieved information despite using retrieved documents to address knowledge gaps, requiring better alignment between generated responses and source documents.
Method: Proposes RLFKV framework that decomposes financial responses into atomic knowledge units, computes fine-grained faithful rewards based on correctness of each unit, and adds informativeness reward to prevent overly concise replies.
Result: Experiments on Financial Data Description (FDD) task and newly proposed FDD-ANT dataset show consistent improvements, confirming effectiveness of the approach.
Conclusion: The RLFKV framework successfully mitigates hallucinations in financial RAG systems by providing more precise optimization signals through fine-grained knowledge verification.
Abstract: In financial Retrieval-Augmented Generation (RAG) systems, models frequently rely on retrieved documents to generate accurate responses due to the time-sensitive nature of the financial domain. While retrieved documents help address knowledge gaps, model-generated responses still suffer from hallucinations that contradict the retrieved information. To mitigate this inconsistency, we propose a Reinforcement Learning framework enhanced with Fine-grained Knowledge Verification (RLFKV). Our method decomposes financial responses into atomic knowledge units and assesses the correctness of each unit to compute the fine-grained faithful reward. This reward offers more precise optimization signals, thereby improving alignment with the retrieved documents. Additionally, to prevent reward hacking (e.g., overly concise replies), we incorporate an informativeness reward that encourages the policy model to retain at least as many knowledge units as the base model. Experiments conducted on the public Financial Data Description (FDD) task and our newly proposed FDD-ANT dataset demonstrate consistent improvements, confirming the effectiveness of our approach.
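At the reward level the idea is straightforward; the sketch below assumes hypothetical extract_units and verify_unit helpers (e.g., an auxiliary verifier model) and an illustrative weighting, not the paper's exact formulation.

    def rlfkv_reward(response, base_response, documents,
                     extract_units, verify_unit, lam=0.5):
        # Fine-grained faithfulness: fraction of atomic knowledge units in the
        # response that are supported by the retrieved documents.
        units = extract_units(response)
        if not units:
            return 0.0
        faithful = sum(verify_unit(u, documents) for u in units) / len(units)
        # Informativeness: discourage reward hacking via overly terse replies by
        # comparing unit counts against the base model's response.
        n_base = max(1, len(extract_units(base_response)))
        informative = min(1.0, len(units) / n_base)
        return faithful + lam * informative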
[315] LeakBoost: Perceptual-Loss-Based Membership Inference Attack
Amit Kravchik Taub, Fred M. Grabovski, Guy Amit, Yisroel Mirsky
Main category: cs.AI
TL;DR: LeakBoost is a membership inference attack framework that uses perceptual-loss optimization to actively probe model representations and amplify differences between training and non-training samples, significantly improving attack performance.
Details
Motivation: Existing membership inference attacks rely on static indicators like loss or confidence scores, failing to leverage the dynamic behavior of models when actively probed. There's a need for more effective attacks that can better expose privacy risks in machine learning systems.
Method: LeakBoost synthesizes interrogation images by optimizing a perceptual (activation-space) objective to amplify representational differences between members and non-members. It works with existing membership detectors without modifying them, using gradient-based optimization to create probing images that reveal membership signals.
Result: LeakBoost substantially improves attack performance, raising AUC from near-chance levels (0.53-0.62) to 0.81-0.88, and increasing true positive rates at 1% false positive rate by over an order of magnitude compared to strong baseline attacks. Improvements are strongest with gradient-based detectors.
Conclusion: LeakBoost offers a modular, computationally efficient way to assess privacy risks in white-box settings, advancing dynamic membership inference by actively probing model representations rather than relying on static indicators.
Abstract: Membership inference attacks (MIAs) aim to determine whether a sample was part of a model’s training set, posing serious privacy risks for modern machine-learning systems. Existing MIAs primarily rely on static indicators, such as loss or confidence, and do not fully leverage the dynamic behavior of models when actively probed. We propose LeakBoost, a perceptual-loss-based interrogation framework that actively probes a model’s internal representations to expose hidden membership signals. Given a candidate input, LeakBoost synthesizes an interrogation image by optimizing a perceptual (activation-space) objective, amplifying representational differences between members and non-members. This image is then analyzed by an off-the-shelf membership detector, without modifying the detector itself. When combined with existing membership inference methods, LeakBoost achieves substantial improvements at low false-positive rates across multiple image classification datasets and diverse neural network architectures. In particular, it raises AUC from near-chance levels (0.53-0.62) to 0.81-0.88, and increases TPR at 1 percent FPR by over an order of magnitude compared to strong baseline attacks. A detailed sensitivity analysis reveals that deeper layers and short, low-learning-rate optimization produce the strongest leakage, and that improvements concentrate in gradient-based detectors. LeakBoost thus offers a modular and computationally efficient way to assess privacy risks in white-box settings, advancing the study of dynamic membership inference.
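Mechanically, the interrogation step is input-space optimization against an activation-space objective. The sketch below shows that machinery with a placeholder objective; the paper's actual perceptual loss and layer choice differ, and only the "short, low-learning-rate" regime is taken from the abstract.

    import torch

    def interrogate(model, layer, x, steps=30, lr=1e-3):
        feats = {}
        hook = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
        with torch.no_grad():
            model(x)
        ref = feats["out"].detach()               # activations of the candidate sample
        x_probe = x.clone().requires_grad_(True)
        opt = torch.optim.Adam([x_probe], lr=lr)  # short, low-LR optimization
        for _ in range(steps):
            opt.zero_grad()
            model(x_probe)
            # Placeholder perceptual objective: push the probe's activations away
            # from the candidate's, amplifying member/non-member differences.
            loss = -(feats["out"] - ref).pow(2).mean()
            loss.backward()
            opt.step()
        hook.remove()
        return x_probe.detach()                   # handed to an off-the-shelf MIA detector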
[316] RocqSmith: Can Automatic Optimization Forge Better Proof Agents?
Andrei Kozyrev, Nikita Khramov, Denis Lochmelis, Valerio Morelli, Gleb Solovev, Anton Podkopaev
Main category: cs.AI
TL;DR: Automatic AI agent optimization methods applied to theorem proving agents in Rocq, with few-shot bootstrapping showing consistent improvements but not matching carefully engineered state-of-the-art agents.
Details
Motivation: To study whether automatic AI agent optimization methods can be effectively applied to real-world agents in formal verification settings, specifically automated theorem proving in Rocq, and to determine if fine-grained tuning aspects like prompt design, contextual knowledge, and control strategies can be automated.
Method: Evaluated different automatic agent optimizers applied to optimizing a Rocq proof-generation agent, comparing their performance in automating aspects of agent tuning such as prompt design, contextual knowledge, and control strategies.
Result: Several optimizers yielded measurable improvements, but simple few-shot bootstrapping was the most consistently effective method. However, none of the studied methods matched the performance of a carefully engineered state-of-the-art proof agent.
Conclusion: While automatic optimization methods show promise for AI agents in formal verification, current approaches still fall short of carefully engineered solutions, suggesting that human expertise remains important for achieving state-of-the-art performance in theorem proving agents.
Abstract: This work studies the applicability of automatic AI agent optimization methods to real-world agents in formal verification settings, focusing on automated theorem proving in Rocq as a representative and challenging domain. We evaluate how different automatic agent optimizers perform when applied to the task of optimizing a Rocq proof-generation agent, and assess whether parts of the fine-grained tuning of agentic systems, such as prompt design, contextual knowledge, and control strategies, can be automated. Our results show that while several optimizers yield measurable improvements, simple few-shot bootstrapping is the most consistently effective; however, none of the studied methods matches the performance of a carefully engineered state-of-the-art proof agent.
[317] RL-VLA$^3$: Reinforcement Learning VLA Accelerating via Full Asynchronism
Zhong Guan, Haoran Sun, Yongjian Guo, Shuai Di, Xiaodong Bai, Jing Long, Tianyun Zhao, Mingxi Luo, Chen Zhou, Yucheng Guo, Qiming Yang, Wanting Xu, Wen Huang, Yunxuan Ma, Hongke Zhao, Likang Wu, Xiaotie Deng, Xi Xiao, Sheng Wen, Yicheng Gong, Junwu Xiong
Main category: cs.AI
TL;DR: Proposes a fully-asynchronous policy training framework for Vision-Language-Action models to overcome efficiency bottlenecks in synchronous RL training, achieving significant throughput improvements.
Details
Motivation: Existing RL-based training frameworks for VLA models suffer from synchronous execution limitations causing severe resource underutilization and throughput bottlenecks during environment interaction, policy generation, and model update phases.
Method: A fully-asynchronous policy training framework with multi-level decoupled architecture: asynchronous parallelization of environment interaction/trajectory collection, streaming execution for policy generation, and decoupled scheduling for training updates.
Result: Achieved up to 59.25% throughput improvement on LIBERO benchmark vs synchronous strategies, with up to 126.67% improvement with deep optimization; demonstrated excellent scalability across 8-256 GPUs.
Conclusion: The proposed asynchronous framework significantly enhances training efficiency for VLA models while maintaining effectiveness, addressing a key bottleneck in embodied intelligence research.
Abstract: In recent years, Vision-Language-Action (VLA) models have emerged as a crucial pathway towards general embodied intelligence, yet their training efficiency has become a key bottleneck. Although existing reinforcement learning (RL)-based training frameworks like RLinf can enhance model generalization, they still rely on synchronous execution, leading to severe resource underutilization and throughput limitations during environment interaction, policy generation (rollout), and model update phases (actor). To overcome this challenge, this paper, for the first time, proposes and implements a fully-asynchronous policy training framework encompassing the entire pipeline from environment interaction, rollout generation, to actor policy updates. Systematically drawing inspiration from asynchronous optimization ideas in large model RL, our framework designs a multi-level decoupled architecture. This includes asynchronous parallelization of environment interaction and trajectory collection, streaming execution for policy generation, and decoupled scheduling for training updates. We validated the effectiveness of our method across diverse VLA models and environments. On the LIBERO benchmark, the framework achieves throughput improvements of up to 59.25% compared to existing synchronous strategies. When deeply optimizing separation strategies, throughput can be increased by as much as 126.67%. We verified the effectiveness of each asynchronous component via ablation studies. Scaling law validation across 8 to 256 GPUs demonstrates our method’s excellent scalability under most conditions.
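The decoupling pattern itself (a generic producer/consumer structure, not the paper's system) can be pictured as independent loops connected by bounded queues:

    import queue
    import random
    import threading
    import time

    rollout_q = queue.Queue(maxsize=8)  # trajectories waiting for the learner
    weights = {"version": 0}            # shared policy weights; stale reads are tolerated

    def rollout_worker():
        # Environment interaction + trajectory collection never waits for an update step.
        while True:
            traj = [("obs", random.random(), weights["version"]) for _ in range(16)]
            rollout_q.put(traj)         # blocks only if the learner lags far behind

    def learner():
        # Streaming policy updates: consume whichever trajectory is ready first.
        while True:
            rollout_q.get()
            time.sleep(0.01)            # stand-in for a gradient step
            weights["version"] += 1     # workers pick up newer weights asynchronously

    threads = [threading.Thread(target=rollout_worker, daemon=True) for _ in range(4)]
    threads.append(threading.Thread(target=learner, daemon=True))
    for t in threads:
        t.start()
    time.sleep(1)                       # let the toy pipeline run briefly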
[318] FiMI: A Domain-Specific Language Model for Indian Finance Ecosystem
Aboli Kathar, Aman Kumar, Anusha Kamath, Araveeti Srujan, Ashish Sharma, Chandra Bhushan, Dilip Asbe, Divya Sorate, Duddu Prasanth Kumar, Evan Acharya, Harsh Sharma, Hrithik Kadam, Kanishk Singla, Keyur Doshi, Kiran Praveen, Kolisetty Krishna SK, Krishanu Adhikary, Lokesh MPT, Mayurdeep Sonowal, Nadeem Shaikh, Navya Prakash, Nimit Kothari, Nitin Kukreja, Prashant Devadiga, Rakesh Paul, Ratanjeet Pratap Chauhan, Raunak Kalani, Raviraj Joshi, Shamanth MH, Shantanu Pandey, Shubham Soni, Siddharth Dixit, Smriti Jopat, Sunil Patel, Suraj Singh, Suvradip Paul, Tulasi Pilla, Utkarsh Vaidya, Vineeth Nambiar, Vishal Kanvaty, Yatharth Dedhia
Main category: cs.AI
TL;DR: FiMI is a domain-specialized financial language model for Indian digital payment systems, with two variants (Base and Instruct) built on Mistral Small 24B architecture, achieving significant improvements in finance reasoning and tool-calling while maintaining general capabilities.
Details
Motivation: To develop a specialized financial language model for Indian digital payment systems that can handle financial workflows, multilingual content (English, Hindi, Hinglish), and tool-driven conversations for real-world financial applications.
Method: Adapted Mistral Small 24B architecture through multi-stage training: continuous pre-training on 68B tokens of curated financial, multilingual, and synthetic data, followed by instruction fine-tuning and domain-specific supervised fine-tuning for multi-turn, tool-driven conversations.
Result: FiMI Base shows 20% improvement over Mistral Small 24B Base on finance reasoning benchmarks; FiMI Instruct outperforms Mistral Small 24B Instruct by 87% on domain-specific tool-calling while maintaining comparable performance on general benchmarks.
Conclusion: FiMI successfully creates a specialized financial model for Indian payment systems that significantly improves domain-specific performance without sacrificing general capabilities, demonstrating effective domain adaptation for financial applications.
Abstract: We present FiMI (Finance Model for India), a domain-specialized financial language model developed for Indian digital payment systems. We develop two model variants: FiMI Base and FiMI Instruct. FiMI adapts the Mistral Small 24B architecture through a multi-stage training pipeline, beginning with continuous pre-training on 68 Billion tokens of curated financial, multilingual (English, Hindi, Hinglish), and synthetic data. This is followed by instruction fine-tuning and domain-specific supervised fine-tuning focused on multi-turn, tool-driven conversations that model real-world workflows, such as transaction disputes and mandate lifecycle management. Evaluations reveal that FiMI Base achieves a 20% improvement over the Mistral Small 24B Base model on finance reasoning benchmark, while FiMI Instruct outperforms the Mistral Small 24B Instruct model by 87% on domain-specific tool-calling. Moreover, FiMI achieves these significant domain gains while maintaining comparable performance to models of similar size on general benchmarks.
[319] NEX: Neuron Explore-Exploit Scoring for Label-Free Chain-of-Thought Selection and Model Ranking
Kang Chen, Zhuoka Feng, Sihan Zhao, Kai Xiong, Junjie Nian, Yaoning Wang, Changyi Xiao, Yixin Cao
Main category: cs.AI
TL;DR: NEX is an unsupervised scoring framework that analyzes reasoning processes by detecting exploration-exploitation phases through MLP neuron activation patterns, enabling selection of better reasoning traces without task supervision.
Details
Motivation: As LLMs increasingly use multiple reasoning traces or merged checkpoints, the bottleneck shifts from generation to selection, often without supervision on the target distribution. Current methods lack interpretable, unsupervised ways to identify high-quality reasoning processes.
Method: NEX views reasoning as alternating exploration (E-phase) and exploitation (X-phase) phases. It detects E-phase via spikes in newly activated MLP neurons per token from sparse activation caches, uses a sticky two-state HMM to infer E-X phases, and credits E-introduced neurons based on whether they’re reused in following X spans.
Result: NEX’s Good-Mass Fraction score predicts downstream accuracy across reasoning benchmarks and Qwen3 merge families, identifies better variants without task answers, and provides interpretable neuron weights. Human annotations validate the E-X signal, and causal evidence is shown via neuron transfer experiments.
Conclusion: NEX offers a white-box, label-free unsupervised framework for scoring reasoning processes by analyzing exploration-exploitation dynamics through neuron activation patterns, enabling better selection of reasoning traces and model variants without supervision.
Abstract: Large language models increasingly spend inference compute sampling multiple chain-of-thought traces or searching over merged checkpoints. This shifts the bottleneck from generation to selection, often without supervision on the target distribution. We show entropy-based exploration proxies follow an inverted-U with accuracy, suggesting extra exploration can become redundant and induce overthinking. We propose NEX, a white-box label-free unsupervised scoring framework that views reasoning as alternating E-phase (exploration) and X-phase (exploitation). NEX detects E-phase as spikes in newly activated MLP neurons per token from sparse activation caches, then uses a sticky two-state HMM to infer E-X phases and credits E-introduced neurons by whether they are reused in the following X span. These signals yield interpretable neuron weights and a single Good-Mass Fraction score to rank candidate responses and merged variants without task answers. Across reasoning benchmarks and Qwen3 merge families, NEX computed on a small unlabeled activation set predicts downstream accuracy and identifies better variants; we further validate the E-X signal with human annotations and provide causal evidence via “Effective-vs-Redundant” neuron transfer.
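The two signal-processing steps are easy to sketch: count newly activated neurons per token, then decode a sticky two-state path over those counts. The emission model and the spike threshold below are placeholders, not the paper's.

    import numpy as np

    def new_activation_counts(acts, thresh=0.0):
        # acts: (T tokens, N neurons). A neuron counts as "newly activated" at
        # token t if it fires there but never fired at any earlier token.
        fired = acts > thresh
        seen = np.zeros(acts.shape[1], dtype=bool)
        counts = np.zeros(acts.shape[0], dtype=int)
        for t in range(acts.shape[0]):
            counts[t] = int((fired[t] & ~seen).sum())
            seen |= fired[t]
        return counts

    def decode_phases(counts, stay=0.9):
        # Sticky two-state Viterbi: state 0 = X (exploit), state 1 = E (explore).
        hi = counts > np.percentile(counts, 75)      # crude "spike" indicator
        log_emit = np.where(hi[:, None], np.log([[0.2, 0.8]]), np.log([[0.8, 0.2]]))
        log_trans = np.log(np.array([[stay, 1 - stay], [1 - stay, stay]]))
        T = len(counts)
        dp = np.zeros((T, 2))
        ptr = np.zeros((T, 2), dtype=int)
        dp[0] = np.log(0.5) + log_emit[0]
        for t in range(1, T):
            scores = dp[t - 1][:, None] + log_trans  # scores[i, j]: prev state i -> cur state j
            ptr[t] = scores.argmax(axis=0)
            dp[t] = scores.max(axis=0) + log_emit[t]
        path = [int(dp[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(ptr[t, path[-1]]))
        return np.array(path[::-1])                  # per-token E/X labels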
[320] STProtein: predicting spatial protein expression from multi-omics data
Zhaorui Jiang, Yingfang Yuan, Lei Hu, Wei Pang
Main category: cs.AI
TL;DR: STProtein is a graph neural network framework that predicts spatial protein expression from spatial transcriptomics data to address data imbalance in spatial multi-omics integration.
Details
Motivation: There's a significant data imbalance in spatial multi-omics: spatial transcriptomics data is abundant while spatial proteomics data is scarce due to technical limitations and high costs, hindering biological research progress.
Method: STProtein uses graph neural networks with a multi-task learning strategy to predict unknown spatial protein expression from more accessible spatial transcriptomics data.
Result: The framework enables accurate prediction of spatial protein expression, addressing the scarcity of spatial proteomics data and accelerating spatial multi-omics integration.
Conclusion: STProtein can effectively overcome the spatial proteomics data scarcity, accelerate multi-omics integration, and potentially catalyze transformative breakthroughs in life sciences by uncovering hidden spatial patterns and biological relationships.
Abstract: The integration of spatial multi-omics data from single tissues is crucial for advancing biological research. However, a significant data imbalance impedes progress: while spatial transcriptomics data is relatively abundant, spatial proteomics data remains scarce due to technical limitations and high costs. To overcome this challenge, we propose STProtein, a novel framework leveraging graph neural networks with a multi-task learning strategy. STProtein is designed to accurately predict unknown spatial protein expression using more accessible spatial multi-omics data, such as spatial transcriptomics. We believe that STProtein can effectively address the scarcity of spatial proteomics, accelerating the integration of spatial multi-omics and potentially catalyzing transformative breakthroughs in life sciences. This tool enables scientists to accelerate discovery by identifying complex and previously hidden spatial patterns of proteins within tissues, uncovering novel relationships between different marker genes, and exploring the biological “Dark Matter”.
[321] TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning
Zihao Jiang, Miao Peng, Zhenyan Shan, Wenjie Xu, Ben Liu, Gong Chen, Ziqi Gao, Min Peng
Main category: cs.AI
TL;DR: TKG-Thinker: An LLM-based agent with autonomous planning and adaptive retrieval for temporal knowledge graph question answering, using dual-training (SFT + RL) to improve reasoning under complex temporal constraints.
Details
Motivation: Current LLM prompting strategies for TKGQA suffer from reasoning hallucinations under complex temporal constraints and lack optimization through dynamic interaction with temporal knowledge graphs, limiting model autonomy and generalization.Method: Proposes TKG-Thinker agent with autonomous planning and adaptive retrieval capabilities. Uses dual-training strategy: 1) Supervised Fine-Tuning with chain-of-thought data for core planning, 2) Reinforcement Learning with multi-dimensional rewards to refine reasoning policies under temporal constraints.
Result: Achieves state-of-the-art performance on benchmark datasets with three open-source LLMs and exhibits strong generalization across complex TKGQA settings.
Conclusion: TKG-Thinker effectively addresses limitations of current prompting strategies by enabling dynamic multi-turn interactions with TKGs and refined reasoning through dual-training, demonstrating superior performance and generalization in temporal reasoning tasks.
Abstract: Temporal knowledge graph question answering (TKGQA) aims to answer time-sensitive questions by leveraging temporal knowledge bases. While Large Language Models (LLMs) demonstrate significant potential in TKGQA, current prompting strategies constrain their efficacy in two primary ways. First, they are prone to reasoning hallucinations under complex temporal constraints. Second, static prompting limits model autonomy and generalization, as it lacks optimization through dynamic interaction with temporal knowledge graph (TKG) environments. To address these limitations, we propose TKG-Thinker, a novel agent equipped with autonomous planning and adaptive retrieval capabilities for reasoning over TKGs. Specifically, TKG-Thinker performs in-depth temporal reasoning through dynamic multi-turn interactions with TKGs via a dual-training strategy. We first apply Supervised Fine-Tuning (SFT) with chain-of-thought data to instill core planning capabilities, followed by a Reinforcement Learning (RL) stage that leverages multi-dimensional rewards to refine reasoning policies under intricate temporal constraints. Experimental results on benchmark datasets with three open-source LLMs show that TKG-Thinker achieves state-of-the-art performance and exhibits strong generalization across complex TKGQA settings.
[322] Learning Compact Boolean Networks
Shengpu Wang, Yuhao Mao, Yani Zhang, Martin Vechev
Main category: cs.AI
TL;DR: Proposes three innovations for Boolean neural networks: learned connections, compact convolutions, and adaptive discretization, achieving better accuracy with up to 37x fewer Boolean operations than prior work.
Details
Motivation: Floating-point neural networks have high inference costs, motivating Boolean networks for resource-constrained settings, but learning compact and accurate Boolean networks is challenging due to their combinatorial nature.Method: Three-pronged approach: 1) Learned connections strategy with no additional parameters, 2) Novel convolutional Boolean architecture exploiting locality with reduced operations, 3) Adaptive discretization strategy to reduce accuracy drop when converting continuous networks to Boolean.
Result: Extensive results on standard vision benchmarks show the Pareto front of accuracy vs. computation significantly outperforms prior state-of-the-art, achieving better accuracy with up to 37x fewer Boolean operations.
Conclusion: The proposed methods enable more efficient Boolean neural networks that significantly reduce computational requirements while maintaining accuracy, making them suitable for resource-constrained vision applications.
Abstract: Floating-point neural networks dominate modern machine learning but incur substantial inference cost, motivating interest in Boolean networks for resource-constrained settings. However, learning compact and accurate Boolean networks is challenging due to their combinatorial nature. In this work, we address this challenge from three different angles: learned connections, compact convolutions and adaptive discretization. First, we propose a novel strategy to learn efficient connections with no additional parameters and negligible computational overhead. Second, we introduce a novel convolutional Boolean architecture that exploits locality with a reduced number of Boolean operations compared to existing methods. Third, we propose an adaptive discretization strategy to reduce the accuracy drop when converting a continuous-valued network into a Boolean one. Extensive results on standard vision benchmarks demonstrate that the Pareto front of accuracy vs. computation of our method significantly outperforms prior state-of-the-art, achieving better accuracy with up to 37x fewer Boolean operations.
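As a loose illustration of the discretization step only (not the paper's learned-connection scheme or its convolutional architecture), the sketch below converts one continuous linear layer into a Boolean layer by binarizing weights to signs and picking a per-neuron firing threshold from calibration data, a simple data-driven stand-in for adaptive discretization; every detail here is an assumption.

```python
import numpy as np

def binarize_layer(weights, calib_bits):
    """Binarize a continuous weight matrix to +/-1 and pick a per-neuron threshold
    from calibration inputs (here the median vote), a data-driven cut rather than a
    fixed threshold at zero. Purely illustrative."""
    signs = np.sign(weights)                               # Boolean-style +/-1 weights
    votes = (2 * calib_bits - 1) @ signs.T                 # map {0,1} -> {-1,+1}, accumulate
    thresholds = np.median(votes, axis=0)
    return signs, thresholds

def boolean_forward(x_bits, signs, thresholds):
    """Forward pass on {0,1} inputs using only integer adds and comparisons."""
    votes = (2 * x_bits - 1) @ signs.T
    return (votes >= thresholds).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))                               # 4 Boolean neurons over 16 input bits
calib = rng.integers(0, 2, size=(256, 16))
signs, thr = binarize_layer(W, calib)
print(boolean_forward(rng.integers(0, 2, size=(3, 16)), signs, thr))
```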
[323] OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang
Main category: cs.AI
TL;DR: OmniVideo-R1 is a reinforced framework for audio-visual understanding that improves mixed-modality reasoning through query-intensive grounding and modality-attentive fusion strategies.
Details
Motivation: Existing omnivideo models face substantial challenges in audio-visual understanding tasks despite humans perceiving the world through diverse synergistic modalities. The paper aims to address these limitations by enabling models to "think with omnimodal cues."Method: Two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms, and (2) modality-attentive fusion built upon contrastive learning paradigms. The framework is reinforced to improve mixed-modality reasoning.
Result: Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
Conclusion: OmniVideo-R1 represents an effective approach for improving audio-visual understanding in omnivideo models through reinforced multimodal reasoning strategies.
Abstract: While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to “think with omnimodal cues” through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
[324] BABE: Biology Arena BEnchmark
Junting Zhou, Jin Chen, Linfeng Hao, Denghui Cao, Zheyu Wang, Qiguang Chen, Chaoyou Fu, Jiaze Chen, Yuchen Wu, Ge Zhang, Mingxuan Wang, Wenhao Huang, Tong Yang
Main category: cs.AI
TL;DR: BABE benchmark evaluates AI’s ability to integrate experimental results with contextual knowledge for scientific reasoning in biology, using real research papers and biological studies.
Details
Motivation: Existing biology benchmarks fail to assess critical researcher skills: integrating experimental results with contextual knowledge to derive meaningful conclusions. There's a need for benchmarks that evaluate experimental reasoning capabilities reflecting real scientific inquiry complexity.Method: Constructed BABE benchmark from peer-reviewed research papers and real-world biological studies. Designed to challenge models with causal reasoning and cross-scale inference tasks that reflect interdisciplinary scientific inquiry complexity.
Result: BABE provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.
Conclusion: BABE addresses the gap in evaluating experimental reasoning capabilities in biology AI systems, providing a benchmark that better reflects real scientific practice and interdisciplinary complexity.
Abstract: The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.
[325] Beyond Manual Planning: Seating Allocation for Large Organizations
Anton Ipsen, Michael Cashmore, Kirsty Fielding, Nicolas Marchesotti, Parisa Zehtabi, Daniele Magazzeni, Manuela Veloso
Main category: cs.AI
TL;DR: HSAP solves hierarchical team seating allocation using probabilistic road maps and integer programming for optimal floor plan assignments.
Details
Motivation: Large organizations need to seat hierarchically structured teams in proximity for better collaboration, but current manual processes are infrequent and suboptimal.Method: Combines probabilistic road maps (PRM) and rapidly-exploring random trees (RRT) for distance calculation, with heuristic search, dynamic programming, and integer programming to solve the hierarchical seating allocation problem.
Result: Demonstrated approach works under different sized instances, evaluated both quantitatively and qualitatively for PRM framework and allocations.
Conclusion: Proposed end-to-end framework effectively solves HSAP, automating what was previously a manual and suboptimal process for hierarchical team seating arrangements.
Abstract: We introduce the Hierarchical Seating Allocation Problem (HSAP), which addresses the optimal assignment of hierarchically structured organizational teams to physical seating arrangements on a floor plan. This problem is driven by the necessity for large organizations with large hierarchies to ensure that teams with close hierarchical relationships are seated in proximity to one another, such as ensuring a research group occupies a contiguous area. Currently, this problem is managed manually, leading to infrequent and suboptimal replanning efforts. To alleviate this manual process, we propose an end-to-end framework to solve the HSAP: a scalable approach that calculates the distance between any pair of seats using a probabilistic road map (PRM) and rapidly-exploring random trees (RRT), combined with heuristic search, dynamic programming, and integer programming to solve the HSAP. We demonstrate our approach on different-sized instances by evaluating the PRM framework and the subsequent allocations both quantitatively and qualitatively.
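On a much smaller scale than the paper's PRM/RRT-plus-integer-programming pipeline, the following sketch shows the flavour of the allocation step: given walking-distance costs between team anchor points and candidate seat blocks (straight-line distances here stand in for PRM path lengths), an optimal one-to-one assignment is computed. All coordinates and the assignment granularity are invented for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical centres of contiguous seat blocks on a floor plan.
seat_blocks = np.array([[0.0, 0.0], [5.0, 1.0], [2.0, 8.0], [9.0, 6.0]])
# Hypothetical anchor points for three teams (e.g., their parent group's location).
team_anchors = np.array([[1.0, 1.0], [8.0, 5.0], [3.0, 7.0]])

# Straight-line distances as a stand-in for PRM/RRT walking distances.
cost = np.linalg.norm(team_anchors[:, None, :] - seat_blocks[None, :, :], axis=-1)

teams, blocks = linear_sum_assignment(cost)   # optimal team -> seat-block matching
for t, b in zip(teams, blocks):
    print(f"team {t} -> seat block {b} (distance {cost[t, b]:.2f})")
```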
[326] Agent2Agent Threats in Safety-Critical LLM Assistants: A Human-Centric Taxonomy
Lukas Stappen, Ahmet Erkan Turan, Johann Hagerer, Georg Groh
Main category: cs.AI
TL;DR: AgentHeLLM: A threat modeling framework for LLM-based vehicle assistants that separates asset identification from attack path analysis to address security risks in automotive AI systems.
Details
Motivation: LLM-based conversational agents in vehicles create novel security challenges at the intersection of AI, automotive safety, and inter-agent communication. Existing AI security frameworks lack proper separation of concerns between what is being protected (assets) and how it is attacked (attack paths), which is critical for safety-critical automotive systems.Method: Proposes AgentHeLLM framework with: 1) Human-centric asset taxonomy derived from harm-oriented “victim modeling” inspired by Universal Declaration of Human Rights, 2) Formal graph-based model distinguishing poison paths (malicious data propagation) from trigger paths (activation actions), and 3) Open-source attack path suggestion tool using bi-level search strategy for multi-stage threat discovery.
Result: Developed a practical framework and tool (AgentHeLLM Attack Path Generator) that automates threat discovery for LLM-based vehicle assistants, addressing the methodological gap in existing AI security frameworks for automotive applications.
Conclusion: AgentHeLLM provides a systematic approach to threat modeling for LLM-based automotive assistants, formally separating asset protection from attack path analysis to improve security in safety-critical vehicle systems with AI conversational agents.
Abstract: The integration of Large Language Model (LLM)-based conversational agents into vehicles creates novel security challenges at the intersection of agentic AI, automotive safety, and inter-agent communication. As these intelligent assistants coordinate with external services via protocols such as Google’s Agent-to-Agent (A2A), they establish attack surfaces where manipulations can propagate through natural language payloads, potentially causing severe consequences ranging from driver distraction to unauthorized vehicle control. Existing AI security frameworks, while foundational, lack the rigorous “separation of concerns” standard in safety-critical systems engineering by co-mingling the concepts of what is being protected (assets) with how it is attacked (attack paths). This paper addresses this methodological gap by proposing a threat modeling framework called AgentHeLLM (Agent Hazard Exploration for LLM Assistants) that formally separates asset identification from attack path analysis. We introduce a human-centric asset taxonomy derived from harm-oriented “victim modeling” and inspired by the Universal Declaration of Human Rights, and a formal graph-based model that distinguishes poison paths (malicious data propagation) from trigger paths (activation actions). We demonstrate the framework’s practical applicability through an open-source attack path suggestion tool, the AgentHeLLM Attack Path Generator, which automates multi-stage threat discovery using a bi-level search strategy.
[327] A Guide to Large Language Models in Modeling and Simulation: From Core Techniques to Critical Challenges
Philippe J. Giabbanelli
Main category: cs.AI
TL;DR: A practical guide on using LLMs in Modeling & Simulation workflows, highlighting common pitfalls and providing principled guidance on design choices, diagnostics, and evaluation.
Details
Motivation: LLMs are increasingly used in M&S workflows, but common practices like prompting, temperature settings, and data augmentation can introduce subtle issues, unnecessary complexity, or inferior results. The paper aims to provide comprehensive practical guidance for informed LLM usage in M&S applications.Method: The paper discusses common sources of confusion including non-determinism, knowledge augmentation (RAG and LoRA), decomposition of M&S data, and hyper-parameter settings. It emphasizes principled design choices, diagnostic strategies, and empirical evaluation approaches.
Result: Provides practical guidance on when, how, and whether to rely on LLMs in M&S workflows, helping modelers avoid common pitfalls like model collapse, unnecessary fine-tuning, and information loss through naive simplifications.
Conclusion: LLMs require careful, principled usage in M&S applications. Modelers need to make informed decisions about LLM integration, considering factors like non-determinism, data decomposition, and appropriate augmentation techniques to achieve optimal results.
Abstract: Large language models (LLMs) have rapidly become familiar tools to researchers and practitioners. Concepts such as prompting, temperature, or few-shot examples are now widely recognized, and LLMs are increasingly used in Modeling & Simulation (M&S) workflows. However, practices that appear straightforward may introduce subtle issues, unnecessary complexity, or may even lead to inferior results. Adding more data can backfire (e.g., deteriorating performance through model collapse or inadvertently wiping out existing guardrails), spending time on fine-tuning a model can be unnecessary without a prior assessment of what it already knows, setting the temperature to 0 is not sufficient to make LLMs deterministic, providing a large volume of M&S data as input can be excessive (LLMs cannot attend to everything) but naive simplifications can lose information. We aim to provide comprehensive and practical guidance on how to use LLMs, with an emphasis on M&S applications. We discuss common sources of confusion, including non-determinism, knowledge augmentation (including RAG and LoRA), decomposition of M&S data, and hyper-parameter settings. We emphasize principled design choices, diagnostic strategies, and empirical evaluation, with the goal of helping modelers make informed decisions about when, how, and whether to rely on LLMs.
[328] Quantum Reinforcement Learning with Transformers for the Capacitated Vehicle Routing Problem
Eva Andrés
Main category: cs.AI
TL;DR: Quantum-enhanced reinforcement learning models outperform classical approaches for Capacitated Vehicle Routing Problem, with hybrid quantum-classical architecture achieving best performance across multiple metrics.
Details
Motivation: To explore quantum-enhanced reinforcement learning approaches for solving complex combinatorial optimization problems like the Capacitated Vehicle Routing Problem, comparing classical, full quantum, and hybrid architectures.Method: Implemented Advantage Actor-Critic (A2C) agent in classical, full quantum, and hybrid variants, integrating transformer architectures with self- and cross-attention mechanisms to capture relationships between vehicles, clients, and depot. Experiments conducted on multi-vehicle scenarios with capacity constraints (20 clients, 4 vehicles) over ten independent runs.
Result: All three approaches learned effective routing policies, but quantum-enhanced models outperformed classical baseline with more robust route organization. Hybrid architecture achieved best overall performance across distance, compactness, and route overlap metrics. Quantum-based models generated more structured and coherent routing solutions.
Conclusion: Hybrid quantum-classical reinforcement learning models show strong potential for addressing complex combinatorial optimization problems like CVRP, with quantum enhancements providing measurable improvements in solution quality and robustness.
Abstract: This paper addresses the Capacitated Vehicle Routing Problem (CVRP) by comparing classical and quantum Reinforcement Learning (RL) approaches. An Advantage Actor-Critic (A2C) agent is implemented in classical, full quantum, and hybrid variants, integrating transformer architectures to capture the relationships between vehicles, clients, and the depot through self- and cross-attention mechanisms. The experiments focus on multi-vehicle scenarios with capacity constraints, considering 20 clients and 4 vehicles, and are conducted over ten independent runs. Performance is assessed using routing distance, route compactness, and route overlap. The results show that all three approaches are capable of learning effective routing policies. However, quantum-enhanced models outperform the classical baseline and produce more robust route organization, with the hybrid architecture achieving the best overall performance across distance, compactness, and route overlap. In addition to quantitative improvements, qualitative visualizations reveal that quantum-based models generate more structured and coherent routing solutions. These findings highlight the potential of hybrid quantum-classical reinforcement learning models for addressing complex combinatorial optimization problems such as the CVRP.
[329] Geographically-aware Transformer-based Traffic Forecasting for Urban Motorway Digital Twins
Krešimir Kušić, Vinny Cahill, Ivana Dusparic
Main category: cs.AI
TL;DR: GATTF: A geographically-aware Transformer model for motorway traffic forecasting that uses mutual information between distributed sensors to capture spatial relationships, improving accuracy without increasing complexity.
Details
Motivation: Digital twins for motorway traffic management require accurate traffic predictions, but existing models struggle with spatio-temporal complexity and non-linear traffic dynamics. While sequence-based deep learning models capture temporal dependencies well, they need improvements in forecasting accuracy and model complexity.Method: Proposes GATTF (Geographically-aware Transformer-based Traffic Forecasting) model that exploits geographical relationships between distributed sensors using mutual information (MI) to enhance spatial awareness in traffic forecasting.
Result: Evaluation on real-time data from Geneva motorway network shows that incorporating geographical awareness through MI enhances GATTF forecasting accuracy compared to standard Transformer models, without increasing model complexity.
Conclusion: Geographical awareness through mutual information improves traffic forecasting accuracy in Transformer models, making GATTF a promising approach for digital twin applications in motorway traffic management.
Abstract: The operational effectiveness of digital-twin technology in motorway traffic management depends on the availability of a continuous flow of high-resolution real-time traffic data. To function as a proactive decision-making support layer within traffic management, a digital twin must also incorporate predicted traffic conditions in addition to real-time observations. Due to the spatio-temporal complexity and the time-variant, non-linear nature of traffic dynamics, predicting motorway traffic remains a difficult problem. Sequence-based deep-learning models offer clear advantages over classical machine learning and statistical models in capturing long-range, temporal dependencies in time-series traffic data, yet limitations in forecasting accuracy and model complexity point to the need for further improvements. To improve motorway traffic forecasting, this paper introduces a Geographically-aware Transformer-based Traffic Forecasting (GATTF) model, which exploits the geographical relationships between distributed sensors using their mutual information (MI). The model has been evaluated using real-time data from the Geneva motorway network in Switzerland, and the results confirm that incorporating geographical awareness through MI enhances the accuracy of GATTF forecasting compared to a standard Transformer, without increasing model complexity.
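A rough sketch of the geographical-awareness ingredient: estimate mutual information between pairs of sensor time series and normalise it into a sensor-to-sensor weight matrix that could, for example, be added to attention logits. The histogram MI estimator and the normalisation below are assumptions, not the paper's recipe.

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Histogram estimate of mutual information (in nats) between two traffic time
    series; a simple stand-in for whatever estimator GATTF uses."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

def mi_bias_matrix(series):
    """Sensor-by-sensor MI matrix, row-normalised so it could be added to attention
    logits as a geographical-awareness bias (one plausible use, not the paper's)."""
    n = len(series)
    mi = np.array([[binned_mi(series[i], series[j]) for j in range(n)] for i in range(n)])
    return mi / (mi.sum(axis=1, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
upstream = rng.normal(size=500).cumsum()
downstream = np.roll(upstream, 3) + 0.1 * rng.normal(size=500)  # lagged, correlated sensor
unrelated = rng.normal(size=500).cumsum()
print(mi_bias_matrix([upstream, downstream, unrelated]).round(2))
```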
[330] Speech Emotion Recognition Leveraging OpenAI’s Whisper Representations and Attentive Pooling Methods
Ali Shendabadi, Parnia Izadirad, Mostafa Salehi, Mahmoud Bijankhan
Main category: cs.AI
TL;DR: Whisper ASR model adapted for speech emotion recognition using attention-based pooling methods, achieving state-of-the-art results on Persian dataset with lightweight architecture.
Details
Motivation: Speech Emotion Recognition (SER) research is limited by lack of large standard datasets, and pre-trained models like Whisper offer potential for extracting emotional features from speech.Method: Proposed two attention-based pooling methods (Multi-head Attentive Average Pooling and QKV Pooling) to reduce dimensionality of Whisper representations while preserving emotional features. Tested on English (IEMOCAP) and Persian (ShEMO) datasets using Whisper Tiny and Small models.
Result: The multi-head QKV architecture achieved state-of-the-art results on the ShEMO dataset with a 2.47% improvement in unweighted accuracy. Intermediate Whisper encoder layers performed better for SER on the Persian dataset, providing a lightweight alternative to larger models like HuBERT X-Large.
Conclusion: Whisper shows strong potential as representation extractor for SER, and attention-based pooling is effective for dimension reduction while preserving emotional features across languages.
Abstract: Speech Emotion Recognition (SER) research has faced limitations due to the lack of standard and sufficiently large datasets. Recent studies have leveraged pre-trained models to extract features for downstream tasks such as SER. This work explores the capabilities of Whisper, a pre-trained ASR system, in speech emotion recognition by proposing two attention-based pooling methods, Multi-head Attentive Average Pooling and QKV Pooling, designed to efficiently reduce the dimensionality of Whisper representations while preserving emotional features. We experiment on English and Persian, using the IEMOCAP and ShEMO datasets respectively, with Whisper Tiny and Small. Our multi-head QKV architecture achieves state-of-the-art results on the ShEMO dataset, with a 2.47% improvement in unweighted accuracy. We further compare the performance of different Whisper encoder layers and find that intermediate layers often perform better for SER on the Persian dataset, providing a lightweight and efficient alternative to much larger models such as HuBERT X-Large. Our findings highlight the potential of Whisper as a representation extractor for SER and demonstrate the effectiveness of attention-based pooling for dimension reduction.
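The pooling idea can be illustrated with a short PyTorch module that collapses frame-level Whisper encoder states into a single utterance embedding via learned multi-head attention weights; the head count, the 384-dimensional width (Whisper Tiny), and the classifier head are illustrative and may differ from the paper's MHAAP/QKV designs.

```python
import torch
import torch.nn as nn

class AttentiveAvgPool(nn.Module):
    """Collapse a (batch, frames, dim) sequence of encoder states into one utterance
    vector with learned attention weights; a minimal stand-in for the multi-head
    attentive pooling described in the paper."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.score = nn.Linear(dim, heads)          # one attention logit per head per frame
        self.out = nn.Linear(dim * heads, dim)

    def forward(self, hidden):                      # hidden: (B, T, D)
        weights = self.score(hidden).softmax(dim=1) # normalise over frames, per head
        pooled = torch.einsum("bth,btd->bhd", weights, hidden)  # per-head weighted mean
        return self.out(pooled.flatten(1))          # (B, D)

pool = AttentiveAvgPool(dim=384)                    # 384 matches the Whisper Tiny width
frames = torch.randn(2, 1500, 384)                  # e.g., frozen encoder outputs
emotion_logits = nn.Linear(384, 6)(pool(frames))    # 6 emotion classes, illustrative
print(emotion_logits.shape)
```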
[331] AgenticPay: A Multi-Agent LLM Negotiation System for Buyer-Seller Transactions
Xianyang Liu, Shangding Gu, Dawn Song
Main category: cs.AI
TL;DR: AgenticPay is a benchmark and simulation framework for evaluating multi-agent buyer-seller negotiation using natural language, featuring over 110 tasks with structured metrics for feasibility, efficiency, and welfare.
Details
Motivation: Existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple LLM-based agents, which are increasingly expected to negotiate, coordinate, and transact autonomously.Method: The framework models markets where buyers and sellers have private constraints and product-dependent valuations, requiring multi-round linguistic negotiation rather than just numeric bidding. It includes structured action extraction and supports diverse tasks from bilateral bargaining to many-to-many markets.
Result: Benchmarking state-of-the-art LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning.
Conclusion: AgenticPay establishes a foundation for studying agentic commerce and language-based market interaction, providing a comprehensive evaluation framework for multi-agent economic negotiations.
Abstract: Large language model (LLM)-based agents are increasingly expected to negotiate, coordinate, and transact autonomously, yet existing benchmarks lack principled settings for evaluating language-mediated economic interaction among multiple agents. We introduce AgenticPay, a benchmark and simulation framework for multi-agent buyer-seller negotiation driven by natural language. AgenticPay models markets in which buyers and sellers possess private constraints and product-dependent valuations, and must reach agreements through multi-round linguistic negotiation rather than numeric bidding alone. The framework supports a diverse suite of over 110 tasks ranging from bilateral bargaining to many-to-many markets, with structured action extraction and metrics for feasibility, efficiency, and welfare. Benchmarking state-of-the-art proprietary and open-weight LLMs reveals substantial gaps in negotiation performance and highlights challenges in long-horizon strategic reasoning, establishing AgenticPay as a foundation for studying agentic commerce and language-based market interaction. Code and dataset are available at the link: https://github.com/SafeRL-Lab/AgenticPay.
[332] Learning Event-Based Shooter Models from Virtual Reality Experiments
Christopher A. McClurg, Alan R. Wagner
Main category: cs.AI
TL;DR: A data-driven discrete-event simulator (DES) for evaluating school security interventions using VR-derived behavioral data to model shooter movements and enable scalable testing of robot-based intervention strategies.
Details
Motivation: VR is effective for evaluating school security measures but requires recruiting new participants for each condition, making large-scale or iterative evaluation difficult, especially for learning effective intervention strategies that need many training episodes.Method: Developed a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies, then used the simulator to examine robot-based shooter intervention strategies.
Result: The DES reproduces key empirical patterns from VR studies and enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects.
Conclusion: This work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.
Abstract: Virtual reality (VR) has emerged as a powerful tool for evaluating school security measures in high-risk scenarios such as school shootings, offering experimental control and high behavioral fidelity. However, assessing new interventions in VR requires recruiting new participant cohorts for each condition, making large-scale or iterative evaluation difficult. These limitations are especially restrictive when attempting to learn effective intervention strategies, which typically require many training episodes. To address this challenge, we develop a data-driven discrete-event simulator (DES) that models shooter movement and in-region actions as stochastic processes learned from participant behavior in VR studies. We use the simulator to examine the impact of a robot-based shooter intervention strategy. Once shown to reproduce key empirical patterns, the DES enables scalable evaluation and learning of intervention strategies that are infeasible to train directly with human subjects. Overall, this work demonstrates a high-to-mid fidelity simulation workflow that provides a scalable surrogate for developing and evaluating autonomous school-security interventions.
[333] DyTopo: Dynamic Topology Routing for Multi-Agent Reasoning via Semantic Matching
Yuxing Lu, Yucheng Hu, Xukai Zhao, Jiuxin Cao
Main category: cs.AI
TL;DR: DyTopo is a manager-guided multi-agent framework that dynamically reconstructs sparse directed communication graphs at each reasoning round using semantic matching of agent needs and offers.
Details
Motivation: Existing multi-agent LLM systems use fixed communication patterns that don't adapt to stage-dependent needs of iterative problem solving, limiting efficiency and effectiveness.Method: DyTopo uses a manager to set round goals, then each agent outputs natural-language query (need) and key (offer) descriptors. These are embedded and semantically matched to create sparse directed communication graphs, routing private messages only along induced edges.
Result: DyTopo consistently outperforms strongest baselines by average +6.2% across code generation and mathematical reasoning benchmarks with four LLM backbones, while providing interpretable coordination traces.
Conclusion: Dynamic communication graph reconstruction improves multi-agent reasoning by adapting to stage-dependent needs, offering both performance gains and interpretability through evolving coordination patterns.
Abstract: Multi-agent systems built from prompted large language models can improve multi-round reasoning, yet most existing pipelines rely on fixed, trajectory-wide communication patterns that are poorly matched to the stage-dependent needs of iterative problem solving. We introduce DyTopo, a manager-guided multi-agent framework that reconstructs a sparse directed communication graph at each round. Conditioned on the manager’s round goal, each agent outputs lightweight natural-language query (need) and key (offer) descriptors; DyTopo embeds these descriptors and performs semantic matching, routing private messages only along the induced edges. Across code generation and mathematical reasoning benchmarks and four LLM backbones, DyTopo consistently outperforms the strongest baseline (avg. +6.2%). Beyond accuracy, DyTopo yields an interpretable coordination trace via the evolving graphs, enabling qualitative inspection of how communication pathways reconfigure across rounds.
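A toy sketch of the routing step: embed each agent's need (query) and offer (key) descriptors, and add a directed edge from offerer to needer when their cosine similarity clears a threshold. The bag-of-words embedder and the threshold are placeholders; a real system would use a proper sentence-embedding model.

```python
import numpy as np

rng = np.random.default_rng(0)
_token_vecs = {}

def embed(text, dim=64):
    """Toy bag-of-words embedder with random token vectors; a stand-in for a real
    sentence-embedding model."""
    toks = text.lower().split()
    for t in toks:
        _token_vecs.setdefault(t, rng.normal(size=dim))
    v = sum(_token_vecs[t] for t in toks)
    return v / np.linalg.norm(v)

def route(needs, offers, threshold=0.3):
    """Build a sparse directed communication graph: agent i listens to agent j when
    i's stated need matches j's stated offer closely enough."""
    edges = []
    for i, need in enumerate(needs):
        for j, offer in enumerate(offers):
            if i != j and float(embed(need) @ embed(offer)) > threshold:
                edges.append((j, i))      # message flows from offerer j to needer i
    return edges

needs = ["need unit tests for the parser", "need big-O complexity analysis", "need nothing this round"]
offers = ["offer parser unit tests", "offer complexity analysis of the algorithm", "offer a final code review"]
print(route(needs, offers))
```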
[334] Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu
Main category: cs.AI
TL;DR: Evaluation of 12 LLMs and 1 specialized fact-verifier on 14 fact-checking benchmarks reveals three key findings: dataset quality issues affect rankings, frontier LLMs with few-shot examples perform best but are costly, and small fine-tuned models can be improved with synthetic multi-hop reasoning data.
Details
Motivation: Fact verification is crucial for reliable LLM applications, but current evaluations may be misleading due to dataset issues and incomplete baseline comparisons. The study aims to provide guidance for developing more robust fact verifiers.Method: Evaluated 12 pre-trained LLMs and 1 specialized fact-verifier on examples from 14 fact-checking benchmarks. Used systematic pipeline with LLM-as-a-judge to identify annotation errors and ambiguous data. Tested frontier LLMs with few-shot in-context examples and developed small fine-tuned models augmented with synthetic multi-hop reasoning data.
Result: 1) Found ~16% ambiguous/incorrectly labeled data significantly influences model rankings. 2) Frontier LLMs with few-shot examples achieve top-tier performance. 3) Small fine-tuned models have room for improvement, especially on complex reasoning, but synthetic multi-hop reasoning data significantly enhances their capabilities.
Conclusion: Future fact-verifier development should address dataset quality issues, include frontier LLM baselines with few-shot examples, and focus on improving small models through synthetic data augmentation for complex reasoning tasks.
Abstract: Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.
[335] DeepAgent: A General Reasoning Agent with Scalable Toolsets
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Main category: cs.AI
TL;DR: DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single coherent reasoning process, featuring memory folding for long-horizon interactions and ToolPO reinforcement learning for efficient tool use.
Details
Motivation: Real-world tasks often require external tools and long-horizon interactions, but existing agent frameworks typically follow predefined workflows that limit autonomous and global task completion.Method: Introduces DeepAgent with autonomous memory folding (compressing past interactions into episodic, working, and tool memories) and ToolPO reinforcement learning (leveraging LLM-simulated APIs with tool-call advantage attribution for fine-grained credit assignment).
Result: Extensive experiments on eight benchmarks including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE) demonstrate consistent outperformance over baselines across both labeled-tool and open-set tool retrieval scenarios.
Conclusion: DeepAgent provides an effective framework for autonomous reasoning with external tools through its memory management and reinforcement learning approach for tool use.
Abstract: Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To manage long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. The code and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
[336] Can MLLMs generate human-like feedback in grading multimodal short answers?
Pritam Sil, Pushpak Bhattacharyya, Pawan Goyal, Ganesh Ramakrishnan
Main category: cs.AI
TL;DR: Introduces Multimodal Short Answer Grading with Feedback (MMSAF) for evaluating text+diagram responses, creates dataset using LLM hallucinations, and benchmarks MLLMs on STEM grading tasks.
Details
Motivation: Traditional ASAG only handles text, but real assessments include multimodal responses with diagrams and text. Need to jointly evaluate both modalities and provide explanatory feedback.Method: Developed automated data generation framework using LLM hallucinations to mimic student errors, creating 2,197 instances. Evaluated 4 MLLMs across 3 STEM subjects on correctness prediction and image relevance assessment.
Result: MLLMs achieved up to 62.5% accuracy in predicting answer correctness (correct/partially correct/incorrect) and up to 80.36% in assessing image relevance. Human evaluation with 9 annotators across 5 parameters including rubric-based feedback quality assessment.
Conclusion: Identifies which MLLMs are better suited for multimodal grading tasks while highlighting remaining drawbacks. Rubric-based approach provides semantic evaluation of feedback quality beyond overlap metrics.
Abstract: In education, the traditional Automatic Short Answer Grading (ASAG) with feedback problem has focused primarily on evaluating text-only responses. However, real-world assessments often include multimodal responses containing both diagrams and text. To address this limitation, we introduce the Multimodal Short Answer Grading with Feedback (MMSAF) problem, which requires jointly evaluating textual and diagrammatic content while also providing explanatory feedback. Collecting data representative of such multimodal responses is challenging due to both scale and logistical constraints. To mitigate this, we develop an automated data generation framework that leverages LLM hallucinations to mimic common student errors, thereby constructing a dataset of 2,197 instances. We evaluate 4 Multimodal Large Language Models (MLLMs) across 3 STEM subjects, showing that MLLMs achieve accuracies of up to 62.5% in predicting answer correctness (correct/partially correct/incorrect) and up to 80.36% in assessing image relevance. This also includes a human evaluation with 9 annotators across 5 parameters, including a rubric-based approach. The rubrics also serve as a way to evaluate the feedback quality semantically rather than using overlap-based approaches. Our findings highlight which MLLMs are better suited for such tasks while also pointing out drawbacks of the remaining MLLMs.
[337] Are foundation models useful feature extractors for electroencephalography analysis?
Özgün Turgut, Felix S. Bott, Markus Ploner, Daniel Rueckert
Main category: cs.AI
TL;DR: General-purpose time series foundation models are competitive with specialized EEG models for medical applications like age prediction, seizure detection, and EEG event classification, reducing need for large task-specific datasets.
Details
Motivation: To investigate whether general-purpose time series foundation models can be effective for medical applications with limited data, specifically in EEG analysis, where specialized models typically require large task-specific datasets.Method: Evaluated general-purpose time series models on EEG tasks including age prediction, seizure detection, and classification of clinically relevant EEG events. Compared their performance against specialized EEG models and assessed the quality of extracted features.
Result: General-purpose models are competitive with specialized EEG models and capture features useful for localizing demographic and disease-related biomarkers.
Conclusion: Foundational time series models can reduce reliance on large task-specific datasets and models, making them valuable in clinical practice for medical applications with limited data.
Abstract: The success of foundation models in natural language processing and computer vision has motivated similar approaches in time series analysis. While foundational time series models have proven beneficial on a variety of tasks, their effectiveness in medical applications with limited data remains underexplored. In this work, we investigate this question in the context of electroencephalography (EEG) by evaluating general-purpose time series models on age prediction, seizure detection, and classification of clinically relevant EEG events. We compare their diagnostic performance against specialised EEG models and assess the quality of the extracted features. The results show that general-purpose models are competitive and capture features useful for localising demographic and disease-related biomarkers. These findings indicate that foundational time series models can reduce the reliance on large task-specific datasets and models, making them valuable in clinical practice.
[338] SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
Philipp D. Siedler
Main category: cs.AI
TL;DR: A novel dataset for benchmarking LLMs’ physical and spatial reasoning capabilities using topology optimization problems in 2D structural design scenarios.
Details
Motivation: To create a benchmark that evaluates LLMs' ability to reason about physical and spatial relationships in structural design problems, complementing traditional language and logic benchmarks.Method: Developed a dataset based on topology optimization where LLMs are given 2D boundary conditions, applied forces, and supports, and must reason about optimal material distributions without access to simulation tools.
Result: Created a dataset with various tasks including filling masked regions and predicting complete material distributions, challenging models to understand force flow and structural stability.
Conclusion: The dataset provides a new benchmark for evaluating spatial and physical reasoning in 2D settings, offering insights into LLMs’ capabilities beyond traditional language tasks.
Abstract: We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLM) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as 2D boundary, applied forces and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.
[339] The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution
Chen Qian, Peng Wang, Dongrui Liu, Junyao Yang, Dadi Guo, Ling Tang, Jilin Mei, Qihan Ren, Shuai Shao, Yong Liu, Jie Fu, Jing Shao, Xia Hu
Main category: cs.AI
TL;DR: A framework for general agentic attribution that identifies internal factors driving LLM-based agent actions, using hierarchical temporal likelihood dynamics and perturbation analysis to pinpoint historical events and textual evidence behind behaviors.
Details
Motivation: As LLM-based agents become more autonomous and deployed at scale, understanding why agents take particular actions is crucial for accountability and governance. Existing research focuses only on failure attribution for unsuccessful trajectories, which is insufficient for explaining the reasons behind agent behaviors regardless of task outcome.Method: Proposes a hierarchical framework: 1) At component level, uses temporal likelihood dynamics to identify critical interaction steps; 2) At sentence level, refines localization using perturbation-based analysis to isolate specific textual evidence driving agent decisions.
Result: The framework reliably pinpoints pivotal historical events and sentences behind agent behavior across diverse scenarios including standard tool use and subtle reliability risks like memory-induced bias.
Conclusion: The proposed framework offers a critical step toward safer and more accountable agentic systems by providing general attribution for agent behaviors beyond just failure analysis.
Abstract: Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. As these systems become more autonomous and are deployed at scale, understanding why an agent takes a particular action becomes increasingly important for accountability and governance. However, existing research predominantly focuses on failure attribution to localize explicit errors in unsuccessful trajectories, which is insufficient for explaining the reason behind agent behaviors. To bridge this gap, we propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. Our framework operates hierarchically to manage the complexity of agent interactions. Specifically, at the component level, we employ temporal likelihood dynamics to identify critical interaction steps; then at the sentence level, we refine this localization using perturbation-based analysis to isolate the specific textual evidence. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias. Experimental results demonstrate that the proposed framework reliably pinpoints pivotal historical events and sentences behind the agent behavior, offering a critical step toward safer and more accountable agentic systems. Code is available at https://github.com/AI45Lab/AgentDoG.
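At the component level, the temporal-likelihood idea can be sketched as follows: if one can score the log-likelihood of the agent's eventual action under the history truncated at each step, the steps whose inclusion most increases that likelihood are flagged as critical. The scores below are synthetic; in practice they would come from the model under study, and the sentence-level perturbation stage is not shown.

```python
import numpy as np

def critical_steps(step_logliks, top_k=2):
    """Component-level attribution sketch: step_logliks[t] is the log-likelihood of
    the agent's eventual action given the interaction history up to step t. Steps
    whose inclusion produces the largest likelihood jump are returned as critical."""
    logliks = np.asarray(step_logliks, dtype=float)
    jumps = np.diff(logliks, prepend=logliks[0])     # marginal contribution of each step
    return [int(i) for i in np.argsort(jumps)[::-1][:top_k]]

# Synthetic trajectory: steps 3 and 7 (say, a memory retrieval and a tool result)
# push the final action from unlikely to near-certain.
ll = [-6.0, -5.9, -5.8, -2.5, -2.4, -2.4, -2.3, -0.4, -0.4, -0.3]
print(critical_steps(ll))   # -> [3, 7]
```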
[340] Robust Answers, Fragile Logic: Probing the Decoupling Hypothesis in LLM Reasoning
Enyi Jiang, Changming Xu, Nischay Singh, Tian Qiu, Gagandeep Singh
Main category: cs.AI
TL;DR: MATCHA introduces an Answer-Conditioned Probing framework to test the faithfulness of Chain-of-Thought reasoning in LLMs, revealing that models often maintain correct answers while generating inconsistent rationales under perturbations.
Details
Motivation: The paper investigates whether Chain-of-Thought (CoT) reasoning in LLMs is genuinely faithful or merely post-hoc rationalization, addressing concerns that correct answers might mask fragile reasoning that isn't causally tied to predictions.Method: Introduces MATCHA, an Answer-Conditioned Probing framework that isolates reasoning by conditioning generation on the model’s predicted answer, allowing stress-testing of rationale stability under imperceptible input perturbations.
Result: LLMs frequently maintain correct answers while generating inconsistent or nonsensical reasoning under perturbations (“Right for the Wrong Reasons”), with multi-step and commonsense tasks being more vulnerable than logical tasks. Adversarial examples transfer to black-box models.
Conclusion: CoT reasoning in LLMs suffers from robustness issues, exposing an illusion of faithfulness. Future architectures need to enforce genuine answer-reasoning consistency rather than relying on surface-level accuracy.
Abstract: While Chain-of-Thought (CoT) prompting has become a cornerstone for complex reasoning in Large Language Models (LLMs), the faithfulness of the generated reasoning remains an open question. We investigate the Decoupling Hypothesis: that correct answers often mask fragile, post-hoc rationalizations that are not causally tied to the model’s prediction. To systematically verify this, we introduce MATCHA, a novel Answer-Conditioned Probing framework. Unlike standard evaluations that focus on final output accuracy, MATCHA isolates the reasoning phase by conditioning generation on the model’s predicted answer, allowing us to stress-test the stability of the rationale itself. Our experiments reveal a critical vulnerability: under imperceptible input perturbations, LLMs frequently maintain the correct answer while generating inconsistent or nonsensical reasoning - effectively being “Right for the Wrong Reasons”. Using LLM judges to quantify this robustness gap, we find that multi-step and commonsense tasks are significantly more susceptible to this decoupling than logical tasks. Furthermore, we demonstrate that adversarial examples generated by MATCHA transfer non-trivially to black-box models. Our findings expose the illusion of CoT robustness and underscore the need for future architectures that enforce genuine answer-reasoning consistency rather than mere surface-level accuracy.
[341] Interpretability by Design for Efficient Multi-Objective Reinforcement Learning
Qiyue Xia, Tianwei Wang, J. Michael Herrmann
Main category: cs.AI
TL;DR: LLE-MORL: Interpretable multi-objective reinforcement learning using locally linear mapping between parameter and performance spaces for efficient Pareto front generation.
Details
Motivation: Multi-objective RL needs to optimize conflicting goals and find diverse non-dominated policies forming Pareto fronts, but current methods lack interpretability and efficiency in generating high-quality solutions.Method: Uses a training scheme based on local relationships between parameter space and performance space, exploiting locally linear maps to interpret policy parameters in terms of objectives, enabling efficient search in contiguous solution domains.
Result: Experiments across diverse continuous control domains show LLE-MORL consistently achieves higher Pareto front quality and efficiency than state-of-the-art approaches.
Conclusion: LLE-MORL provides interpretability by design and enables rapid generation of high-quality solutions without extensive retraining, advancing multi-objective RL capabilities.
Abstract: Multi-objective reinforcement learning (MORL) aims at optimising several, often conflicting goals to improve the flexibility and reliability of RL in practical tasks. This is typically achieved by finding a set of diverse, non-dominated policies that form a Pareto front in the performance space. We introduce LLE-MORL, an approach that achieves interpretability by design by utilising a training scheme based on the local relationship between the parameter space and the performance space. By exploiting a locally linear map between these spaces, our method provides an interpretation of policy parameters in terms of the objectives, and this structured representation enables an efficient search within contiguous solution domains, allowing for the rapid generation of high-quality solutions without extensive retraining. Experiments across diverse continuous control domains demonstrate that LLE-MORL consistently achieves higher Pareto front quality and efficiency than state-of-the-art approaches.
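The locally linear map can be pictured with a least-squares sketch: fit an affine map from policy-parameter offsets to objective returns over a small neighbourhood of evaluated policies, then step the parameters in whatever direction the fitted map says moves performance toward a desired trade-off. This is only an interpretation aid under assumed linear-Gaussian toy data, not the paper's training scheme.

```python
import numpy as np

def local_linear_map(params, returns):
    """Fit a local affine map so that returns ~= J @ (theta - theta0) + r0 from a
    neighbourhood of evaluated policies; each parameter direction can then be read
    as a movement in objective space."""
    theta0, r0 = params.mean(axis=0), returns.mean(axis=0)
    J, *_ = np.linalg.lstsq(params - theta0, returns - r0, rcond=None)
    return theta0, r0, J.T          # J.T maps parameter offsets to objective offsets

def step_towards(theta0, J, target_direction, step=0.1):
    """Move parameters along the direction whose image under the local map best aligns
    with a desired objective trade-off (e.g. 'more of objective 1, hold objective 2')."""
    direction, *_ = np.linalg.lstsq(J, np.asarray(target_direction, float), rcond=None)
    return theta0 + step * direction / (np.linalg.norm(direction) + 1e-8)

rng = np.random.default_rng(0)
thetas = rng.normal(size=(32, 6))                       # nearby policy parameters
true_J = rng.normal(size=(2, 6))                        # unknown local sensitivities
rets = thetas @ true_J.T + 0.01 * rng.normal(size=(32, 2))
theta0, r0, J = local_linear_map(thetas, rets)
print(step_towards(theta0, J, target_direction=[1.0, 0.0]).round(2))
```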
[342] AlphaBeta is not as good as you think: a simple class of synthetic games for a better analysis of deterministic game-solving algorithms
Raphaël Boige, Amine Boumaza, Bruno Scherrer
Main category: cs.AI
TL;DR: The paper introduces a new probabilistic model for game-tree analysis that incorporates ancestor dependencies, challenging traditional independence assumptions and revealing practical performance differences between algorithms like AlphaBeta and Scout.
Details
Motivation: Traditional game-solving algorithm analysis uses simplified models with independent leaf values, which strips games of structural complexity and produces trivial instances. This fails to capture the meaningful challenges algorithms face in real-world games with structural dependencies.Method: The authors introduce a class of synthetic games generated by a probabilistic model that incrementally constructs game-trees using fixed level-wise conditional distributions. This enforces ancestor dependencies, a critical structural feature of real games, while retaining analytical tractability.
Result: For algorithms including AlphaBeta and Scout, recursive formulas characterize their average-case complexities under this new model. While asymptotically all algorithms converge to identical branching factors, deep finite trees reveal stark practical differences: AlphaBeta incurs significantly larger constant multiplicative factors compared to Scout, leading to substantial practical slowdown.
Conclusion: The framework provides rigorous evidence and analytical tools to advance understanding of game-solving algorithms under a richer, more challenging, yet tractable model that better captures real-world game complexity.
Abstract: Deterministic game-solving algorithms are conventionally analyzed in the light of their average-case complexity against a distribution of random game-trees, where leaf values are independently sampled from a fixed distribution. This simplified model enables uncluttered mathematical analysis, revealing two key properties: root value distributions asymptotically collapse to a single fixed value for finite-valued trees, and all reasonable algorithms achieve global optimality. However, these findings are artifacts of the model’s design: its long-criticized independence assumption strips games of structural complexity, producing trivial instances where no algorithm faces meaningful challenges. To address this limitation, we introduce a class of synthetic games generated by a probabilistic model that incrementally constructs game-trees using a fixed level-wise conditional distribution. By enforcing ancestor dependencies, a critical structural feature of real-world games, our framework generates problems with adjustable difficulty while retaining some form of analytical tractability. For several algorithms, including AlphaBeta and Scout, we derive recursive formulas characterizing their average-case complexities under this model. These allow us to rigorously compare algorithms on deep game-trees, where Monte-Carlo simulations are no longer feasible. While asymptotically all algorithms seem to converge to an identical branching factor (a result analogous to that of independence-based models), deep finite trees reveal stark differences: AlphaBeta incurs a significantly larger constant multiplicative factor compared to algorithms like Scout, leading to a substantial practical slowdown. Our framework sheds new light on classical game-solving algorithms, offering rigorous evidence and analytical tools to advance the understanding of these methods under a richer, more challenging, and yet tractable model.
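The generative idea (child values drawn conditionally on their ancestor's value) and the quantity being compared (leaves evaluated) can be illustrated in a few lines. The Gaussian random-walk conditional below is an arbitrary stand-in for the paper's level-wise conditional distributions.

```python
import random

def make_tree(depth, branching, parent_value=0.0, noise=1.0):
    """Generate a synthetic game-tree in which each child's latent value is its
    parent's value plus noise, so leaf payoffs depend on their ancestors; a simple
    stand-in for the paper's level-wise conditional distributions."""
    if depth == 0:
        return parent_value
    return [make_tree(depth - 1, branching, parent_value + random.gauss(0, noise), noise)
            for _ in range(branching)]

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf"), counter=None):
    """Standard alpha-beta search; counter['leaves'] tracks leaves evaluated."""
    if not isinstance(node, list):
        counter["leaves"] += 1
        return node
    value = float("-inf") if maximizing else float("inf")
    for child in node:
        v = alphabeta(child, not maximizing, alpha, beta, counter)
        if maximizing:
            value, alpha = max(value, v), max(alpha, v)
        else:
            value, beta = min(value, v), min(beta, v)
        if beta <= alpha:
            break
    return value

random.seed(0)
tree = make_tree(depth=8, branching=3)
stats = {"leaves": 0}
root = alphabeta(tree, maximizing=True, counter=stats)
print(f"root value {root:.2f}, leaves evaluated {stats['leaves']} of {3**8}")
```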
[343] Explanations are a Means to an End: Decision Theoretic Explanation Evaluation
Ziyang Guo, Berk Ustun, Jessica Hullman
Main category: cs.AI
TL;DR: A decision-theoretic framework for evaluating explanations based on their practical value in improving decision-making tasks, with three measurable estimands: theoretical benchmark, human-complementary value, and behavioral value.
Details
Motivation: Current evaluation of model explanations relies on proxy properties that are weakly tied to their practical purposes. There's a need for a principled framework that directly measures the value explanations provide in real-world decision-making contexts.
Method: Proposes a decision-theoretic framework treating explanations as information signals, with three estimands: 1) theoretical benchmark (upper bound on achievable performance), 2) human-complementary value (theoretically attainable value not captured by baseline human policy), and 3) behavioral value (causal effect of providing explanations to humans).
Result: The framework provides a practical validation workflow applied to assess explanation potential and interpret behavioral effects in human-AI decision support and mechanistic interpretability contexts.
Conclusion: The decision-theoretic approach offers a principled way to evaluate explanations based on their actual utility in decision-making tasks, moving beyond proxy metrics to measure real-world value.
Abstract: Explanations of model behavior are commonly evaluated via proxy properties weakly tied to the purposes explanations serve in practice. We contribute a decision theoretic framework that treats explanations as information signals valued by the expected improvement they enable on a specified decision task. This approach yields three distinct estimands: 1) a theoretical benchmark that upperbounds achievable performance by any agent with the explanation, 2) a human-complementary value that quantifies the theoretically attainable value that is not already captured by a baseline human decision policy, and 3) a behavioral value representing the causal effect of providing the explanation to human decision-makers. We instantiate these definitions in a practical validation workflow, and apply them to assess explanation potential and interpret behavioral effects in human-AI decision support and mechanistic interpretability.
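A toy numeric instantiation of the three estimands on a binary decision task; the prior, the signal fidelity, and the with/without-explanation study accuracies are invented for illustration and are not from the paper.

```python
# Toy decision problem: binary state y, explanation modeled as a noisy signal s.
p_y1 = 0.30              # prior that the true state is 1
p_signal_correct = 0.90  # assumed fidelity of the explanation signal

# Joint distribution over (y, s).
joint = {}
for y in (0, 1):
    p_y = p_y1 if y == 1 else 1 - p_y1
    for s in (0, 1):
        p_s_given_y = p_signal_correct if s == y else 1 - p_signal_correct
        joint[(y, s)] = p_y * p_s_given_y

# 1) Theoretical benchmark: expected utility of the Bayes-optimal policy given s.
benchmark = 0.0
for s in (0, 1):
    # utility of predicting a under signal s is P(y = a, s); take the best action
    benchmark += max(joint[(a, s)] for a in (0, 1))

# 2) Human-complementary value: benchmark minus the baseline human policy's value.
baseline_prediction = 0  # e.g. a human who always predicts the majority class
baseline_value = sum(p for (y, s), p in joint.items() if y == baseline_prediction)
complementary_value = benchmark - baseline_value

# 3) Behavioral value: causal effect measured in a user study (placeholder numbers).
acc_with_explanation, acc_without = 0.82, 0.70
behavioral_value = acc_with_explanation - acc_without

print(f"benchmark={benchmark:.2f}, complementary={complementary_value:.2f}, "
      f"behavioral={behavioral_value:.2f}")   # 0.90, 0.20, 0.12
```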
[344] Dynamic Context Adaptation for Consistent Role-Playing Agents with Retrieval-Augmented Generations
Jeiyoon Park, Yongshin Han, Minseop Kim, Kisu Yang
Main category: cs.AI
TL;DR: Amadeus is a training-free framework that enhances persona consistency in retrieval-augmented generation (RAG)-based role-playing agents, addressing hallucination issues when characters lack relevant knowledge, accompanied by CharacterRAG dataset for evaluation.
Details
Motivation: Role-playing agents (RPAs) face challenges in faithfully emulating characters due to resource-intensive data collection and model updates. RAG-based approaches are practical but under-researched, with existing methods prone to hallucination when characters lack relevant knowledge.
Method: Proposes Amadeus, a training-free framework that enhances persona consistency in RAG-based RPAs. Also introduces CharacterRAG dataset with 15 fictional characters’ persona documents (976K characters) and 450 question-answer pairs for development and evaluation.
Result: The proposed method effectively models not only character knowledge but also various attributes like personality, significantly enhancing persona consistency even when responding to questions beyond a character’s knowledge.
Conclusion: Amadeus provides a practical solution for improving RAG-based role-playing agents without requiring training, while CharacterRAG dataset enables better development and evaluation of such systems.
Abstract: Building role-playing agents (RPAs) that faithfully emulate specific characters remains challenging because collecting character-specific utterances and continually updating model parameters are resource-intensive, making retrieval-augmented generation (RAG) a practical necessity. However, despite the importance of RAG, there has been little research on RAG-based RPAs. For example, we empirically find that when a persona lacks knowledge relevant to a given query, RAG-based RPAs are prone to hallucination, making it challenging to generate accurate responses. In this paper, we propose Amadeus, a training-free framework that can significantly enhance persona consistency even when responding to questions that lie beyond a character’s knowledge. In addition, to underpin the development and rigorous evaluation of RAG-based RPAs, we manually construct CharacterRAG, a role-playing dataset that consists of persona documents for 15 distinct fictional characters totaling 976K written characters, and 450 question-answer pairs. We find that our proposed method effectively models not only the knowledge possessed by characters, but also various attributes such as personality.
[345] Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting
Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick
Main category: cs.AI
TL;DR: Proposes a risk control method using distribution-free risk control (DFRC) with dynamic early exit to protect LLMs from harmful context while maintaining performance on helpful inputs.
Details
Motivation: LLMs are vulnerable to harmful or irrelevant context that degrades performance ("garbage in, garbage out"), requiring mechanisms to guard against such scenarios while maintaining benefits from helpful context.
Method: Defines baseline safe behavior (zero-shot performance), applies DFRC to control performance decay below baseline, uses dynamic early exit prediction to ignore attention heads that attend to unsafe inputs, and modifies DFRC for both risk control on harmful inputs and efficiency gains on helpful inputs.
Result: Theoretical and empirical results across 9 tasks show effective risk control for harmful context while achieving substantial computational efficiency gains with helpful context.
Conclusion: The approach provides principled protection against harmful context degradation while maintaining or improving efficiency with helpful inputs, addressing the garbage-in-garbage-out problem in LLMs.
Abstract: Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model – the model’s performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.
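A rough sketch of the risk-control calibration step in the spirit of distribution-free risk control, assuming a held-out calibration set and a monotone family of exit depths; the Hoeffding bound and the synthetic layer-wise decay data are simplifications, not the paper's exact DFRC procedure.

```python
import math
import numpy as np

def hoeffding_ucb(mean, n, delta):
    """One-sided Hoeffding upper confidence bound for a mean of [0, 1] losses."""
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def calibrate_exit(decay_per_layer, alpha=0.1, delta=0.1):
    """decay_per_layer[k] holds, per calibration example, how much worse the model
    does when context is attended up to layer k, relative to the zero-shot "safe"
    baseline, clipped to [0, 1]. Returns the deepest layer whose certified risk
    stays below alpha (layer 0 = earliest exit, closest to the zero-shot fallback)."""
    chosen = 0
    for k, losses in enumerate(decay_per_layer):
        ucb = hoeffding_ucb(float(np.mean(losses)), len(losses), delta)
        if ucb <= alpha:
            chosen = k          # deeper exit still certified safe
        else:
            break               # monotone scan stops at the first violation
    return chosen

# Synthetic calibration data: risk grows as more layers attend to the context.
rng = np.random.default_rng(0)
decay = [np.clip(rng.normal(0.01 * k, 0.02, size=500), 0.0, 1.0) for k in range(12)]
print("selected exit layer:", calibrate_exit(decay))
```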
[346] How Catastrophic is Your LLM? Certifying Risk in Conversation
Chengxiao Wang, Isha Chaudhary, Qian Hu, Weitong Ruan, Rahul Gupta, Gagandeep Singh
Main category: cs.AI
TL;DR: C³LLM: A statistical certification framework for bounding catastrophic risks in multi-turn LLM conversations with guaranteed confidence intervals.
Details
Motivation: Existing evaluations fail to fully reveal LLM vulnerabilities in conversational settings due to reliance on fixed attack prompts, lack of statistical guarantees, and inability to scale to vast multi-turn conversation spaces, posing serious public safety risks.
Method: Models multi-turn conversations as probability distributions over query sequences using Markov processes on query graphs with edges encoding semantic similarity. Defines practical distributions (random node, graph path, adaptive with rejection) and quantifies catastrophic risks using confidence intervals.
Result: The framework reveals substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training.
Conclusion: C³LLM provides a principled statistical certification approach for catastrophic risks in multi-turn LLM conversations, offering scalable evaluation with statistical guarantees that existing methods lack.
Abstract: Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose C³LLM, a novel, principled statistical Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions–random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
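A minimal sketch of the certification loop: conversations are sampled as random walks on a similarity-weighted query graph and a one-sided Clopper–Pearson bound certifies the catastrophic probability. The similarity matrix, conversation length, and the `is_catastrophic` judge are stand-ins, not the paper's components.

```python
import numpy as np
from scipy.stats import beta

def sample_conversation(sim, start, length, rng):
    """Random walk on the query graph: transition probabilities are proportional
    to the semantic similarity between queries (row-normalized)."""
    path = [start]
    for _ in range(length - 1):
        probs = sim[path[-1]] / sim[path[-1]].sum()
        path.append(rng.choice(len(sim), p=probs))
    return path

def certified_lower_bound(num_catastrophic, num_samples, confidence=0.95):
    """One-sided Clopper-Pearson lower bound on the catastrophic probability."""
    if num_catastrophic == 0:
        return 0.0
    return beta.ppf(1 - confidence, num_catastrophic,
                    num_samples - num_catastrophic + 1)

rng = np.random.default_rng(0)
n_queries = 50
sim = rng.random((n_queries, n_queries))   # stand-in for semantic similarity scores
np.fill_diagonal(sim, 0.0)

def is_catastrophic(path):                 # stand-in for running the LLM plus a judge
    return rng.random() < 0.1

n, k = 200, 0
for _ in range(n):
    path = sample_conversation(sim, start=int(rng.integers(n_queries)), length=5, rng=rng)
    k += int(is_catastrophic(path))
print(f"observed rate {k/n:.3f}, certified lower bound "
      f"{certified_lower_bound(k, n):.3f} at 95% confidence")
```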
[347] Resisting Manipulative Bots in Meme Coin Copy Trading: A Multi-Agent Approach with Chain-of-Thought Reasoning
Yichen Luo, Yebo Feng, Jiahua Xu, Yang Liu
Main category: cs.AI
TL;DR: A manipulation-resistant copy-trading system for meme coins using multi-agent architecture with multimodal LLM and chain-of-thought reasoning to defend against bot-driven manipulation attacks.
Details
Motivation: Copy trading in meme coin markets is vulnerable to manipulation attacks where adversaries deploy bots to front-run trades, conceal positions, and fabricate sentiment, systematically extracting value from naive copiers. Despite its prevalence, no robust defensive framework exists against bot-driven manipulation.
Method: A multi-agent architecture powered by a multimodal large language model (LLM) with chain-of-thought (CoT) reasoning to create a manipulation-resistant copy-trading system that can detect and defend against adversarial bot attacks.
Result: Outperforms zero-shot and most statistic-driven baselines in prediction accuracy and all baselines in economic performance, achieving average copier return of 3% per meme coin investment under realistic market frictions.
Conclusion: Demonstrates effectiveness of agent-based defenses and predictability of trader profitability in adversarial meme coin markets, providing practical foundation for robust copy trading.
Abstract: Copy trading has become the dominant entry strategy in meme coin markets. However, due to the market’s extremely illiquid and volatile nature, the strategy exposes an exploitable attack surface: adversaries deploy manipulative bots to front-run trades, conceal positions, and fabricate sentiment, systematically extracting value from naïve copiers at scale. Despite its prevalence, bot-driven manipulation remains largely unexplored, and no robust defensive framework exists. We propose a manipulation-resistant copy-trading system based on a multi-agent architecture powered by a multi-modal large language model (LLM) and chain-of-thought (CoT) reasoning. Our approach outperforms zero-shot and most statistic-driven baselines in prediction accuracy as well as all baselines in economic performance, achieving an average copier return of 3% per meme coin investment under realistic market frictions. Overall, our results demonstrate the effectiveness of agent-based defenses and predictability of trader profitability in adversarial meme coin markets, providing a practical foundation for robust copy trading.
[348] Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems
Tony Feng, Trieu Trinh, Garrett Bingham, Jiwon Kang, Shengtong Zhang, Sang-hyun Kim, Kevin Barreto, Carl Schildkraut, Junehyuk Jung, Jaehyeon Seo, Carlo Pagano, Yuri Chervonyi, Dawsen Hwang, Kaiying Hou, Sergei Gukov, Cheng-Chiang Tsai, Hyunwoo Choi, Youngbeom Jin, Wei-Yuan Li, Hao-An Wu, Ruey-An Shiu, Yu-Sheng Shih, Quoc V. Le, Thang Luong
Main category: cs.AI
TL;DR: AI-assisted evaluation of 700 “Open” Erdős problems using Gemini, addressing 13 problems with hybrid AI-human approach, revealing issues in literature identification and AI plagiarism risks.
Details
Motivation: To explore semi-autonomous mathematics discovery by systematically evaluating conjectures labeled as 'Open' in Bloom's Erdős Problems database using AI assistance, addressing the challenge of verifying mathematical conjectures at scale.
Method: Hybrid methodology: 1) AI-driven natural language verification using Gemini to narrow search space of 700 conjectures, 2) Human expert evaluation to gauge correctness and novelty of identified solutions, 3) Systematic analysis of 13 addressed problems (5 with novel autonomous solutions, 8 through literature identification).
Result: Addressed 13 ‘Open’ problems: 5 through seemingly novel autonomous solutions, 8 through identification of previous solutions in existing literature. Found that ‘Open’ status was often due to obscurity rather than difficulty. Identified key issues: difficulty of literature identification at scale and risk of ‘subconscious plagiarism’ by AI.
Conclusion: AI can assist in mathematics discovery but faces challenges with literature identification and plagiarism risks. The ‘Open’ status of problems may reflect obscurity rather than inherent difficulty. Hybrid AI-human approaches show promise for systematic mathematical verification.
Abstract: We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled ‘Open’ in Bloom’s Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked ‘Open’ in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the ‘Open’ status of the problems was through obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to math conjectures at scale, highlighting the difficulty of literature identification and the risk of ‘subconscious plagiarism’ by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.
[349] A Study of Adaptive Modeling Towards Robust Generalization
Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu
Main category: cs.AI
TL;DR: A unified all-atom framework for language models that grounds reasoning in molecular geometry using adaptive structural token allocation and geometry-informed token injection.
Details
Motivation: Existing approaches for biomolecular structure reasoning are modality-specific and use either sequence encodings or fixed-length connector tokens, which under-expose geometric cues, impose rigid fusion bottlenecks, cause over-compression, and have poor token allocation as structural complexity grows.
Method: Constructs variable-size structural patches on molecular graphs using an instruction-conditioned gating policy for complexity-aware token allocation, then refines patch tokens via cross-attention with modality embeddings and injects geometry-informed tokens into the language model.
Result: The approach yields consistent gains in heterogeneous structure-grounded reasoning across diverse all-atom benchmarks.
Conclusion: The unified all-atom framework improves structure grounding and reduces structural hallucinations in language models for biomolecular reasoning.
Abstract: Large language models (LLMs) increasingly support reasoning over biomolecular structures, but most existing approaches remain modality-specific and rely on either sequence-style encodings or fixed-length connector tokens for structural inputs. These designs can under-expose explicit geometric cues and impose rigid fusion bottlenecks, leading to over-compression and poor token allocation as structural complexity grows. We present a unified all-atom framework that grounds language reasoning in geometric information while adaptively scaling structural tokens. The method first constructs variable-size structural patches on molecular graphs using an instruction-conditioned gating policy, enabling complexity-aware allocation of query tokens. It then refines the resulting patch tokens via cross-attention with modality embeddings and injects geometry-informed tokens into the language model to improve structure grounding and reduce structural hallucinations. Across diverse all-atom benchmarks, the proposed approach yields consistent gains in heterogeneous structure-grounded reasoning. An anonymized implementation is provided in the supplementary material.
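A rough PyTorch sketch of the complexity-aware token-allocation idea for a single patch: a gate conditioned on the instruction decides how many learnable query tokens are kept, and the kept queries are refined by cross-attention over atom embeddings. The dimensions, the mean-pooled gate input, and the rounding rule are all assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AdaptivePatchTokens(nn.Module):
    """Complexity-aware allocation of query tokens for one molecular patch."""
    def __init__(self, dim, max_tokens=8):
        super().__init__()
        self.max_tokens = max_tokens
        self.queries = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, atom_embeddings, instruction_embedding):
        # atom_embeddings: (n_atoms, dim); instruction_embedding: (dim,)
        patch_summary = atom_embeddings.mean(dim=0)
        score = self.gate(torch.cat([patch_summary, instruction_embedding]))  # in (0, 1)
        n_tokens = max(1, int(torch.round(score * self.max_tokens).item()))
        q = self.queries[:n_tokens].unsqueeze(0)   # (1, n_tokens, dim)
        kv = atom_embeddings.unsqueeze(0)          # (1, n_atoms, dim)
        tokens, _ = self.cross_attn(q, kv, kv)     # geometry-informed patch tokens
        return tokens.squeeze(0)

patch = AdaptivePatchTokens(dim=64)
out = patch(torch.randn(30, 64), torch.randn(64))
print(out.shape)   # (k, 64) with k chosen by the gate
```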
[350] KANFIS: A Neuro-Symbolic Framework for Interpretable and Uncertainty-Aware Learning
Binbin Yong, Haoran Pei, Jun Shen, Haoran Li, Qingguo Zhou, Zhao Su
Main category: cs.AI
TL;DR: KANFIS is a compact neuro-fuzzy architecture that uses additive function decomposition to avoid exponential rule explosion, scales linearly with input dimensions, supports Type-1 and Type-2 fuzzy logic, and maintains interpretability through structured rule sets.
Details
Motivation: Conventional ANFIS architectures suffer from structural complexity where product-based inference causes exponential rule explosion in high-dimensional spaces, limiting scalability and interpretability.
Method: Proposes KANFIS (Kolmogorov-Arnold Neuro-Fuzzy Inference System) that unifies fuzzy reasoning with additive function decomposition, using additive aggregation mechanisms that scale linearly with input dimensionality, sparse masking for compact rule sets, and compatibility with both Type-1 and Interval Type-2 fuzzy logic.
Result: Empirical results show KANFIS achieves competitive performance against representative neural and neuro-fuzzy baselines while maintaining interpretability and scalability.
Conclusion: KANFIS provides a compact, scalable neuro-fuzzy architecture that overcomes exponential rule explosion while preserving interpretability and supporting uncertainty modeling through Type-2 fuzzy logic.
Abstract: Adaptive Neuro-Fuzzy Inference System (ANFIS) was designed to combine the learning capabilities of neural network with the reasoning transparency of fuzzy logic. However, conventional ANFIS architectures suffer from structural complexity, where the product-based inference mechanism causes an exponential explosion of rules in high-dimensional spaces. We herein propose the Kolmogorov-Arnold Neuro-Fuzzy Inference System (KANFIS), a compact neuro-symbolic architecture that unifies fuzzy reasoning with additive function decomposition. KANFIS employs an additive aggregation mechanism, under which both model parameters and rule complexity scale linearly with input dimensionality rather than exponentially. Furthermore, KANFIS is compatible with both Type-1 (T1) and Interval Type-2 (IT2) fuzzy logic systems, enabling explicit modeling of uncertainty and ambiguity in fuzzy representations. By using sparse masking mechanisms, KANFIS generates compact and structured rule sets, resulting in an intrinsically interpretable model with clear rule semantics and transparent inference processes. Empirical results demonstrate that KANFIS achieves competitive performance against representative neural and neuro-fuzzy baselines.
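An illustrative sketch of the linear-scaling additive aggregation idea (Type-1 only, no sparse masking), under assumed Gaussian memberships; it is not the KANFIS implementation, only a contrast with a product-rule grid whose rule count would grow as rules_per_dim ** dim.

```python
import torch
import torch.nn as nn

class AdditiveFuzzy(nn.Module):
    """Additive neuro-fuzzy layer: each input dimension has its own small set of
    Gaussian membership functions with rule consequents, and per-dimension rule
    outputs are summed. Parameters grow as dim * rules_per_dim, instead of the
    rules_per_dim ** dim rules a product-based ANFIS grid would need."""
    def __init__(self, dim, rules_per_dim=3):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(dim, rules_per_dim))
        self.log_widths = nn.Parameter(torch.zeros(dim, rules_per_dim))
        self.consequents = nn.Parameter(torch.randn(dim, rules_per_dim) * 0.1)

    def forward(self, x):                        # x: (batch, dim)
        diff = x.unsqueeze(-1) - self.centers    # (batch, dim, rules)
        mu = torch.exp(-0.5 * (diff / self.log_widths.exp()) ** 2)   # memberships
        mu = mu / (mu.sum(dim=-1, keepdim=True) + 1e-8)              # normalize per dim
        return (mu * self.consequents).sum(dim=(-1, -2))             # additive aggregation

model = AdditiveFuzzy(dim=20, rules_per_dim=3)
y = model(torch.randn(8, 20))
print(y.shape, sum(p.numel() for p in model.parameters()))   # 8 outputs, 180 params
```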
[351] Mitigating Conversational Inertia in Multi-Turn Agents
Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu
Main category: cs.AI
TL;DR: The paper identifies “conversational inertia” in LLMs acting as agents, where models overly attend to their own previous responses, limiting exploration. The authors propose Context Preference Learning to calibrate models to favor low-inertia responses and improve agent performance.
Details
Motivation: When LLMs are used as agents in multiturn scenarios, they exhibit "conversational inertia" - they overly mimic their own previous responses as few-shot examples, which constrains exploration and limits agent performance. This creates a tension between longer context (which provides more environmental feedback) and increased inertia (which reduces exploration).
Method: 1) Identified conversational inertia through attention analysis, showing strong diagonal attention to previous responses. 2) Proposed Context Preference Learning (CPL) to calibrate model preferences to favor low-inertia responses over high-inertia ones, using preference pairs constructed without environment rewards. 3) Developed context management strategies at inference time to balance exploration and exploitation.
Result: Experimental validation across eight agentic environments and one deep research scenario shows that the framework reduces conversational inertia and achieves performance improvements.
Conclusion: Conversational inertia is a key challenge when using few-shot LLMs as agents, and the proposed Context Preference Learning framework effectively addresses this issue by calibrating model preferences and providing context management strategies, leading to improved agent performance.
Abstract: Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multiturn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.
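One way to instantiate the preference calibration is a DPO-style objective over the reward-free pairs described above, preferring the short-context (low-inertia) action for the same state. This is a hedged sketch consistent with the summary, not necessarily the paper's exact loss; the log-probability values below are placeholders.

```python
import torch
import torch.nn.functional as F

def context_preference_loss(logp_low_policy, logp_high_policy,
                            logp_low_ref, logp_high_ref, beta=0.1):
    """DPO-style preference loss favoring the low-inertia response.
    logp_* are summed token log-probabilities for the same state's two candidate
    actions: one generated with a short context (low inertia, preferred) and one
    with a long context (high inertia, rejected)."""
    policy_margin = logp_low_policy - logp_high_policy
    ref_margin = logp_low_ref - logp_high_ref
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of 4 preference pairs built without any environment reward.
loss = context_preference_loss(
    torch.tensor([-12.0, -9.5, -11.0, -10.2]),   # preferred (short-context) under policy
    torch.tensor([-11.0, -9.0, -12.5, -10.0]),   # rejected (long-context) under policy
    torch.tensor([-12.5, -9.6, -11.2, -10.4]),   # preferred under frozen reference
    torch.tensor([-11.2, -9.1, -12.4, -10.1]),   # rejected under frozen reference
)
print(loss.item())
```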
[352] Vibe AIGC: A New Paradigm for Content Generation via Agentic Orchestration
Jiaheng Liu, Yuanxing Zhang, Shihao Li, Xinping Lei
Main category: cs.AI
TL;DR: Vibe AIGC introduces a new paradigm for content generation using agentic orchestration, moving from stochastic single-shot models to hierarchical multi-agent workflows to bridge the Intent-Execution Gap.
Details
Motivation: Current generative AI faces a "usability ceiling" due to the Intent-Execution Gap - the disparity between the creator's high-level intent and the stochastic, black-box nature of single-shot models. The model-centric paradigm driven by scaling laws has limitations in usability despite visual fidelity improvements.
Method: Introduces Vibe AIGC paradigm with agentic orchestration: users provide a "Vibe" (high-level aesthetic/functional representation), a centralized Meta-Planner deconstructs this into executable, verifiable, and adaptive agentic pipelines using hierarchical multi-agent workflows.
Result: Transitions from stochastic inference to logical orchestration, bridging the gap between human imagination and machine execution. Transforms AI from a fragile inference engine into a robust system-level engineering partner.
Conclusion: This paradigm shift will redefine human-AI collaborative economy and democratize creation of complex, long-horizon digital assets by making AI a system-level engineering partner rather than just an inference engine.
Abstract: For the past decade, the trajectory of generative artificial intelligence (AI) has been dominated by a model-centric paradigm driven by scaling laws. Despite significant leaps in visual fidelity, this approach has encountered a "usability ceiling" manifested as the Intent-Execution Gap (i.e., the fundamental disparity between a creator's high-level intent and the stochastic, black-box nature of current single-shot models). In this paper, inspired by the Vibe Coding, we introduce the Vibe AIGC, a new paradigm for content generation via agentic orchestration, which represents the autonomous synthesis of hierarchical multi-agent workflows. Under this paradigm, the user's role transcends traditional prompt engineering, evolving into a Commander who provides a Vibe, a high-level representation encompassing aesthetic preferences, functional logic, and etc. A centralized Meta-Planner then functions as a system architect, deconstructing this "Vibe" into executable, verifiable, and adaptive agentic pipelines. By transitioning from stochastic inference to logical orchestration, Vibe AIGC bridges the gap between human imagination and machine execution. We contend that this shift will redefine the human-AI collaborative economy, transforming AI from a fragile inference engine into a robust system-level engineering partner that democratizes the creation of complex, long-horizon digital assets.
cs.SD
[353] AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders
Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya
Main category: cs.SD
TL;DR: Sparse Autoencoders (SAEs) applied to audio models (Whisper and HuBERT) for interpretable feature extraction, showing stability, interpretability, and practical applications including concept erasure, false speech detection reduction, and correlation with human EEG during speech perception.
Details
Motivation: Sparse Autoencoders are effective for interpreting neural representations but remain underexplored in the audio domain. The paper aims to bridge this gap by applying SAEs to state-of-the-art audio models to understand their internal representations and demonstrate practical utility.
Method: Train SAEs across all encoder layers of Whisper and HuBERT models. Evaluate stability across random seeds, interpretability through feature analysis, and practical applications including concept erasure and feature steering. Also correlate SAE features with human EEG data during speech perception.
Result: Over 50% of features remain consistent across random seeds with preserved reconstruction quality. SAE features capture acoustic, semantic, and specific audio events (environmental noises, paralinguistic sounds). Concept erasure requires removing only 19-27% of features. Feature steering reduces Whisper’s false speech detections by 70% with minimal WER increase. SAE features correlate with human EEG activity during speech perception.
Conclusion: SAEs provide stable, interpretable representations for audio models with practical applications in model editing and alignment with human neural processing. The approach demonstrates the value of interpretability methods for audio understanding systems.
Abstract: Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability, interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering) and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper’s false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
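For readers unfamiliar with the setup, here is a minimal sketch of training an overcomplete ReLU sparse autoencoder on frame-level encoder activations and of zeroing a feature subset before decoding (the erasure step). The expansion factor, L1 weight, and random feature subset are illustrative assumptions; the real pipeline would select concept-specific features rather than random ones.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """SAE over encoder-layer activations: an overcomplete ReLU dictionary trained
    with an L1 sparsity penalty; features can later be inspected, ablated, or steered."""
    def __init__(self, d_model, expansion=8):
        super().__init__()
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def forward(self, acts):                      # acts: (batch, d_model)
        codes = torch.relu(self.encoder(acts))    # sparse feature activations
        recon = self.decoder(codes)
        return recon, codes

sae = SparseAutoencoder(d_model=512)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3

acts = torch.randn(256, 512)                     # stand-in for Whisper/HuBERT activations
opt.zero_grad()
recon, codes = sae(acts)
loss = torch.mean((recon - acts) ** 2) + l1_weight * codes.abs().mean()
loss.backward()
opt.step()

# Erasure sketch: zero a subset of feature activations before decoding.
erase = torch.randperm(codes.shape[1])[: int(0.2 * codes.shape[1])]
codes_edited = codes.detach().clone()
codes_edited[:, erase] = 0.0
steered_acts = sae.decoder(codes_edited)         # edited activations fed back to the model
print(steered_acts.shape)
```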
[354] Speech-XL: Towards Long-Form Speech Understanding in Large Speech Language Models
Haoqin Sun, Chenyang Lyu, Shiwan Zhao, Xuanfan Ni, Xiangyu Kong, Longyue Wang, Weihua Luo, Yong Qin
Main category: cs.SD
TL;DR: Speech-XL is a novel model that addresses long-form audio understanding limitations in Large Speech Language Models by using Speech Summarization Tokens (SST) to compress speech intervals via KV sparsification, enabling efficient processing of extended audio sequences.
Details
Motivation: Current Large Speech Language Models struggle with long-form audio understanding due to limited context length and high memory requirements for processing extended audio sequences, creating a bottleneck for practical applications.
Method: Introduces Speech Summarization Tokens (SST) that encapsulate speech interval information into KV pairs, trained via instruction fine-tuning with curriculum learning from low to high compression ratios, leveraging LLMs’ intrinsic KV sparsification capacity.
Result: Achieves competitive performance on major benchmarks (LongSpeech and AUDIOMARATHON) despite using significantly less training data than other baselines, effectively addressing long-form audio modeling bottlenecks.
Conclusion: Speech-XL provides a novel approach to condensing extensive acoustic sequences by addressing key limitations in long-form audio understanding through efficient compression techniques.
Abstract: Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy where the SST learns to compress information in a progressive manner–advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
[355] Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language
Isaac Wiafe, Akon Obu Ekpezu, Sumaya Ahmed Salihs, Elikem Doe Atsakpo, Fiifi Baffoe Payin Winful, Jamal-Deen Abdulai
Main category: cs.SD
TL;DR: A curated corpus of 50.01 hours of impaired speech data in Akan language covering four impairment types (stammering, cerebral palsy, cleft palate, stroke), with audio recordings, transcriptions, and metadata for low-resource ASR research.
Details
Motivation: To address the lack of impaired speech data for developing inclusive speech technologies in low-resource languages like Akan, which hinders advancements in assistive speech technology for speech-impaired communities.
Method: Collected speech samples from native Akan speakers with four types of speech impairments in controlled supervised environments. Participants described pre-selected images in their own words, resulting in audio recordings, transcriptions, and comprehensive metadata.
Result: Created a dataset of 50.01 hours of audio recordings across four impairment classes (stammering, cerebral palsy, cleft palate, stroke-induced speech disorder) with associated transcriptions and metadata on speaker demographics, impairment class, recording environment, and device.
Conclusion: This dataset fills a critical gap in impaired speech resources for low-resource languages and is intended to support research in automatic disordered speech recognition systems and assistive speech technology development.
Abstract: The lack of impaired speech data hinders advancements in the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents a curated corpus of speech samples from native Akan speakers with speech impairment. The dataset comprises of 50.01 hours of audio recordings cutting across four classes of impaired speech namely stammering, cerebral palsy, cleft palate, and stroke induced speech disorder. Recordings were done in controlled supervised environments were participants described pre-selected images in their own words. The resulting dataset is a collection of audio recordings, transcriptions, and associated metadata on speaker demographics, class of impairment, recording environment and device. The dataset is intended to support research in low-resource automatic disordered speech recognition systems and assistive speech technology.
[356] HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection
Qing Wen, Haohao Li, Zhongjie Ba, Peng Cheng, Miao He, Li Lu, Kui Ren
Main category: cs.SD
TL;DR: HyperPotter: A hypergraph-based framework for audio deepfake detection that models high-order interactions between features through clustering-based hyperedges with class-aware prototype initialization, achieving superior performance and generalization.
Details
Motivation: Current audio deepfake detection methods focus on local temporal/spectral features or pairwise relations, overlooking high-order interactions that capture discriminative patterns emerging from multiple feature components beyond their individual contributions.
Method: Proposes HyperPotter, a hypergraph-based framework that explicitly models high-order interactions through clustering-based hyperedges with class-aware prototype initialization to capture synergistic relationships between multiple feature components.
Result: HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
Conclusion: Modeling high-order interactions through hypergraphs significantly improves audio deepfake detection performance and generalization capabilities across diverse datasets and attack types.
Abstract: Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
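A simplified sketch of the hypergraph idea: feature nodes are softly clustered around learnable prototypes to form hyperedges, and information flows node → hyperedge → node so that groups of features interact jointly rather than pairwise. The random prototypes, soft incidence, and toy scoring head are stand-ins; the paper's class-aware prototype initialization and full architecture are not reproduced here.

```python
import torch
import torch.nn as nn

class HypergraphLayer(nn.Module):
    """One node -> hyperedge -> node aggregation step with soft cluster hyperedges."""
    def __init__(self, dim, n_hyperedges):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_hyperedges, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes):                         # nodes: (n_nodes, dim)
        dists = torch.cdist(nodes, self.prototypes)   # (n_nodes, n_edges)
        H = torch.softmax(-dists, dim=-1)             # soft incidence matrix
        edge_feats = (H.t() @ nodes) / (H.sum(dim=0, keepdim=True).t() + 1e-8)
        node_feats = (H @ edge_feats) / (H.sum(dim=1, keepdim=True) + 1e-8)
        return torch.relu(self.proj(node_feats)) + nodes   # residual update

layer = HypergraphLayer(dim=128, n_hyperedges=6)
frames = torch.randn(200, 128)                        # per-frame features of one utterance
out = layer(frames)
utterance_score = out.mean(dim=0) @ torch.randn(128)  # toy real/fake scoring head
print(out.shape, utterance_score.item())
```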
[357] Video Soundtrack Generation by Aligning Emotions and Temporal Boundaries
Serkan Sulun, Paula Viana, Matthew E. P. Davies
Main category: cs.SD
TL;DR: EMSYNC is an automatic video-based symbolic music generator that creates emotionally and temporally aligned MIDI music for videos using a two-stage framework with emotion classification and conditional music generation.
Details
Motivation: Creating soundtracks for videos is costly and time-consuming for content creators. There's a need for automatic systems that can generate music that aligns both emotionally and temporally with video content.
Method: Two-stage framework: 1) Pretrained video emotion classifier extracts emotional features, 2) Conditional music generator produces MIDI sequences guided by emotional and temporal cues. Introduces boundary offsets for temporal alignment and a mapping scheme to bridge categorical emotion outputs with continuous valence-arousal inputs.
Result: Outperforms state-of-the-art models in objective and subjective evaluations across different video datasets. Demonstrates effectiveness in generating music aligned to video both emotionally and temporally.
Conclusion: EMSYNC provides an effective solution for automatic video soundtrack generation that aligns music with video content both emotionally and temporally, addressing the costly and time-consuming challenge for content creators.
Abstract: Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video’s emotional content and temporal boundaries. It follows a two-stage framework, where a pretrained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across different representations. Our method outperforms state-of-the-art models in objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music aligned to video both emotionally and temporally. Our demo and output samples are available at https://serkansulun.com/emsync.
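The two conditioning ingredients are simple to illustrate. Below is a small sketch of (a) boundary offsets, i.e. the time remaining until the next scene cut at each generation step, and (b) a categorical-to-valence/arousal lookup bridging the classifier and the generator; the specific coordinates are illustrative, not the paper's mapping.

```python
# (a) Boundary offsets: at each generation step, how far away is the next scene cut.
def boundary_offsets(step_times, cut_times):
    """For every generation step time (seconds), return the time remaining until the
    next video scene cut (None after the last cut), so the generator can anticipate
    the cut and land a chord change on it."""
    offsets = []
    for t in step_times:
        future = [c - t for c in cut_times if c >= t]
        offsets.append(min(future) if future else None)
    return offsets

# (b) Categorical-to-dimensional mapping between a discrete emotion classifier and a
# valence-arousal-conditioned MIDI generator (coordinates are illustrative).
EMOTION_TO_VA = {
    "happy": (0.8, 0.6),
    "tense": (-0.4, 0.7),
    "sad":   (-0.7, -0.4),
    "calm":  (0.5, -0.5),
}

cuts = [4.0, 9.5, 15.0]
steps = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(boundary_offsets(steps, cuts))   # [4.0, 2.0, 0.0, 3.5, 1.5, 5.0]
print(EMOTION_TO_VA["tense"])
```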
[358] BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music
Mingyang Yao, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick
Main category: cs.SD
TL;DR: BACHI: A symbolic chord recognition model using boundary detection and iterative ranking that mimics human ear-training practices, achieving SOTA performance on classical and pop music benchmarks.
Details
Motivation: Address two key challenges in automatic chord recognition: 1) limited attention to symbolic music ACR due to data scarcity, and 2) existing methods overlook strategies aligned with human music analytical practices.
Method: Introduces POP909-CL dataset with tempo-aligned content and human-corrected labels. Proposes BACHI model that decomposes chord recognition into boundary detection and iterative ranking of chord root, quality, and bass (inversion), mirroring human ear-training practices.
Result: BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks. Ablation studies validate the effectiveness of each module.
Conclusion: The proposed approach successfully addresses symbolic chord recognition challenges by incorporating human-like analytical practices and providing enhanced dataset, achieving superior performance.
Abstract: Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) We propose BACHI, a symbolic chord recognition model that decomposes the task into different decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors the human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.
[359] Leveraging Whisper Embeddings for Audio-based Lyrics Matching
Eleonora Mancini, Joan Serrà, Paolo Torroni, Yuki Mitsufuji
Main category: cs.SD
TL;DR: WEALY is a reproducible pipeline using Whisper decoder embeddings for audio-based lyrics matching, establishing robust baselines and exploring multimodal text-acoustic integration.
Details
Motivation: Existing audio-based lyrics matching methods suffer from limited reproducibility and inconsistent baselines, creating challenges for reliable research and comparison in music information retrieval.
Method: WEALY leverages Whisper decoder embeddings for lyrics matching, establishes transparent baselines, and explores multimodal extensions integrating textual and acoustic features through extensive experiments and ablation studies.
Result: WEALY achieves performance comparable to state-of-the-art methods while providing reproducibility, with additional analyses on language robustness, loss functions, and embedding strategies.
Conclusion: The work contributes a reliable benchmark for future research and demonstrates the potential of speech technologies for music information retrieval tasks through reproducible multimodal approaches.
Abstract: Audio-based lyrics matching can be an appealing alternative to other content-based retrieval approaches, but existing methods often suffer from limited reproducibility and inconsistent baselines. In this work, we introduce WEALY, a fully reproducible pipeline that leverages Whisper decoder embeddings for lyrics matching tasks. WEALY establishes robust and transparent baselines, while also exploring multimodal extensions that integrate textual and acoustic features. Through extensive experiments on standard datasets, we demonstrate that WEALY achieves a performance comparable to state-of-the-art methods that lack reproducibility. In addition, we provide ablation studies and analyses on language robustness, loss functions, and embedding strategies. This work contributes a reliable benchmark for future research, and underscores the potential of speech technologies for music information retrieval tasks.
[360] ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan
Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li
Main category: cs.SD
TL;DR: A dataset and challenge for detecting component-level audio deepfakes where speech and environmental sounds can be independently manipulated, making detection more challenging than whole audio deepfakes.
Details
Motivation: Real-world audio contains mixtures of speech and environmental sounds, and with advances in generation models, either component can now be modified independently. These component-level manipulations are harder to detect than whole audio deepfakes because the unaltered component can mislead detection systems and they sound more natural to humans.
Method: Proposed CompSpoofV2 dataset (250k+ audio samples, ~283 hours) for component-level audio anti-spoofing, and a separation-enhanced joint learning framework. Also launched the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2) focusing on component-level spoofing.
Result: Created a large-scale curated dataset and framework for component-level audio deepfake detection, and established a challenge (ESDD2) to be held at ICME 2026 to advance research in this area.
Conclusion: Component-level audio manipulations present a more challenging detection scenario than whole audio deepfakes, requiring specialized datasets and approaches. The proposed dataset, framework, and challenge aim to advance research in detecting realistic audio deepfakes where speech and environmental sounds can be independently manipulated.
Abstract: Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead the systems designed for whole deepfake audio, and they often sound more natural to human listeners. To address this gap, we have proposed CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, which contains over 250k audio samples, with a total duration of approximately 283 hours. Based on the CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).
[361] ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation
Junmin Gong, Yulin Song, Wenxiao Zhao, Sen Wang, Shengyuan Xu, Jing Guo
Main category: cs.SD
TL;DR: ACE-Step v1.5 is an efficient open-source music foundation model that achieves commercial-grade music generation on consumer hardware with fast inference and low VRAM requirements.
Details
Motivation: To create an accessible, high-quality music generation model that runs efficiently on consumer hardware while offering professional-grade capabilities and personalization options for creators.
Method: Uses a novel hybrid architecture with a Language Model as an omni-capable planner that transforms user queries into comprehensive song blueprints via Chain-of-Thought, guiding a Diffusion Transformer. Features intrinsic reinforcement learning without external reward models.
Result: Achieves quality beyond most commercial music models with extremely fast inference (under 2 seconds per song on A100, under 10 seconds on RTX 3090), runs locally with <4GB VRAM, supports multi-language prompts, and offers versatile editing capabilities.
Conclusion: ACE-Step v1.5 democratizes professional music generation by making high-quality, controllable music synthesis accessible on consumer hardware with efficient performance and personalization capabilities.
Abstract: We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast – under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints – scaling from short loops to 10-minute compositions – while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model’s internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities – such as cover generation, repainting, and vocal-to-BGM conversion – while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/
[362] UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
Main category: cs.SD
TL;DR: UniAudio 2.0 introduces a novel audio tokenizer (ReasoningCodec) with reasoning and reconstruction tokens, and a unified autoregressive model for text and audio, achieving strong few-shot/zero-shot generalization across speech, sound, and music tasks.
Details
Motivation: The paper addresses two fundamental challenges in audio language models: (1) designing an audio tokenizer that serves as intermediate representation for both understanding and generation, and (2) building an audio foundation model that generalizes in few-shot and zero-shot settings like large language models.
Method: Proposes ReasoningCodec, a discrete audio codec with two token types: reasoning tokens for text-aligned high-level analysis/planning, and reconstruction tokens for semantic-rich acoustic cues. Also introduces a unified autoregressive architecture for text and audio with multi-stage training and multi-task data construction, trained on 100B text tokens and 60B audio tokens.
Result: Achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks across speech, sound, and music domains.
Conclusion: The proposed ReasoningCodec and unified autoregressive framework effectively address foundational problems in audio language models, enabling both understanding and generation capabilities with strong generalization performance across diverse audio tasks.
Abstract: We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
cs.LG
[363] DCER: Dual-Stage Compression and Energy-Based Reconstruction
Yiwen Wang, Jiahao Qin
Main category: cs.LG
TL;DR: DCER: Dual-stage compression and energy-based reconstruction framework for robust multimodal fusion that handles noisy inputs and missing modalities through frequency transforms and learned energy functions.
Details
Motivation: Multimodal fusion faces two key robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. Existing methods often fail to address both issues simultaneously.
Method: Proposes DCER with dual-stage compression: 1) within-modality frequency transforms (wavelet for audio, DCT for video) to remove noise while preserving task-relevant patterns, and 2) cross-modality bottleneck tokens to force genuine integration. For missing modalities, uses energy-based reconstruction via gradient descent on a learned energy function.
Result: Achieves state-of-the-art performance on CMU-MOSI, CMU-MOSEI, and CH-SIMS benchmarks. Shows U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. Energy function provides intrinsic uncertainty quantification with ρ > 0.72 correlation with prediction error.
Conclusion: DCER provides a unified framework addressing both noisy inputs and missing modalities in multimodal fusion, with energy-based reconstruction offering uncertainty quantification and robust performance across varying missing modality conditions.
Abstract: Multimodal fusion faces two robustness challenges: noisy inputs degrade representation quality, and missing modalities cause prediction failures. We propose DCER, a unified framework addressing both challenges through dual-stage compression and energy-based reconstruction. The compression stage operates at two levels: within-modality frequency transforms (wavelet for audio, DCT for video) remove noise while preserving task-relevant patterns, and cross-modality bottleneck tokens force genuine integration rather than modality-specific shortcuts. For missing modalities, energy-based reconstruction recovers representations via gradient descent on a learned energy function, with the final energy providing intrinsic uncertainty quantification (ρ > 0.72 correlation with prediction error). Experiments on CMU-MOSI, CMU-MOSEI, and CH-SIMS demonstrate state-of-the-art performance across all benchmarks, with a U-shaped robustness pattern favoring multimodal fusion at both complete and high-missing conditions. The code will be available on Github.
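A minimal sketch of the energy-based reconstruction step for a missing modality: a missing-modality embedding is recovered by gradient descent on a learned energy conditioned on the observed modality, and the final energy is read off as an uncertainty signal. The network shape, dimensions, and optimizer settings are assumptions, and the energy model here is untrained.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Learned energy E(z_missing, z_observed): low energy for plausible pairs."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, z_missing, z_observed):
        return self.net(torch.cat([z_missing, z_observed], dim=-1)).squeeze(-1)

def reconstruct(energy, z_observed, dim, steps=50, lr=0.1):
    """Recover the missing-modality representation by gradient descent on the
    energy; the final energy value doubles as an uncertainty estimate."""
    z = torch.zeros(z_observed.shape[0], dim, requires_grad=True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy(z, z_observed).sum().backward()
        opt.step()
    with torch.no_grad():
        final_energy = energy(z, z_observed)
    return z.detach(), final_energy              # higher energy -> less reliable

energy = EnergyNet(dim=64)
z_text = torch.randn(16, 64)                     # observed modality (e.g. text)
z_audio_hat, uncertainty = reconstruct(energy, z_text, dim=64)
print(z_audio_hat.shape, uncertainty.shape)
```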
[364] Denoising diffusion networks for normative modeling in neuroimaging
Luke Whitbread, Lyle J. Palmer, Mark Jenkinson
Main category: cs.LG
TL;DR: DDPMs for normative modeling of neuroimaging data, enabling multivariate dependence modeling while deriving univariate centiles and deviation scores through sampling.
Details
Motivation: Traditional normative modeling in neuroimaging fits separate models per imaging-derived phenotype, discarding multivariate dependence that may encode coordinated patterns. There's a need for unified conditional density estimators that can capture these dependencies while remaining compatible with standard clinical interpretation pipelines.
Method: Proposes denoising diffusion probabilistic models (DDPMs) as unified conditional density estimators for tabular imaging-derived phenotypes. Uses two denoiser backbones: (1) FiLM-conditioned multilayer perceptron, and (2) tabular transformer with feature self-attention and intersample attention (SAINT) with conditioning covariates through learned embeddings. Evaluates on synthetic benchmarks and UK Biobank FreeSurfer phenotypes scaling from 2 to 200 dimensions.
Result: For low dimensions, diffusion models deliver well-calibrated per-IDP outputs comparable to traditional baselines while jointly modeling realistic dependence structure. At higher dimensions, the transformer backbone remains substantially better calibrated than the MLP and better preserves higher-order dependence. Diffusion models enable scalable joint normative models that remain compatible with standard per-IDP pipelines.
Conclusion: Diffusion-based normative modeling provides a practical route to calibrated multivariate deviation profiles in neuroimaging, capturing multivariate dependencies while maintaining compatibility with clinical interpretation standards.
Abstract: Normative modeling estimates reference distributions of biological measures conditional on covariates, enabling centiles and clinically interpretable deviation scores to be derived. Most neuroimaging pipelines fit one model per imaging-derived phenotype (IDP), which scales well but discards multivariate dependence that may encode coordinated patterns. We propose denoising diffusion probabilistic models (DDPMs) as a unified conditional density estimator for tabular IDPs, from which univariate centiles and deviation scores are derived by sampling. We utilise two denoiser backbones: (i) a feature-wise linear modulation (FiLM) conditioned multilayer perceptron (MLP) and (ii) a tabular transformer with feature self-attention and intersample attention (SAINT), conditioning covariates through learned embeddings. We evaluate on a synthetic benchmark with heteroscedastic and multimodal age effects and on UK Biobank FreeSurfer phenotypes, scaling from dimension of 2 to 200. Our evaluation suite includes centile calibration (absolute centile error, empirical coverage, and the probability integral transform), distributional fidelity (Kolmogorov-Smirnov tests), multivariate dependence diagnostics, and nearest-neighbour memorisation analysis. For low dimensions, diffusion models deliver well-calibrated per-IDP outputs comparable to traditional baselines while jointly modeling realistic dependence structure. At higher dimensions, the transformer backbone remains substantially better calibrated than the MLP and better preserves higher-order dependence, enabling scalable joint normative models that remain compatible with standard per-IDP pipelines. These results support diffusion-based normative modeling as a practical route to calibrated multivariate deviation profiles in neuroimaging.
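As a concrete illustration of the first backbone, here is a minimal FiLM-conditioned MLP denoiser for tabular data; the layer sizes, timestep handling, and covariate choice are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMDenoiser(nn.Module):
    """MLP denoiser whose hidden features are scaled/shifted by covariates and timestep (FiLM)."""
    def __init__(self, n_idps, cov_dim, hidden=256):
        super().__init__()
        self.inp = nn.Linear(n_idps, hidden)
        self.body = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_idps)
        # Conditioning network emits a per-feature scale (gamma) and shift (beta).
        self.film = nn.Sequential(nn.Linear(cov_dim + 1, hidden), nn.SiLU(),
                                  nn.Linear(hidden, 2 * hidden))

    def forward(self, x_t, t, covariates):
        cond = torch.cat([covariates, t.unsqueeze(-1).float()], dim=-1)
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = F.silu(self.inp(x_t))
        h = F.silu(gamma * self.body(h) + beta)   # FiLM modulation of the hidden features
        return self.out(h)                        # predicted noise for the DDPM step

model = FiLMDenoiser(n_idps=200, cov_dim=3)       # e.g. age, sex, site as covariates (assumed)
x_t = torch.randn(16, 200)                        # noised IDP vectors
t = torch.randint(0, 1000, (16,))
eps_hat = model(x_t, t, torch.randn(16, 3))
```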
[365] A Causal Perspective for Enhancing Jailbreak Attack and Defense
Licheng Pan, Yunsheng Lu, Jiexi Liu, Jialing Tao, Haozhe Feng, Hui Xue, Zhixuan Chu, Kui Ren
Main category: cs.LG
TL;DR: Causal Analyst framework uses LLMs and causal discovery to identify direct causes of jailbreaks in LLMs, with applications for both attack enhancement and defense.
Details
Motivation: Existing jailbreak analysis focuses on latent representations but overlooks causal relationships between interpretable prompt features and jailbreak occurrences, limiting understanding of underlying mechanisms.
Method: Proposes Causal Analyst framework integrating LLMs into data-driven causal discovery; creates dataset of 35k jailbreak attempts across 7 LLMs with 37 human-readable prompt features; jointly trains LLM-based prompt encoding and GNN-based causal graph learning to reconstruct causal pathways.
Result: Identifies specific features like “Positive Character” and “Number of Task Steps” as direct causal drivers of jailbreaks; demonstrates practical utility through Jailbreaking Enhancer (boosts attack success rates) and Guardrail Advisor (extracts malicious intent from obfuscated queries).
Conclusion: Causal analysis of jailbreak features is effective and interpretable for improving LLM reliability; framework provides insights for both attack and defense applications.
Abstract: Uncovering the mechanisms behind “jailbreaks” in large language models (LLMs) is crucial for enhancing their safety and reliability, yet these mechanisms remain poorly understood. Existing studies predominantly analyze jailbreak prompts by probing latent representations, often overlooking the causal relationships between interpretable prompt features and jailbreak occurrences. In this work, we propose Causal Analyst, a framework that integrates LLMs into data-driven causal discovery to identify the direct causes of jailbreaks and leverage them for both attack and defense. We introduce a comprehensive dataset comprising 35k jailbreak attempts across seven LLMs, systematically constructed from 100 attack templates and 50 harmful queries, annotated with 37 meticulously designed human-readable prompt features. By jointly training LLM-based prompt encoding and GNN-based causal graph learning, we reconstruct causal pathways linking prompt features to jailbreak responses. Our analysis reveals that specific features, such as “Positive Character” and “Number of Task Steps”, act as direct causal drivers of jailbreaks. We demonstrate the practical utility of these insights through two applications: (1) a Jailbreaking Enhancer that targets identified causal features to significantly boost attack success rates on public benchmarks, and (2) a Guardrail Advisor that utilizes the learned causal graph to extract true malicious intent from obfuscated queries. Extensive experiments, including baseline comparisons and causal structure validation, confirm the robustness of our causal analysis and its superiority over non-causal approaches. Our results suggest that analyzing jailbreak features from a causal perspective is an effective and interpretable approach for improving LLM reliability. Our code is available at https://github.com/Master-PLC/Causal-Analyst.
[366] A$^2$-LLM: An End-to-end Conversational Audio Avatar Large Language Model
Xiaolin Hu, Hang Yuan, Xinzhu Sang, Binbin Yan, Zhou Yu, Cong Huang, Kai Chen
Main category: cs.LG
TL;DR: A^2-LLM is an end-to-end conversational audio avatar LLM that jointly models language, audio prosody, and 3D facial motion in a unified framework, generating emotionally rich facial movements beyond lip-sync.
Details
Motivation: Current conversational digital humans rely on cascaded architectures with accumulated errors, high latency, and poor real-time performance. These systems lack access to conversational context and prioritize rigid lip-sync over emotional depth.
Method: Propose A^2-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. Introduce FLAME-QA, a high-quality multimodal dataset aligning semantic intent with expressive facial dynamics in QA format.
Result: The system achieves superior emotional expressiveness while maintaining real-time efficiency with 500 ms latency and 0.7 RTF (real-time factor).
Conclusion: A^2-LLM addresses limitations of cascaded architectures by providing unified multimodal reasoning for conversational avatars, enabling emotionally rich facial movements beyond simple lip-synchronization.
Abstract: Developing expressive and responsive conversational digital humans is a cornerstone of next-generation human-computer interaction. While large language models (LLMs) have significantly enhanced dialogue capabilities, most current systems still rely on cascaded architectures that connect independent modules. These pipelines are often plagued by accumulated errors, high latency, and poor real-time performance. Lacking access to the underlying conversational context, these pipelines inherently prioritize rigid lip-sync over emotional depth. To address these challenges, we propose A$^2$-LLM, an end-to-end conversational audio avatar large language model that jointly reasons about language, audio prosody, and 3D facial motion within a unified framework. To facilitate training, we introduce FLAME-QA, a high-quality multimodal dataset designed to align semantic intent with expressive facial dynamics within a QA format. By leveraging deep semantic understanding, A$^2$-LLM generates emotionally rich facial movements beyond simple lip-synchronization. Experimental results demonstrate that our system achieves superior emotional expressiveness while maintaining real-time efficiency (500 ms latency, 0.7 RTF).
[367] Momentum Attention: The Physics of In-Context Learning and Spectral Forensics for Mechanistic Interpretability
Kingsuk Maitra
Main category: cs.LG
TL;DR: Momentum Attention introduces symplectic physics into Transformers via kinematic momentum, enabling single-layer induction heads through a symplectic-filter duality that connects Hamiltonian mechanics with signal processing.
Details
Motivation: To extend mechanistic interpretability of Transformers by viewing them as physical circuits with conservation laws and time-varying dynamics, bridging generative AI with Hamiltonian physics and signal processing.
Method: Introduces Momentum Attention with symplectic augmentation using kinematic difference operator p_t = q_t - q_{t-1}, applying symplectic shear on queries and keys, establishing symplectic-filter duality where physical shear equals high-pass filtering.
Result: 125M Momentum model exceeds expectations on induction-heavy tasks, tracks 350M baseline within ~2.9% validation loss, enables single-layer induction heads (bypassing L≥2 constraint), and reveals scaling law γ* = 4.17 × N^{-0.74} for momentum-depth fungibility.
Conclusion: The framework connects generative AI, Hamiltonian physics, and signal processing, offering complementary analytical toolkit with validated results from 5,100+ experiments showing momentum enables single-layer induction through symplectic-filter duality.
Abstract: The Mechanistic Interpretability (MI) program has mapped the Transformer as a precise computational graph. We extend this graph with a conservation law and time-varying AC dynamics, viewing it as a physical circuit. We introduce Momentum Attention, a symplectic augmentation embedding physical priors via the kinematic difference operator $p_t = q_t - q_{t-1}$, implementing the symplectic shear $\hat{q}_t = q_t + γp_t$ on queries and keys. We identify a fundamental Symplectic-Filter Duality: the physical shear is mathematically equivalent to a High-Pass Filter. This duality is our cornerstone contribution – by injecting kinematic momentum, we sidestep the topological depth constraint ($L \geq 2$) for induction head formation. While standard architectures require two layers for induction from static positions, our extension grants direct access to velocity, enabling Single-Layer Induction and Spectral Forensics via Bode Plots. We formalize an Orthogonality Theorem proving that DC (semantic) and AC (mechanistic) signals segregate into orthogonal frequency bands when Low-Pass RoPE interacts with High-Pass Momentum. Validated through 5,100+ controlled experiments (documented in Supplementary Appendices A–R and 27 Jupyter notebooks), our 125M Momentum model exceeds expectations on induction-heavy tasks while tracking a 350M baseline within $\sim$2.9% validation loss. Dedicated associative recall experiments reveal a scaling law $γ^* = 4.17 \times N^{-0.74}$ establishing momentum-depth fungibility. We offer this framework as a complementary analytical toolkit connecting Generative AI, Hamiltonian Physics, and Signal Processing.
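The symplectic shear itself is a one-line operation on query/key sequences. Below is a minimal sketch of it; the boundary handling at the first position and the choice of gamma are illustrative assumptions.

```python
import numpy as np

def symplectic_shear(x, gamma=0.5):
    """Augment token features with kinematic momentum: x_hat_t = x_t + gamma * (x_t - x_{t-1}).

    x: (seq_len, d) queries or keys. The first position has no predecessor, so its
    momentum is taken as zero here -- an illustrative boundary choice, not from the paper.
    """
    p = np.zeros_like(x)
    p[1:] = x[1:] - x[:-1]          # p_t = q_t - q_{t-1}
    return x + gamma * p            # acts as a high-pass filter along the sequence axis

rng = np.random.default_rng(0)
q = rng.normal(size=(10, 8))
q_hat = symplectic_shear(q, gamma=0.5)
```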
[368] CyIN: Cyclic Informative Latent Space for Bridging Complete and Incomplete Multimodal Learning
Ronghao Lin, Qiaolin He, Sijie Mai, Ying Zeng, Aolin Xiong, Li Huang, Yap-Peng Tan, Haifeng Hu
Main category: cs.LG
TL;DR: CyIN framework addresses multimodal learning with dynamic missing modalities by creating informative latent spaces and cross-modal cyclic translation to handle incomplete inputs.
Details
Motivation: Real-world multimodal deployments face unpredictable modality availability, causing performance drops in models trained on perfectly paired data. Need robust models that handle dynamic missing modalities.
Method: Cyclic INformative Learning (CyIN) builds informative latent spaces using token- and label-level Information Bottleneck cyclically across modalities. Uses cross-modal cyclic translation to reconstruct missing modalities through forward/reverse propagation.
Result: Extensive experiments on 4 multimodal datasets show superior performance in both complete and diverse incomplete scenarios compared to previous methods.
Conclusion: CyIN successfully bridges complete and incomplete multimodal learning in one unified model, achieving robustness to dynamic missing modalities through informative latent spaces and cyclic translation.
Abstract: Multimodal machine learning, mimicking the human brain’s ability to integrate various modalities, has seen rapid growth. Most previous multimodal models are trained on perfectly paired multimodal input to reach optimal performance. In real-world deployments, however, the presence of modalities is highly variable and unpredictable, causing pre-trained models to suffer significant performance drops and fail to remain robust under dynamic missing-modality conditions. In this paper, we present a novel Cyclic INformative Learning framework (CyIN) to bridge the gap between complete and incomplete multimodal learning. Specifically, we first build an informative latent space by adopting token- and label-level Information Bottleneck (IB) cyclically among various modalities. Capturing task-related features with variational approximation, the informative bottleneck latents are purified for more efficient cross-modal interaction and multimodal fusion. Moreover, to supplement the missing information caused by incomplete multimodal input, we propose cross-modal cyclic translation, which reconstructs the missing modalities from the remaining ones through a forward and reverse propagation process. With the help of the extracted and reconstructed informative latents, CyIN succeeds in jointly optimizing complete and incomplete multimodal learning in one unified model. Extensive experiments on 4 multimodal datasets demonstrate the superior performance of our method in both complete and diverse incomplete scenarios.
[369] Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
Eitan Sprejer, Oscar Agustín Stanchi, María Victoria Carro, Denise Alejandra Mester, Iván Arcuschin
Main category: cs.LG
TL;DR: Feature steering methods successfully control LLM behaviors but cause severe performance degradation on knowledge tasks, making simple prompting more practical for real-world deployment.
Details
Motivation: To evaluate the practical effectiveness of feature steering methods for controlling LLM behavior, particularly understanding the trade-offs between behavioral control and output quality in real-world applications.
Method: Evaluated Goodfire’s Auto Steer against prompt engineering baselines across 14 steering queries covering various behaviors, tested on 171 MMLU questions using Llama-8B and Llama-70B models, measuring accuracy, coherence, and behavioral control.
Result: Auto Steer successfully modifies target behaviors (3.33 vs. 2.98 for prompting on Llama-8B; 3.57 vs. 3.10 on Llama-70B), but causes dramatic performance degradation: accuracy drops from 66% to 46% on Llama-8B and 87% to 73% on Llama-70B, with coherence falling significantly.
Conclusion: Simple prompting achieves the best overall balance, highlighting limitations of current feature steering methods for practical deployment where task performance cannot be sacrificed, revealing fundamental capability-behavior trade-offs.
Abstract: Feature steering has emerged as a promising approach for controlling LLM behavior through direct manipulation of internal representations, offering advantages over prompt engineering. However, its practical effectiveness in real-world applications remains poorly understood, particularly regarding potential trade-offs with output quality. We show that feature steering methods substantially degrade model performance even when successfully controlling target behaviors, a critical trade-off. Specifically, we evaluate Goodfire’s Auto Steer against prompt engineering baselines across 14 steering queries (covering innocuous and safety-relevant behaviors) on 171 Massive Multitask Language Understanding (MMLU) questions using Llama-8B and Llama-70B, measuring accuracy, coherence, and behavioral control. Our findings show that Auto Steer successfully modifies target behaviors (achieving scores of 3.33 vs. 2.98 for prompting on Llama-8B and 3.57 vs. 3.10 on Llama-70B), but causes dramatic performance degradation: accuracy on the MMLU questions drops from 66% to 46% on Llama-8B and 87% to 73% on Llama-70B, with coherence falling from 4.62 to 2.24 and 4.94 to 3.89 respectively. Simple prompting achieves the best overall balance. These findings highlight limitations of current feature steering methods for practical deployment where task performance cannot be sacrificed. More broadly, our work demonstrates that mechanistic control methods face fundamental capability-behavior trade-offs that must be empirically characterized before deployment.
[370] Knowing When to Answer: Adaptive Confidence Refinement for Reliable Audio-Visual Question Answering
Dinh Phu Tran, Jihoon Jeong, Saad Wazir, Seongah Kim, Thao Do, Cem Subakan, Daeyoung Kim
Main category: cs.LG
TL;DR: A method for reliable audio-visual question answering that prefers abstention over incorrect answers, using adaptive confidence refinement to improve confidence calibration.
Details
Motivation: Current AVQA models have high accuracy but lack reliable mechanisms to identify when they're likely wrong and abstain from answering. There's a need for reliable AVQA systems that can recognize their own limitations.
Method: Proposes Adaptive Confidence Refinement (ACR) - a lightweight method that maintains Maximum Softmax Probability as primary confidence signal but applies input-adaptive residual corrections when MSP is unreliable. Uses two learned heads: Residual Risk Head (predicts correctness residuals) and Confidence Gating Head (determines MSP trustworthiness).
Result: ACR consistently outperforms existing methods on in-distribution, out-of-distribution, and data bias settings across three different AVQA architectures, establishing a solid foundation for reliable AVQA.
Conclusion: The proposed ACR method effectively enhances AVQA reliability by improving confidence calibration, enabling models to better recognize when to abstain rather than answer incorrectly.
Abstract: We present a formal problem formulation for Reliable Audio-Visual Question Answering ($\mathcal{R}$-AVQA), where we prefer abstention over answering incorrectly. While recent AVQA models have high accuracy, their ability to identify when they are likely wrong and their consequent abstention from answering remain underexplored areas of research. To fill this gap, we explore several approaches and then propose Adaptive Confidence Refinement (ACR), a lightweight method to further enhance the performance of $\mathcal{R}$-AVQA. Our key insight is that the Maximum Softmax Probability (MSP) is Bayes-optimal only under strong calibration, a condition usually not met in deep neural networks, particularly in multimodal models. Instead of replacing MSP, our ACR maintains it as a primary confidence signal and applies input-adaptive residual corrections when MSP is deemed unreliable. ACR introduces two learned heads: i) a Residual Risk Head that predicts low-magnitude correctness residuals that MSP does not capture, and ii) a Confidence Gating Head to determine MSP trustworthiness. Our experiments and theoretical analysis show that ACR consistently outperforms existing methods in in-distribution, out-of-distribution, and data-bias settings across three different AVQA architectures, establishing a solid foundation for the $\mathcal{R}$-AVQA task. The code and checkpoints will be available upon acceptance at https://github.com/PhuTran1005/R-AVQA
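To make the two-head idea concrete, here is a minimal sketch of gated, residual-corrected confidence on top of MSP; the exact blending rule, head architectures, and abstention threshold are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACRHeads(nn.Module):
    """Two lightweight heads on top of a (frozen) AVQA backbone's pooled features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.residual_head = nn.Linear(feat_dim, 1)   # predicts a small correctness residual
        self.gate_head = nn.Linear(feat_dim, 1)       # predicts how trustworthy MSP is

    def refined_confidence(self, logits, features):
        msp = F.softmax(logits, dim=-1).max(dim=-1).values
        residual = torch.tanh(self.residual_head(features)).squeeze(-1)
        gate = torch.sigmoid(self.gate_head(features)).squeeze(-1)
        # Keep MSP as the primary signal; apply the residual only where the gate
        # deems MSP unreliable.  This particular blend is an illustrative assumption.
        return gate * msp + (1.0 - gate) * (msp + residual).clamp(0.0, 1.0)

heads = ACRHeads(feat_dim=512)
conf = heads.refined_confidence(torch.randn(4, 42), torch.randn(4, 512))
abstain = conf < 0.5   # answer only when the refined confidence clears a chosen threshold
```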
[371] Training Data Efficiency in Multimodal Process Reward Models
Jinyuan Li, Chengsong Huang, Langlin Huang, Shaoyang Xu, Haolin Liu, Wenxuan Zhang, Jiaxin Huang
Main category: cs.LG
TL;DR: Proposes Balanced-Information Score (BIS) for efficient data selection in Multimodal Process Reward Models, achieving full-data performance with only 10% of training data.
Details
Motivation: Training Multimodal Process Reward Models (MPRMs) requires expensive Monte Carlo-annotated corpora, but preliminary experiments show training saturates quickly with random subsampling, indicating data redundancy. Need more efficient data selection methods.
Method: Develops theoretical framework revealing informative gradient updates depend on label mixtures (positive/negative steps) and label reliability (average MC scores of positive steps). Proposes Balanced-Information Score (BIS) that prioritizes both mixture and reliability using existing MC signals at rollout level without additional cost.
Result: BIS-selected subsets consistently match and surpass full-data performance at small fractions across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench. Notably achieves full-data performance using only 10% of training data, improving over random subsampling by relative 4.1%.
Conclusion: BIS provides efficient data selection for MPRM training, significantly reducing training costs while maintaining or improving performance, addressing data efficiency challenges in multimodal reasoning supervision.
Abstract: Multimodal Process Reward Models (MPRMs) are central to step-level supervision for visual reasoning in MLLMs. Training MPRMs typically requires large-scale Monte Carlo (MC)-annotated corpora, incurring substantial training cost. This paper studies the data efficiency for MPRM training. Our preliminary experiments reveal that MPRM training quickly saturates under random subsampling of the training data, indicating substantial redundancy within existing MC-annotated corpora. To explain this, we formalize a theoretical framework and reveal that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability (average MC scores of positive steps). Guided by these insights, we propose the Balanced-Information Score (BIS), which prioritizes both mixture and reliability based on existing MC signals at the rollout level, without incurring any additional cost. Across two backbones (InternVL2.5-8B and Qwen2.5-VL-7B) on VisualProcessBench, BIS-selected subsets consistently match and even surpass the full-data performance at small fractions. Notably, the BIS subset reaches full-data performance using only 10% of the training data, improving over random subsampling by a relative 4.1%.
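The selection idea can be sketched with a toy scoring function. The snippet below is a hypothetical instantiation of a rollout-level score that rewards both a balanced positive/negative step mixture and reliable positives; the paper's actual BIS formula is not reproduced here.

```python
import numpy as np

def balanced_information_score(mc_scores, threshold=0.5):
    """Hypothetical rollout-level score combining label mixture and label reliability.

    mc_scores: per-step Monte Carlo success estimates in [0, 1] for one rollout.
    """
    mc_scores = np.asarray(mc_scores, dtype=float)
    positive = mc_scores >= threshold
    pos_frac = positive.mean()
    mixture = 1.0 - abs(2.0 * pos_frac - 1.0)            # 1 when balanced, 0 when one-sided
    reliability = mc_scores[positive].mean() if positive.any() else 0.0
    return mixture * reliability

rollouts = [[0.9, 0.8, 0.2, 0.1], [0.95, 0.9, 0.85, 0.8], [0.1, 0.2, 0.05, 0.3]]
ranked = sorted(range(len(rollouts)),
                key=lambda i: balanced_information_score(rollouts[i]), reverse=True)
subset = ranked[: max(1, len(rollouts) // 10)]            # keep e.g. the top 10% for training
```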
[372] LISA: Laplacian In-context Spectral Analysis
Julio Candanedo
Main category: cs.LG
TL;DR: LISA enables inference-time adaptation of Laplacian-based time-series models using only observed data prefixes, combining delay-coordinate embeddings with spectral learning for improved forecasting under changing dynamics.
Details
Motivation: The paper addresses the need for time-series models that can adapt to changing dynamics at inference time without retraining, using only observed data prefixes to improve forecasting performance.
Method: LISA combines delay-coordinate embeddings with Laplacian spectral learning to create diffusion-coordinate state representations, then uses lightweight latent-space residual adapters (Gaussian-process regression or attention-like Markov operators) for inference-time adaptation.
Result: LISA improves over frozen baselines in forecasting and autoregressive rollout experiments, showing particular benefits under changing dynamics conditions.
Conclusion: The work successfully links in-context adaptation to nonparametric spectral methods for dynamical systems, providing a practical approach for inference-time adaptation of time-series models.
Abstract: We propose Laplacian In-context Spectral Analysis (LISA), a method for inference-time adaptation of Laplacian-based time-series models using only an observed prefix. LISA combines delay-coordinate embeddings and Laplacian spectral learning to produce diffusion-coordinate state representations, together with a frozen nonlinear decoder for one-step prediction. We introduce lightweight latent-space residual adapters based on either Gaussian-process regression or an attention-like Markov operator over context windows. Across forecasting and autoregressive rollout experiments, LISA improves over the frozen baseline and is often most beneficial under changing dynamics. This work links in-context adaptation to nonparametric spectral methods for dynamical systems.
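The representation side of this pipeline (delay embedding plus Laplacian spectral coordinates) is standard and easy to sketch; the decoder and the residual adapters are omitted, and the embedding dimension, delay, and kernel bandwidth below are illustrative assumptions.

```python
import numpy as np

def delay_embed(x, dim=3, tau=1):
    """Delay-coordinate embedding of a scalar time series x."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i: i + n] for i in range(0, dim * tau, tau)], axis=1)

def diffusion_coordinates(points, eps=1.0, k=5):
    """Leading nontrivial eigenvectors of a random-walk graph Laplacian (diffusion-map style)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    P = W / W.sum(axis=1, keepdims=True)             # row-stochastic affinity matrix
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    return vecs[:, order[1: k + 1]].real              # drop the trivial constant eigenvector

t = np.linspace(0, 20 * np.pi, 400)
series = np.sin(t) + 0.05 * np.random.default_rng(0).normal(size=t.shape)
states = diffusion_coordinates(delay_embed(series, dim=5, tau=2), eps=0.5)
```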
[373] Physics as the Inductive Bias for Causal Discovery
Jianhong Chen, Naichen Shi, Xubo Yue
Main category: cs.LG
TL;DR: Integrative causal discovery framework for dynamical systems that combines partial physical knowledge (ODEs) with data-driven causal discovery using stochastic differential equations
Details
Motivation: Real-world dynamical systems often have feedback, cyclic interactions, and non-stationary trends, but many causal discovery methods assume acyclicity or equilibrium. Integrating physics-based models (ODEs) with data-driven causal discovery could improve identifiability, stability, and robustness.
Method: Model system evolution as a stochastic differential equation (SDE) where drift term encodes known ODE dynamics and diffusion term corresponds to unknown causal couplings. Develop scalable sparsity-inducing MLE algorithm that exploits causal graph structure for efficient parameter estimation.
Result: Experiments on dynamical systems with diverse causal structures show improved causal graph recovery and more stable, physically consistent estimates compared to purely data-driven state-of-the-art baselines.
Conclusion: The proposed framework successfully integrates partial physical knowledge as inductive bias for causal discovery in dynamical systems, overcoming limitations of purely data-driven approaches while handling complex system characteristics like feedback and non-stationarity.
Abstract: Causal discovery is often a data-driven paradigm to analyze complex real-world systems. In parallel, physics-based models such as ordinary differential equations (ODEs) provide mechanistic structure for many dynamical processes. Integrating these paradigms potentially allows physical knowledge to act as an inductive bias, improving identifiability, stability, and robustness of causal discovery in dynamical systems. However, such integration remains challenging: real dynamical systems often exhibit feedback, cyclic interactions, and non-stationary data trend, while many widely used causal discovery methods are formulated under acyclicity or equilibrium-based assumptions. In this work, we propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge as an inductive bias. Specifically, we model system evolution as a stochastic differential equation (SDE), where the drift term encodes known ODE dynamics and the diffusion term corresponds to unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing MLE algorithm that exploits causal graph structure for efficient parameter estimation. Under mild conditions, we establish guarantees to recover the causal graph. Experiments on dynamical systems with diverse causal structures show that our approach improves causal graph recovery and produces more stable, physically consistent estimates than purely data-driven state-of-the-art baselines.
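To see the generative model this framework assumes, here is a minimal Euler-Maruyama simulation where the drift encodes known physics and a sparse diffusion matrix plays the role of the unknown causal couplings; the estimation step (sparsity-inducing MLE) is not shown, and the toy oscillator and coupling values are assumptions for illustration.

```python
import numpy as np

def simulate_sde(x0, known_drift, coupling, dt=0.01, steps=1000, rng=None):
    """Euler-Maruyama simulation of dx = f_known(x) dt + B dW, with B the unknown couplings."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        dw = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x + known_drift(x) * dt + coupling @ dw
        traj.append(x.copy())
    return np.array(traj)

# Known physics: a damped linear oscillator; unknown coupling: a sparse matrix B.
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
B = np.array([[0.05, 0.0], [0.3, 0.05]])     # nonzero B[1, 0] means variable 0 drives variable 1
traj = simulate_sde([1.0, 0.0], lambda x: A @ x, B)
```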
[374] Temporal Pair Consistency for Variance-Reduced Flow Matching
Chika Maduabuchi, Jindong Wang
Main category: cs.LG
TL;DR: TPC introduces temporal pair consistency for flow matching models, reducing gradient variance by coupling velocity predictions at paired timesteps without modifying architecture or probability paths.
Details
Motivation: Continuous-time generative models suffer from high estimator variance due to independent timestep training, leading to inefficient sampling. Existing solutions require architectural changes or modified probability paths.
Method: Temporal Pair Consistency (TPC) couples velocity predictions at paired timesteps along the same probability path, providing trajectory-coupled regularization that reduces gradient variance while preserving the flow-matching objective.
Result: TPC improves sample quality and efficiency on CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods, and extends to modern pipelines.
Conclusion: TPC offers a lightweight variance-reduction principle that enhances flow matching models without architectural modifications, improving both sample quality and computational efficiency.
Abstract: Continuous-time generative models, such as diffusion models, flow matching, and rectified flow, learn time-dependent vector fields but are typically trained with objectives that treat timesteps independently, leading to high estimator variance and inefficient sampling. Prior approaches mitigate this via explicit smoothness penalties, trajectory regularization, or modified probability paths and solvers. We introduce Temporal Pair Consistency (TPC), a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path, operating entirely at the estimator level without modifying the model architecture, probability path, or solver. We provide a theoretical analysis showing that TPC induces a quadratic, trajectory-coupled regularization that provably reduces gradient variance while preserving the underlying flow-matching objective. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions, achieving lower FID at identical or lower computational cost than prior methods, and extends seamlessly to modern SOTA-style pipelines with noise-augmented training, score-based denoising, and rectified flow.
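As a sketch of what "coupling velocity predictions at paired timesteps" looks like as an estimator-level loss, the snippet below adds a consistency term between two random timesteps on the same straight-line (rectified-flow-style) path; the pairing scheme and weighting are illustrative assumptions, not the paper's exact objective.

```python
import torch

def tpc_loss(model, x0, x1, lam=0.1):
    """Flow-matching loss plus a penalty coupling velocities at two timesteps on the same path."""
    t = torch.rand(x0.shape[0], 1)
    s = torch.rand(x0.shape[0], 1)
    target_v = x1 - x0                                 # constant velocity of the linear path
    xt = (1 - t) * x0 + t * x1
    xs = (1 - s) * x0 + s * x1
    vt = model(xt, t)
    vs = model(xs, s)
    fm = ((vt - target_v) ** 2).mean()                 # standard conditional flow-matching term
    consistency = ((vt - vs) ** 2).mean()              # estimator-level coupling of paired timesteps
    return fm + lam * consistency

mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.SiLU(), torch.nn.Linear(64, 2))
net = lambda x, t: mlp(torch.cat([x, t], dim=-1))      # velocity field v(x, t)
loss = tpc_loss(net, torch.randn(32, 2), torch.randn(32, 2))
```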
[375] Comparing Euclidean and Hyperbolic K-Means for Generalized Category Discovery
Mohamad Dalal, Thomas B. Moeslund, Joakim Bruslund Haurum
Main category: cs.LG
TL;DR: HC-GCD introduces hyperbolic clustering for Generalized Category Discovery, learning embeddings in hyperbolic space and clustering them directly with hyperbolic K-Means instead of transforming back to Euclidean space.
Details
Motivation: Prior hyperbolic GCD methods only use hyperbolic geometry for representation learning but transform back to Euclidean geometry for clustering, which is hypothesized to be suboptimal. The authors aim to explore whether direct hyperbolic clustering can improve performance.
Method: HC-GCD learns embeddings in the Lorentz Hyperboloid model of hyperbolic geometry and clusters these embeddings directly in hyperbolic space using a hyperbolic K-Means algorithm, maintaining hyperbolic geometry throughout the entire pipeline.
Result: HC-GCD performs on par with previous state-of-the-art hyperbolic GCD methods on Semantic Shift Benchmark datasets. Hyperbolic K-Means leads to better accuracy than Euclidean K-Means, and ablation studies show that clipping Euclidean embedding norms affects seen/unseen class accuracy differently.
Conclusion: Direct hyperbolic clustering is a viable approach for GCD, with hyperbolic K-Means providing more consistent clusters across varying label granularities and better overall performance than Euclidean clustering in hyperbolic space.
Abstract: Hyperbolic representation learning has been widely used to extract implicit hierarchies within data, and recently it has found its way to the open-world classification task of Generalized Category Discovery (GCD). However, prior hyperbolic GCD methods only use hyperbolic geometry for representation learning and transform back to Euclidean geometry when clustering. We hypothesize this is suboptimal. Therefore, we present Hyperbolic Clustered GCD (HC-GCD), which learns embeddings in the Lorentz Hyperboloid model of hyperbolic geometry, and clusters these embeddings directly in hyperbolic space using a hyperbolic K-Means algorithm. We test our model on the Semantic Shift Benchmark datasets, and demonstrate that HC-GCD is on par with the previous state-of-the-art hyperbolic GCD method. Furthermore, we show that using hyperbolic K-Means leads to better accuracy than Euclidean K-Means. We carry out ablation studies showing that clipping the norm of the Euclidean embeddings leads to decreased accuracy in clustering unseen classes, and increased accuracy for seen classes, while the overall accuracy is dataset dependent. We also show that using hyperbolic K-Means leads to more consistent clusters when varying the label granularity.
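For readers unfamiliar with clustering in the Lorentz model, here is a minimal hyperbolic K-Means sketch using the Lorentzian distance and a standard hyperboloid projection for the centroid update; it is a generic illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + sum_i xi*yi."""
    return -x[..., 0] * y[..., 0] + (x[..., 1:] * y[..., 1:]).sum(-1)

def lorentz_dist(x, y):
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def to_hyperboloid(v):
    """Lift Euclidean vectors v onto the Lorentz hyperboloid <x, x>_L = -1."""
    x0 = np.sqrt(1.0 + (v ** 2).sum(-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def hyperbolic_kmeans(x, k, iters=20, rng=None):
    rng = rng or np.random.default_rng(0)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        d = lorentz_dist(x[:, None, :], centers[None, :, :])
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = x[assign == j]
            if len(pts) == 0:
                continue
            m = pts.sum(axis=0)
            # Project the ambient mean back onto the hyperboloid (a common centroid choice).
            centers[j] = m / np.sqrt(np.clip(-lorentz_inner(m, m), 1e-9, None))
    return assign, centers

x = to_hyperboloid(np.random.default_rng(1).normal(size=(200, 8)))
assign, centers = hyperbolic_kmeans(x, k=5)
```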
[376] Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment
Youngjae Cho, Jongsuk Kim, Ji-Hoon Kim
Main category: cs.LG
TL;DR: GAPO introduces a dynamic, geometry-aware anchor instead of fixed reference policy for preference optimization, improving robustness to noise by adaptively reweighting preference pairs based on local sensitivity.
Details
Motivation: DPO and related methods use fixed reference policies that become miscalibrated as policies drift, causing distributional mismatch and amplifying noise. Reference-free variants avoid mismatch but suffer from unconstrained reward drift.
Method: GAPO replaces fixed reference with dynamic, geometry-aware anchor: adversarial local perturbation of current policy within small radius serving as pessimistic baseline. Uses adaptive reweighting mechanism based on local sensitivity, introducing Anchor Gap metric to approximate worst-case local margin degradation.
Result: Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
Conclusion: GAPO provides more robust preference optimization by using dynamic anchors that adapt to policy drift, addressing limitations of both fixed-reference and reference-free approaches.
Abstract: Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
[377] A logical re-conception of neural networks: Hamiltonian bitwise part-whole architecture
E Bowen, R Granger, A Rodriguez
Main category: cs.LG
TL;DR: A novel graph-based architecture using simple relational encodings and graph-Hamiltonian operators for symbolic computation with linear computational scaling.
Details
Motivation: To create a system that directly represents relations (like part-whole) through fundamentally different architecture and learning rules than standard ANNs, enabling intrinsic relational encoding rather than as an add-on feature.
Method: Encodes arbitrary data as graphs with edges representing primitive pairwise relations, uses graph-Hamiltonian operators to calculate energies with ground states satisfying all relation constraints, employs radically low-precision arithmetic for efficiency.
Result: The system processes standard ANN examples while producing symbolic-like representations, identifies logical relational structures, builds hierarchical representations enabling abductive inference, and scales linearly with number of edges.
Conclusion: The architecture bridges neural and symbolic computation, offering a novel approach to semantic representation that could inform current work on higher-level semantic understanding in AI systems.
Abstract: We introduce a simple initial working system in which relations (such as part-whole) are directly represented via an architecture with operating and learning rules fundamentally distinct from standard artificial neural network methods. Arbitrary data are straightforwardly encoded as graphs whose edges correspond to codes from a small fixed primitive set of elemental pairwise relations, such that simple relational encoding is not an add-on, but occurs intrinsically within the most basic components of the system. A novel graph-Hamiltonian operator calculates energies among these encodings, with ground states denoting simultaneous satisfaction of all relation constraints among graph vertices. The method solely uses radically low-precision arithmetic; computational cost is correspondingly low, and scales linearly with the number of edges in the data. The resulting unconventional architecture can process standard ANN examples, but also produces representations that exhibit characteristics of symbolic computation. Specifically, the method identifies simple logical relational structures in these data (part-of; next-to), building hierarchical representations that enable abductive inferential steps generating relational position-based encodings, rather than solely statistical representations. Notably, an equivalent set of ANN operations are derived, identifying a special case of embedded vector encodings that may constitute a useful approach to current work in higher-level semantic representation. The very simple current state of the implemented system invites additional tools and improvements.
[378] SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel
Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski
Main category: cs.LG
TL;DR: SLAY is a new linear-time attention mechanism using spherical Yat-kernels that constrains queries/keys to unit sphere for angular alignment, achieving near-softmax performance with linear scaling.
Details
Motivation: To create scalable Transformers without performance trade-offs by developing linear-time attention mechanisms that closely approximate standard softmax attention while maintaining computational efficiency.
Method: Uses spherical Yat-kernels with queries/keys constrained to unit sphere, expresses kernel as nonnegative mixture of polynomial-exponential product kernels via Bernstein’s theorem, and derives positive random-feature approximation for linear-time O(L) attention.
Result: SLAY achieves performance nearly indistinguishable from standard softmax attention with linear time/memory scaling, consistently outperforms prior linear-time attention mechanisms like Performers and Cosformers.
Conclusion: SLAY represents the closest linear-time approximation to softmax attention to date, enabling scalable Transformers without typical performance trade-offs of attention linearization.
Abstract: We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein’s theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time O(L) attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.
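The linear-time mechanics behind any positive random-feature attention are easy to illustrate. The sketch below uses generic Performer-style positive features with unit-sphere-normalized queries and keys; it is not the paper's Yat-kernel feature map, only a demonstration of how O(L) attention is assembled from such features.

```python
import numpy as np

def positive_random_features(x, W):
    """Performer-style positive features phi(x) = exp(Wx - |x|^2 / 2) / sqrt(m)."""
    m = W.shape[0]
    proj = x @ W.T
    return np.exp(proj - 0.5 * (x ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

def linear_attention(q, k, v, n_features=64, rng=None):
    rng = rng or np.random.default_rng(0)
    # Unit-sphere normalization so attention depends only on angular alignment.
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k = k / np.linalg.norm(k, axis=-1, keepdims=True)
    W = rng.normal(size=(n_features, q.shape[-1]))
    phi_q, phi_k = positive_random_features(q, W), positive_random_features(k, W)
    kv = phi_k.T @ v                                   # (m, d_v): linear in sequence length
    normalizer = phi_q @ phi_k.sum(axis=0)             # (L,)
    return (phi_q @ kv) / normalizer[:, None]

L, d = 128, 32
rng = np.random.default_rng(2)
out = linear_attention(rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d)))
```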
[379] Multi-Aspect Mining and Anomaly Detection for Heterogeneous Tensor Streams
Soshi Kakio, Yasuko Matsubara, Ren Fujiwara, Yasushi Sakurai
Main category: cs.LG
TL;DR: HeteroComp is a method for summarizing heterogeneous tensor streams (with both categorical and continuous attributes) and detecting group anomalies using Gaussian process priors to model unknown distributions and temporal dynamics.
Details
Motivation: Existing tensor decomposition and anomaly detection methods cannot handle heterogeneous tensor streams with both categorical and continuous attributes, and they fail to track temporal dynamics by discretizing timestamps, making them ineffective for detecting group anomalies like DoS attacks.
Method: Proposes HeteroComp which uses Gaussian process priors to model unknown distributions of continuous attributes and temporal dynamics, directly estimating probability densities from data to continuously summarize heterogeneous tensor streams into components representing latent groups.
Result: Extensive experiments on real datasets show HeteroComp outperforms state-of-the-art algorithms for group anomaly detection accuracy, and its computational time does not depend on data stream length.
Conclusion: HeteroComp effectively addresses limitations of existing methods by handling heterogeneous attributes and temporal dynamics for accurate group anomaly detection in tensor streams.
Abstract: Analysis and anomaly detection in event tensor streams consisting of timestamps and multiple attributes - such as communication logs (time, IP address, packet length) - are essential tasks in data mining. While existing tensor decomposition and anomaly detection methods provide useful insights, they face the following two limitations. (i) They cannot handle heterogeneous tensor streams, which comprise both categorical attributes (e.g., IP address) and continuous attributes (e.g., packet length). They typically require either discretizing continuous attributes or treating categorical attributes as continuous, both of which distort the underlying statistical properties of the data. Furthermore, incorrect assumptions about the distribution family of continuous attributes often degrade the model’s performance. (ii) They discretize timestamps, failing to track the temporal dynamics of streams (e.g., trends, abnormal events), which makes them ineffective for detecting anomalies at the group level, referred to as ‘group anomalies’ (e.g., DoS attacks). To address these challenges, we propose HeteroComp, a method for continuously summarizing heterogeneous tensor streams into ‘components’ representing latent groups in each attribute and their temporal dynamics, and detecting group anomalies. Our method employs Gaussian process priors to model unknown distributions of continuous attributes, and temporal dynamics, which directly estimate probability densities from data. Extracted components give concise but effective summarization, enabling accurate group anomaly detection. Extensive experiments on real datasets demonstrate that HeteroComp outperforms the state-of-the-art algorithms for group anomaly detection accuracy, and its computational time does not depend on the data stream length.
[380] Simulated Adoption: Decoupling Magnitude and Direction in LLM In-Context Conflict Resolution
Long Zhang, Fangwei Lin
Main category: cs.LG
TL;DR: LLMs prioritize conflicting in-context information over parametric memory (sycophancy), and this study reveals it’s caused by orthogonal geometric interference rather than magnitude dilution.
Details
Motivation: To understand the mechanistic basis of how LLMs resolve knowledge conflicts through compliance/sycophancy - specifically whether this suppression arises from signal magnitude dilution or directional geometric alteration within the residual stream.
Method: Conducted layer-wise geometric analysis across Qwen-4B, Llama-3.1-8B, and GLM-4-9B, decomposing residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components.
Result: Rejected the “Manifold Dilution” hypothesis - two of three architectures maintained stable residual norms despite performance degradation. Compliance is characterized by “Orthogonal Interference” where conflicting context injects a steering vector quasi-orthogonal to ground-truth direction, rotating hidden state representations.
Conclusion: Models don’t “unlearn” or suppress internal truth magnitude but use geometric displacement to bypass correct unembedding vectors, simulating adoption while preserving original structural magnitude. This challenges scalar confidence metrics and requires vectorial monitoring to distinguish genuine knowledge from in-context mimicry.
Abstract: Large Language Models (LLMs) frequently prioritize conflicting in-context information over pre-existing parametric memory, a phenomenon often termed sycophancy or compliance. However, the mechanistic realization of this behavior remains obscure, specifically how the model resolves these knowledge conflicts through compliance, and whether this suppression arises from signal magnitude dilution or directional geometric alteration within the residual stream. To resolve this, we conducted a layer-wise geometric analysis across Qwen-4B, Llama-3.1-8B, and GLM-4-9B, decomposing the residual stream updates induced by counter-factual contexts into radial (norm-based) and angular (cosine-based) components. Our empirical results reject the universality of the “Manifold Dilution” hypothesis, as two of the three architectures maintained stable residual norms despite exhibiting significant performance degradation on factual queries. Instead, we observed that compliance is consistently characterized by “Orthogonal Interference,” where the conflicting context injects a steering vector that is quasi-orthogonal to the ground-truth direction, effectively rotating the hidden state representation. This suggests that models do not “unlearn” or suppress the magnitude of internal truths but rather employ a mechanism of geometric displacement to bypass the correct unembedding vector, effectively simulating adoption while preserving the original structural magnitude. These findings challenge scalar confidence metrics for detecting hallucinations and underscore the necessity of vectorial monitoring to distinguish between genuine knowledge integration and superficial in-context mimicry.
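The radial/angular decomposition is a simple per-layer diagnostic. Below is a minimal sketch of it on a single hidden state; the hidden size, the "truth direction," and the toy perturbation are hypothetical placeholders, not the paper's probes.

```python
import numpy as np

def radial_angular_decomposition(h_clean, h_conflict, truth_direction):
    """Split the update induced by a conflicting context into a radial (norm ratio)
    and an angular (cosine to the ground-truth direction) component."""
    update = h_conflict - h_clean
    radial = np.linalg.norm(h_conflict) / np.linalg.norm(h_clean)
    cos = update @ truth_direction / (np.linalg.norm(update) * np.linalg.norm(truth_direction))
    return radial, cos   # radial ~ 1 with cos ~ 0 would indicate orthogonal interference

rng = np.random.default_rng(0)
h_clean, truth = rng.normal(size=4096), rng.normal(size=4096)
h_conflict = h_clean + 0.8 * rng.normal(size=4096)        # toy stand-in for a steered hidden state
print(radial_angular_decomposition(h_clean, h_conflict, truth))
```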
[381] Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog
Yiran Zhao, Shengyang Zhou, Zijian Wu, Tongyan Hu, Yuhui Xu, Rengan Dou, Kenji Kawaguchi, Shafiq Joty, Junnan Li, Michael Qizhe Shieh
Main category: cs.LG
TL;DR: A gradual compression method called Prune-Tune Loop (PTL) that incrementally reduces LLM size through multiple fine-grained iterations, maintaining reasoning performance with lightweight post-training.
Details
Motivation: LLMs have impressive reasoning capabilities but require substantial computational resources. Conventional pruning methods cause dramatic performance drops in reasoning tasks and need extensive post-training to recover capabilities.
Method: Proposes Prune-Tune Loop (PTL) that divides compression into multiple fine-grained iterations, applying prune-tune cycles at each stage to incrementally reduce model size while restoring performance with finetuning, similar to the “boiling frog” effect.
Result: PTL can compress LLMs to nearly half their original size with only lightweight post-training while maintaining performance comparable to original models on reasoning tasks. Works with various pruning strategies (neuron/layer pruning) and post-training methods (continual pre-training, RL).
Conclusion: PTL provides an effective gradual compression approach that maintains LLM reasoning capabilities while significantly reducing model size, with broad applicability across different tasks including mathematical reasoning and code generation.
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources. To reduce resource consumption and accelerate inference, it is essential to eliminate redundant parameters without compromising performance. However, conventional pruning methods that directly remove such parameters often lead to a dramatic drop in model performance in reasoning tasks, and require extensive post-training to recover the lost capabilities. In this work, we propose a gradual compacting method that divides the compression process into multiple fine-grained iterations, applying a Prune-Tune Loop (PTL) at each stage to incrementally reduce model size while restoring performance with finetuning. This iterative approach-reminiscent of the “boiling frog” effect-enables the model to be progressively compressed without abrupt performance loss. Experimental results show that PTL can compress LLMs to nearly half their original size with only lightweight post-training, while maintaining performance comparable to the original model on reasoning tasks. Moreover, PTL is flexible and can be applied to various pruning strategies, such as neuron pruning and layer pruning, as well as different post-training methods, including continual pre-training and reinforcement learning. Additionally, experimental results confirm the effectiveness of PTL on a variety of tasks beyond mathematical reasoning, such as code generation, demonstrating its broad applicability.
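The control flow of such an iterative prune-then-finetune schedule is easy to sketch. The snippet below uses magnitude (L1) weight pruning from torch.nn.utils.prune as a stand-in pruning strategy and a placeholder finetuning callback; the paper's neuron/layer pruning and its post-training recipes are not reproduced here.

```python
import torch
import torch.nn.utils.prune as prune

def prune_tune_loop(model, finetune_fn, target_sparsity=0.5, n_stages=10):
    """Reach the target sparsity through many small prune -> finetune stages,
    instead of one aggressive cut followed by heavy post-training."""
    # Per-stage fraction of the *remaining* weights so the stages compound to the target.
    per_stage = 1.0 - (1.0 - target_sparsity) ** (1.0 / n_stages)
    for _ in range(n_stages):
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=per_stage)
        finetune_fn(model)          # lightweight recovery finetuning after each small cut
    return model

# Toy usage: in practice the callback would run a short continual-pretraining or RL pass.
toy = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 4))
prune_tune_loop(toy, finetune_fn=lambda m: None, target_sparsity=0.5, n_stages=5)
```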
[382] High-probability Convergence Guarantees of Decentralized SGD
Aleksandar Armacki, Ali H. Sayed
Main category: cs.LG
TL;DR: Decentralized SGD achieves high-probability convergence under same conditions as mean-squared error convergence, with order-optimal rates and linear speed-up in number of users.
Details
Motivation: There's a significant gap between assumptions needed for high-probability convergence vs mean-squared error convergence in decentralized settings, unlike centralized settings where SGD converges under same conditions. Existing decentralized works require strong assumptions like uniformly bounded gradients or asymptotically vanishing noise.
Method: Analyzes Decentralized SGD (DSGD) in presence of light-tailed noise, providing technical results including variance-reduction effect of decentralized methods in high-probability sense and novel bound on MGF of strongly convex costs.
Result: DSGD converges in high-probability under same conditions as MSE convergence, achieves order-optimal rates for both non-convex and strongly convex costs, and shows linear speed-up in number of users with matching or better transient times than MSE results.
Conclusion: This work bridges the gap between HP and MSE convergence assumptions in decentralized settings, providing first demonstration of DSGD achieving linear speed-up in HP sense with relaxed assumptions and sharp rates.
Abstract: Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching, or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the MGF of strongly convex costs, which is of interest even in centralized settings. Finally, we provide experiments that validate our theory.
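For readers unfamiliar with the algorithm being analyzed, here is a minimal DSGD sketch: each user takes a local stochastic gradient step and then mixes its iterate with its neighbors' via a doubly stochastic matrix. The ring topology, noise scale, and toy quadratic cost are illustrative assumptions.

```python
import numpy as np

def dsgd(grad_fn, x0, mixing, n_users=5, steps=200, lr=0.05, rng=None):
    """Decentralized SGD: local stochastic gradient step followed by neighbor averaging."""
    rng = rng or np.random.default_rng(0)
    X = np.tile(np.asarray(x0, dtype=float), (n_users, 1))     # one iterate per user (row)
    for _ in range(steps):
        noise = rng.normal(scale=0.1, size=X.shape)             # light-tailed gradient noise
        X = mixing @ (X - lr * (grad_fn(X) + noise))             # gossip step with mixing matrix
    return X.mean(axis=0)

# Ring-topology doubly stochastic mixing matrix for 5 users.
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

# Strongly convex toy cost 0.5 * ||x - 1||^2 per user, so the gradient is x - 1.
x_star = dsgd(lambda X: X - 1.0, x0=np.zeros(3), mixing=W)
```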
[383] Imposing Boundary Conditions on Neural Operators via Learned Function Extensions
Sepehr Mousavi, Siddhartha Mishra, Laura De Lorenzis
Main category: cs.LG
TL;DR: A framework for conditioning neural operators on complex boundary conditions via function extensions, enabling existing architectures to handle diverse PDE problems with variable boundary forcings.
Details
Motivation: Neural operators struggle with general, highly variable boundary conditions, especially when solution operators are sensitive to boundary forcings. Existing approaches often fail in these challenging scenarios.
Method: Proposes mapping boundary data to latent pseudo-extensions over the entire spatial domain, allowing any standard neural operator architecture to consume boundary information. This enables learning rich dependencies on complex BCs and input domain functions simultaneously.
Result: Achieves state-of-the-art accuracy on 18 challenging datasets spanning Poisson, linear elasticity, and hyperelasticity problems with highly variable, mixed-type BCs. Outperforms baselines by large margins without hyperparameter tuning across datasets.
Conclusion: Learning boundary-to-domain extensions is an effective strategy for imposing complex BCs in neural operator frameworks, enabling accurate and robust scientific machine learning models for broader PDE-governed problems.
Abstract: Neural operators have emerged as powerful surrogates for the solution of partial differential equations (PDEs), yet their ability to handle general, highly variable boundary conditions (BCs) remains limited. Existing approaches often fail when the solution operator exhibits strong sensitivity to boundary forcings. We propose a general framework for conditioning neural operators on complex non-homogeneous BCs through function extensions. Our key idea is to map boundary data to latent pseudo-extensions defined over the entire spatial domain, enabling any standard operator learning architecture to consume boundary information. The resulting operator, coupled with an arbitrary domain-to-domain neural operator, can learn rich dependencies on complex BCs and input domain functions at the same time. To benchmark this setting, we construct 18 challenging datasets spanning Poisson, linear elasticity, and hyperelasticity problems, with highly variable, mixed-type, component-wise, and multi-segment BCs on diverse geometries. Our approach achieves state-of-the-art accuracy, outperforming baselines by large margins, while requiring no hyperparameter tuning across datasets. Overall, our results demonstrate that learning boundary-to-domain extensions is an effective and practical strategy for imposing complex BCs in existing neural operator frameworks, enabling accurate and robust scientific machine learning models for a broader range of PDE-governed problems.
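One way to picture a boundary-to-domain extension module is sketched below: sampled boundary points and values are encoded into a permutation-invariant latent, which is then decoded at every interior coordinate to produce a pseudo-extension field the downstream operator can consume as an extra channel. The architecture, pooling choice, and tensor shapes are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class BoundaryExtension(nn.Module):
    """Map sampled boundary data to a pseudo-extension field over the whole domain."""
    def __init__(self, coord_dim=2, val_dim=1, latent=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(coord_dim + val_dim, latent), nn.GELU(),
                                    nn.Linear(latent, latent))
        self.decode = nn.Sequential(nn.Linear(latent + coord_dim, latent), nn.GELU(),
                                    nn.Linear(latent, 1))

    def forward(self, bc_points, bc_values, grid):
        # bc_points: (B, Nb, 2), bc_values: (B, Nb, 1), grid: (B, N, 2) interior coordinates.
        z = self.encode(torch.cat([bc_points, bc_values], dim=-1)).mean(dim=1)  # set summary
        z = z.unsqueeze(1).expand(-1, grid.shape[1], -1)
        return self.decode(torch.cat([z, grid], dim=-1))        # pseudo-extension on the grid

ext = BoundaryExtension()
field = ext(torch.rand(4, 100, 2), torch.rand(4, 100, 1), torch.rand(4, 32 * 32, 2))
```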
[384] Internalizing LLM Reasoning via Discovery and Replay of Latent Actions
Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, Congcong Miao
Main category: cs.LG
TL;DR: STIR is a framework that internalizes chain-of-thought reasoning into dynamic latent trajectory control, improving accuracy while reducing token consumption compared to explicit reasoning generation.
Details
Motivation: Existing activation steering methods use static control vectors that don't adapt to the non-stationary evolution of complex reasoning tasks, limiting their effectiveness for internalizing reasoning processes.Method: Three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes, (2) sparse control basis construction creates a compact tool library, and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating.
Result: On six arithmetic and logical benchmarks across four models, STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding.
Conclusion: The benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing reasoning processes to bypass explicit generation while achieving superior fidelity.
Abstract: The internalization of chain-of-thought processes into hidden states has emerged as a highly efficient paradigm for scaling test-time compute. However, existing activation steering methods rely on static control vectors that fail to adapt to the non-stationary evolution of complex reasoning tasks. To address this limitation, we propose STIR (Self-Distilled Tools for Internal Reasoning), a framework that reformulates reasoning enhancement as a dynamic latent trajectory control problem. STIR introduces a synergistic three-stage pipeline: (1) differential intrinsic action induction harvests latent reasoning successes to crystallize steering primitives; (2) sparse control basis construction curates a compact, geometrically diverse tool library; and (3) value-modulated trajectory intervention dynamically injects context-specific impulses via anchor-based gating. Extensive experiments on six arithmetic and logical benchmarks across four representative models demonstrate that STIR improves average accuracy by 1.9% to 7.5% while reducing average token consumption by up to 35% compared to vanilla decoding. These findings demonstrate that the benefits of explicit chain-of-thought can be realized through dynamic latent trajectory control, internalizing the reasoning process to bypass the explicit generation while achieving superior fidelity. Our code is available at https://github.com/sznnzs/LLM-Latent-Action.
[385] Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics
Ruizhe Zhong, Jiesong Lian, Xiaoyue Mi, Zixiang Zhou, Yuan Zhou, Qinglin Lu, Junchi Yan
Main category: cs.LG
TL;DR: Euphonium: A novel RL framework for aligning flow matching models with human preferences using process reward gradient guided dynamics to steer generation and improve exploration efficiency.
Details
Motivation: Current online RL approaches for aligning flow matching models suffer from inefficient exploration during training rollouts, relying on undirected stochasticity and sparse outcome rewards, which struggle to discover high-reward samples and result in slow, data-inefficient optimization.Method: Formulates sampling process as Stochastic Differential Equation incorporating gradient of Process Reward Model into flow drift; uses Dual-Reward Group Relative Policy Optimization combining latent process rewards for credit assignment with pixel-level outcome rewards for visual fidelity; includes distillation to internalize guidance into flow network.
Result: Achieves better alignment compared to existing methods while accelerating training convergence by 1.66x in text-to-video generation experiments.
Conclusion: Euphonium provides a principled framework for efficient exploration in RL-based alignment of flow matching models, enabling dense, step-by-step steering toward high-reward regions and eliminating inference-time dependency on reward models.
Abstract: While online Reinforcement Learning has emerged as a crucial technique for aligning flow matching models with human preferences, current approaches are hindered by inefficient exploration during training rollouts. Relying on undirected stochasticity and sparse outcome rewards, these methods struggle to discover high-reward samples, resulting in data-inefficient and slow optimization. To address these limitations, we propose Euphonium, a novel framework that steers generation via process reward gradient guided dynamics. Our key insight is to formulate the sampling process as a theoretically principled Stochastic Differential Equation that explicitly incorporates the gradient of a Process Reward Model into the flow drift. This design enables dense, step-by-step steering toward high-reward regions, advancing beyond the unguided exploration in prior works, and theoretically encompasses existing sampling methods (e.g., Flow-GRPO, DanceGRPO) as special cases. We further derive a distillation objective that internalizes the guidance signal into the flow network, eliminating inference-time dependency on the reward model. We instantiate this framework with a Dual-Reward Group Relative Policy Optimization algorithm, combining latent process rewards for efficient credit assignment with pixel-level outcome rewards for final visual fidelity. Experiments on text-to-video generation show that Euphonium achieves better alignment compared to existing methods while accelerating training convergence by 1.66x.
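A toy illustration of the sampling rule follows: the gradient of a differentiable process reward is added to the drift of an Euler-Maruyama step. The velocity field, reward, and constants below are placeholders, not Euphonium's actual components.

```python
# Toy sketch of reward-gradient-guided stochastic sampling (illustrative only;
# the actual Euphonium drift, schedules, and reward model are not reproduced here).
import torch

def guided_euler_maruyama_step(x, t, velocity_fn, process_reward_fn,
                               dt=0.02, sigma=0.5, guidance_scale=1.0):
    """One Euler-Maruyama step whose drift is the flow velocity plus the
    gradient of a differentiable process reward at the current state."""
    x = x.detach().requires_grad_(True)
    reward = process_reward_fn(x, t).sum()
    grad_r = torch.autograd.grad(reward, x)[0]            # dense per-step signal
    drift = velocity_fn(x, t) + guidance_scale * grad_r   # steered drift
    noise = sigma * torch.sqrt(torch.tensor(dt)) * torch.randn_like(x)
    return (x + drift * dt + noise).detach()

# Stand-in velocity field and process reward for demonstration.
velocity = lambda x, t: -x                        # pull toward the origin
reward = lambda x, t: -(x - 2.0).pow(2).sum(-1)   # prefer states near x = 2

x = torch.randn(4, 16)
for step in range(50):
    x = guided_euler_maruyama_step(x, t=step / 50, velocity_fn=velocity,
                                   process_reward_fn=reward)
print(x.mean().item())   # drifts toward the high-reward region
```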
[386] TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation
Junhan Kim, Yeo Jeong Park, Seungwoo Son, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
Main category: cs.LG
TL;DR: TurboBoA is an efficient post-training quantization method for LLMs that improves upon BoA by accelerating quantization through joint channel quantization and error correction while maintaining accuracy.
Details
Motivation: Existing PTQ methods like GPTQ suffer from accuracy drops in low-bit regimes due to layer-wise independence assumptions, while BoA improves accuracy but is inefficient due to sequential quantization across all out-channels.Method: TurboBoA introduces three innovations: (1) joint quantization of multiple out-channels with closed-form error compensation for speedup, (2) correction mechanism for errors from preceding quantized layers, and (3) adaptive grid computation with coordinate descent refinement for alignment during iterative updates.
Result: TurboBoA achieves more than 3x speedup over BoA while improving accuracy, and combined with outlier suppression techniques, achieves SOTA results in both weight-only and weight-activation quantization.
Conclusion: TurboBoA provides an efficient backpropagation-free PTQ algorithm that preserves accuracy benefits while significantly accelerating quantization for large language models.
Abstract: The rapid growth of large language models (LLMs) has heightened the importance of post-training quantization (PTQ) for reducing memory and computation costs. Among PTQ methods, GPTQ has gained significant attention for its efficiency, enabling billion-scale LLMs to be quantized within a few GPU hours. However, GPTQ’s assumption of layer-wise independence leads to severe accuracy drops in low-bit regimes. Recently, BoA improved upon GPTQ by incorporating inter-layer dependencies within attention modules, but its reliance on sequential quantization across all out-channels makes it substantially less efficient. In this paper, we propose TurboBoA, a new backpropagation-free PTQ algorithm that preserves the accuracy benefits of BoA while significantly accelerating the process. The proposed TurboBoA introduces three key innovations: (i) joint quantization of multiple out-channels with a closed-form error compensation rule, which reduces sequential bottlenecks and yields more than a three-fold speedup; (ii) a correction mechanism for errors propagated from preceding quantized layers; and (iii) adaptive grid computation with coordinate descent refinement to maintain alignment during iterative updates. Extensive experiments demonstrate that TurboBoA delivers substantial acceleration over BoA while consistently improving accuracy. When combined with outlier suppression techniques, it achieves state-of-the-art results in both weight-only and weight-activation quantization. The code will be available at https://github.com/SamsungLabs/TurboBoA.
[387] Depth-Wise Emergence of Prediction-Centric Geometry in Large Language Models
Shahar Haim, Daniel C McNamee
Main category: cs.LG
TL;DR: Decoder-only LLMs show depth-wise transition from context processing to prediction forming, with late layers implementing structured geometric code for selective causal control over token prediction.
Details
Motivation: To understand the internal computational dynamics of decoder-only large language models, specifically how they transform context into predictions through representational geometry and mechanistic phases.Method: Combined geometric analysis with mechanistic intervention to examine depth-wise transitions in LLM representations, focusing on angular organization and norm encoding in late-layer representations.
Result: Late-layer representations implement structured geometric code where angular organization parametrizes prediction distributional similarity, while representation norms encode context-specific information that doesn’t determine prediction.
Conclusion: Provides mechanistic-geometric account of LLM dynamics showing transition from context-processing to prediction-forming phases with structured geometric coding enabling selective causal control over token prediction.
Abstract: We show that decoder-only large language models exhibit a depth-wise transition from context-processing to prediction-forming phases of computation accompanied by a reorganization of representational geometry. Using a unified framework combining geometric analysis with mechanistic intervention, we demonstrate that late-layer representations implement a structured geometric code that enables selective causal control over token prediction. Specifically, angular organization of the representation geometry parametrizes prediction distributional similarity, while representation norms encode context-specific information that does not determine prediction. Together, these results provide a mechanistic-geometric account of the dynamics of transforming context into predictions in LLMs.
[388] Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization
Davide Berasi, Matteo Farina, Massimiliano Mancini, Elisa Ricci
Main category: cs.LG
TL;DR: Model merging of domain-specific multimodal experts serves as an efficient proxy for evaluating data mixture weights, decoupling mixture optimization from expensive training runs.
Details
Motivation: Data Mixture Optimization (DMO) for multimodal LLMs is computationally expensive due to combinatorial search space and high training costs, while model merging is efficient but often suboptimal. The paper aims to combine the best of both approaches.Method: Train domain-specific multimodal experts, then evaluate weighted parameter-space combinations of these experts to estimate the performance of corresponding data mixtures. This creates proxy models that correlate with actual mixture-trained models.
Result: Extensive experiments on 14 multimodal benchmarks show that merged proxy models exhibit high rank correlation with models trained on actual data mixtures, validating the approach.
Conclusion: Model merging provides an efficient strategy for navigating mixture weight search space, decoupling optimal mixture discovery from resource-intensive training processes.
Abstract: Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
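A minimal sketch of the central mechanism, evaluating weighted parameter-space combinations of domain experts as cheap stand-ins for mixture-trained models (expert names and the evaluation call are illustrative):

```python
# Minimal sketch of weighted parameter-space merging of domain experts, used as
# a proxy for training on the corresponding data mixture.
import copy
import torch

def merge_experts(experts, weights):
    """Weighted parameter-space average of expert models (a linear merge)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "mixture weights should sum to 1"
    merged = copy.deepcopy(experts[0])
    merged_sd = merged.state_dict()
    with torch.no_grad():
        for name, param in merged_sd.items():
            stacked = torch.stack([e.state_dict()[name].float() for e in experts])
            w = torch.tensor(weights).view(-1, *([1] * (stacked.dim() - 1)))
            merged_sd[name] = (w * stacked).sum(0).to(param.dtype)
    merged.load_state_dict(merged_sd)
    return merged

# Toy "experts": in practice these would be domain-specific fine-tunes of one MLLM.
experts = [torch.nn.Linear(4, 2) for _ in range(3)]
proxy = merge_experts(experts, [0.5, 0.2, 0.3])
# score = evaluate(proxy, benchmarks)   # hypothetical: rank proxies to pick the mixture
print(proxy.weight[0])
```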
[389] Transolver-3: Scaling Up Transformer Solvers to Industrial-Scale Geometries
Hang Zhou, Haixu Wu, Haonan Shangguan, Yuezhou Ma, Huikun Weng, Jianmin Wang, Mingsheng Long
Main category: cs.LG
TL;DR: Transolver-3 is a scalable neural PDE solver framework that handles industrial-scale meshes with over 160 million cells through architectural optimizations and training strategies.
Details
Motivation: Current neural PDE solvers struggle with industrial-scale geometries requiring over 10^8 cells due to prohibitive memory complexity from high-resolution meshes, creating a gap between GPU capacity and resolution requirements for complex engineering tasks.Method: Introduces two key architectural optimizations: 1) faster slice and deslice using matrix multiplication associative property, and 2) geometry slice tiling to partition physical state computation. Combines these with amortized training on random subsets of high-resolution meshes and physical state caching during inference.
Result: Transolver-3 successfully handles meshes with over 160 million cells and demonstrates impressive performance across three challenging simulation benchmarks including aircraft and automotive design tasks.
Conclusion: The framework enables high-fidelity field prediction on industrial-scale meshes, bridging the gap between limited GPU capacity and resolution requirements for complex engineering simulations.
Abstract: Deep learning has emerged as a transformative tool for the neural surrogate modeling of partial differential equations (PDEs), known as neural PDE solvers. However, scaling these solvers to industrial-scale geometries with over $10^8$ cells remains a fundamental challenge due to the prohibitive memory complexity of processing high-resolution meshes. We present Transolver-3, a new member of the Transolver family as a highly scalable framework designed for high-fidelity physics simulations. To bridge the gap between limited GPU capacity and the resolution requirements of complex engineering tasks, we introduce two key architectural optimizations: faster slice and deslice by exploiting matrix multiplication associative property and geometry slice tiling to partition the computation of physical states. Combined with an amortized training strategy by learning on random subsets of original high-resolution meshes and a physical state caching technique during inference, Transolver-3 enables high-fidelity field prediction on industrial-scale meshes. Extensive experiments demonstrate that Transolver-3 is capable of handling meshes with over 160 million cells, achieving impressive performance across three challenging simulation benchmarks, including aircraft and automotive design tasks.
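The "faster slice and deslice" optimization rests on a standard fact: reassociating a chain of matrix products can avoid materializing an intermediate the size of the mesh. A generic illustration with made-up shapes, not Transolver-3's actual operators:

```python
# (A @ B) @ C materializes an N x N intermediate; A @ (B @ C) never does.
import time
import torch

N, M, K = 10_000, 64, 128            # think: mesh cells, slices, channels
A = torch.randn(N, M)
B = torch.randn(M, N)
C = torch.randn(N, K)

t0 = time.perf_counter()
slow = (A @ B) @ C                   # builds a 10000 x 10000 intermediate (~400 MB)
t1 = time.perf_counter()
fast = A @ (B @ C)                   # intermediate is only M x K
t2 = time.perf_counter()

rel_err = (slow - fast).abs().max() / slow.abs().max()
print(f"same result (rel err {rel_err:.1e}); naive {t1 - t0:.3f}s vs reassociated {t2 - t1:.3f}s")
```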
[390] Improving Set Function Approximation with Quasi-Arithmetic Neural Networks
Tomas Tokar, Scott Sanner
Main category: cs.LG
TL;DR: QUANNs introduce Neuralized Kolmogorov Mean as a learnable aggregation function for sets, outperforming fixed pooling methods and enabling better transfer learning.
Details
Motivation: Current set models use fixed, non-learnable pooling operations (sum/max) which limit expressivity and transferability of learned embeddings. There's a need for more expressive, learnable aggregation functions for set-structured data.Method: Proposes Neuralized Kolmogorov Mean (NKM) - a trainable framework for learning generalized central tendency measures through invertible neural functions. Incorporates NKM into quasi-arithmetic neural networks (QUANNs) as learnable aggregation functions.
Result: QUANNs outperform state-of-the-art baselines across diverse benchmarks. They learn more structured latent representations that transfer effectively even to non-set tasks. Theoretical analysis shows QUANNs are universal approximators for a broad class of set-function decompositions.
Conclusion: QUANNs with NKM provide superior learnable aggregation for sets, enabling better expressivity and transfer learning compared to fixed pooling methods like DeepSets and PointNet.
Abstract: Sets represent a fundamental abstraction across many types of data. To handle the unordered nature of set-structured data, models such as DeepSets and PointNet rely on fixed, non-learnable pooling operations (e.g., sum or max) – a design choice that can hinder the transferability of learned embeddings and limit model expressivity. More recently, learnable aggregation functions have been proposed as more expressive alternatives. In this work, we advance this line of research by introducing the Neuralized Kolmogorov Mean (NKM) – a novel, trainable framework for learning a generalized measure of central tendency through an invertible neural function. We further propose quasi-arithmetic neural networks (QUANNs), which incorporate the NKM as a learnable aggregation function. We provide a theoretical analysis showing that QUANNs are universal approximators for a broad class of common set-function decompositions and, thanks to their invertible neural components, learn more structured latent representations. Empirically, QUANNs outperform state-of-the-art baselines across diverse benchmarks, while learning embeddings that transfer effectively even to tasks that do not involve sets.
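The Kolmogorov (quasi-arithmetic) mean is M_f(x_1, ..., x_n) = f^{-1}((1/n) * sum_i f(x_i)), and the NKM learns f as an invertible neural network. As a minimal learnable stand-in, the sketch below uses a power mean with a trainable exponent instead of a neural f:

```python
# Quasi-arithmetic (Kolmogorov) mean: M_f(x) = f^{-1}(mean(f(x_i))).
# The paper learns f as an invertible neural network; here f(x) = x**p with a
# trainable exponent p is the simplest learnable stand-in (positive inputs assumed).
import torch
import torch.nn as nn

class PowerMeanPool(nn.Module):
    """Learnable quasi-arithmetic pooling over the set dimension (dim=1)."""
    def __init__(self, p_init=1.0):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p_init))

    def forward(self, x):                   # x: (batch, set_size, features), x > 0
        return x.pow(self.p).mean(dim=1).pow(1.0 / self.p)

x = torch.rand(4, 10, 8) + 0.1
pool = PowerMeanPool()
print(torch.allclose(pool(x), x.mean(1), atol=1e-5))   # p = 1 recovers the arithmetic mean
with torch.no_grad():
    pool.p.fill_(8.0)
print((pool(x) - x.amax(1)).abs().max())               # large p approaches the max pooling
```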
[391] Privileged Information Distillation for Language Models
Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia
Main category: cs.LG
TL;DR: π-Distill and OPSD distill frontier agents from action-only trajectories, transferring capabilities learned with training-time privileged information to policies that must act without it at inference
Details
Motivation: Training-time privileged information can help language models succeed on complex tasks, but transferring these capabilities to policies that must act without PI at inference remains challenging, especially when closed-source systems only expose action trajectories without internal reasoning.Method: Two approaches: π-Distill (joint teacher-student objective training PI-conditioned teacher and unconditioned student simultaneously) and On-Policy Self-Distillation (OPSD) using RL with reverse KL-penalty between student and PI-conditioned teacher.
Result: Both algorithms effectively distill frontier agents using action-only PI, outperforming standard practices (supervised finetuning + RL) that assume full Chain-of-Thought supervision across multiple agentic benchmarks, models, and PI forms.
Conclusion: π-Distill and OPSD provide effective methods for knowledge distillation when only action trajectories are observable, enabling transfer of capabilities learned with privileged information to policies that must operate without it at inference.
Abstract: Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, where closed-source systems typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. Additionally, we also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically we find that π-Distill and in some cases OPSD, outperform industry standard practices (Supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.
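The OPSD penalty, a reverse KL between the student's next-token distribution and that of the PI-conditioned teacher, is easy to write down. The sketch below shows only this penalty term, not the full RL objective:

```python
# Minimal sketch of a reverse-KL penalty between a student policy and a
# PI-conditioned teacher (illustrative; not the paper's exact objective).
import torch
import torch.nn.functional as F

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) per token, averaged: penalizes the student for
    putting probability where the PI-conditioned teacher puts very little."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean()

# Same model produces both: the teacher sees privileged info in its context,
# the student does not (shapes: batch, sequence, vocab).
student_logits = torch.randn(2, 32, 32_000, requires_grad=True)
teacher_logits = torch.randn(2, 32, 32_000)
penalty = reverse_kl(student_logits, teacher_logits)
penalty.backward()       # in OPSD-style training this term offsets the task reward
print(penalty.item())
```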
[392] Stochastic hierarchical data-driven optimization: application to plasma-surface kinetics
José Afonso, Vasco Guerra, Pedro Viegas
Main category: cs.LG
TL;DR: A stochastic hierarchical optimization framework using reduced Hessian approximation for efficient calibration of physical models, validated on plasma-surface interaction problems.
Details
Motivation: Physical model calibration is computationally expensive due to high-dimensional parameter spaces and complex simulations. Current optimization methods struggle with anisotropic landscapes and require excessive simulation queries, especially for complex systems like plasma-surface interactions.Method: Proposes a stochastic hierarchical optimization framework inspired by Sloppy Model theory. Uses reduced Hessian approximation to identify stiff parameter subspaces with minimal simulation queries. Integrates with probabilistic formulation to derive principled objective loss functions from observed data.
Result: Method consistently outperforms baseline optimization techniques in sample efficiency. Successfully applied to plasma-surface interaction problems where accurate modeling is limited by uncertainties in surface reactivity parameters and computational costs.
Conclusion: Provides a general and scalable tool for optimizing complex reaction system models, applicable from plasma chemistry to biochemical networks, offering efficient navigation of anisotropic parameter landscapes.
Abstract: This work introduces a stochastic hierarchical optimization framework inspired by Sloppy Model theory for the efficient calibration of physical models. Central to this method is the use of a reduced Hessian approximation, which identifies and targets the stiff parameter subspace using minimal simulation queries. This strategy enables efficient navigation of highly anisotropic landscapes, avoiding the computational burden of exhaustive sampling. To ensure rigorous inference, we integrate this approach with a probabilistic formulation that derives a principled objective loss function directly from observed data. We validate the framework by applying it to the problem of plasma-surface interactions, where accurate modelling is strictly limited by uncertainties in surface reactivity parameters and the computational cost of kinetic simulations. Comparative analysis demonstrates that our method consistently outperforms baseline optimization techniques in sample efficiency. This approach offers a general and scalable tool for optimizing models of complex reaction systems, ranging from plasma chemistry to biochemical networks.
[393] Near-Optimal Dynamic Matching via Coarsening with Application to Heart Transplantation
Itai Zilberstein, Ioannis Anagnostides, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm
Main category: cs.LG
TL;DR: Online matching algorithms using coarsening/aggregation approach achieve near-optimal theoretical guarantees for applications like organ allocation, bridging data-driven heuristics with theoretical bounds.
Details
Motivation: Practical online matching algorithms in domains like Internet advertising and organ allocation often lack strong theoretical guarantees, creating a gap between data-driven heuristics and theoretical lower bounds.Method: Develop new online matching algorithms based on coarsening approach that aggregates offline nodes into capacitated clusters, then apply methodology to heart transplant allocation using structural properties of historical data.
Result: The coarsening approach yields near-optimal theoretical guarantees, and in realistic simulations for heart transplant allocation, the policy closely matches the performance of the omniscient benchmark.
Conclusion: The work bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds, providing rigorous justification for prior clustering-based approaches in organ allocation.
Abstract: Online matching has been a mainstay in domains such as Internet advertising and organ allocation, but practical algorithms often lack strong theoretical guarantees. We take an important step toward addressing this by developing new online matching algorithms based on a coarsening approach. Although coarsening typically implies a loss of granularity, we show that, to the contrary, aggregating offline nodes into capacitated clusters can yield near-optimal theoretical guarantees. We apply our methodology to heart transplant allocation to develop theoretically grounded policies based on structural properties of historical data. In realistic simulations, our policy closely matches the performance of the omniscient benchmark. Our work bridges the gap between data-driven heuristics and pessimistic theoretical lower bounds, and provides rigorous justification for prior clustering-based approaches in organ allocation.
[394] Position: Machine Learning for Heart Transplant Allocation Policy Optimization Should Account for Incentives
Ioannis Anagnostides, Itai Zilberstein, Zachary W. Sollie, Arman Kilic, Tuomas Sandholm
Main category: cs.LG
TL;DR: Position paper arguing that organ allocation algorithms must consider incentive structures and strategic behavior of stakeholders, not just treat it as a static optimization problem.
Details
Motivation: Current organ allocation approaches overlook incentive misalignments among transplant centers, clinicians, and regulators, treating allocation as a static optimization problem rather than a complex game with strategic behavior.Method: Position paper analyzing US adult heart transplant allocation, identifying critical incentive misalignments across the decision-making pipeline using data analysis, and proposing integration of mechanism design, strategic classification, causal inference, and social choice methods.
Result: Identifies adverse consequences from current incentive misalignments and argues that next-generation allocation policies must be incentive-aware to ensure robustness, efficiency, and fairness.
Conclusion: Organ allocation should be treated as a game-theoretic problem requiring incentive-aware algorithms that integrate mechanism design and strategic behavior considerations, with a research agenda calling for ML community involvement.
Abstract: The allocation of scarce donor organs constitutes one of the most consequential algorithmic challenges in healthcare. While the field is rapidly transitioning from rigid, rule-based systems to machine learning and data-driven optimization, we argue that current approaches often overlook a fundamental barrier: incentives. In this position paper, we highlight that organ allocation is not merely a static optimization problem, but rather a complex game involving transplant centers, clinicians, and regulators. Focusing on US adult heart transplant allocation, we identify critical incentive misalignments across the decision-making pipeline, and present data showing that they are having adverse consequences today. Our main position is that the next generation of allocation policies should be incentive aware. We outline a research agenda for the machine learning community, calling for the integration of mechanism design, strategic classification, causal inference, and social choice to ensure robustness, efficiency, and fairness in the face of strategic behavior from the various constituent groups.
[395] Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning
Yu-Ang Lee, Ching-Yun Ko, Pin-Yu Chen, Mi-Yen Yeh
Main category: cs.LG
TL;DR: Systematic hyperparameter tuning reveals that once learning rates are properly optimized, various LoRA variants perform similarly to vanilla LoRA, suggesting reported improvements may be due to configuration differences rather than methodological advantages.
Details
Motivation: Recent studies have proposed alternative LoRA initialization strategies and architectural modifications claiming substantial improvements over vanilla LoRA, but these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings despite known neural network sensitivity to training configurations.Method: Systematically re-evaluated four representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches across mathematical and code generation tasks on diverse model scales, with second-order analysis examining Hessian eigenvalues.
Result: Different LoRA methods favor distinct learning rate ranges, but once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. Second-order analysis attributes differing optimal learning rate ranges to variations in the largest Hessian eigenvalue.
Conclusion: Vanilla LoRA remains a competitive baseline, and improvements reported under a single training configuration may not reflect consistent methodological advantages. Proper hyperparameter tuning is crucial for fair comparison of LoRA variants.
Abstract: Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model (LLM) fine-tuning. Building on this paradigm, recent studies have proposed alternative initialization strategies and architectural modifications, reporting substantial improvements over vanilla LoRA. However, these gains are often demonstrated under fixed or narrowly tuned hyperparameter settings, despite the known sensitivity of neural networks to training configurations. In this work, we systematically re-evaluate four representative LoRA variants alongside vanilla LoRA through extensive hyperparameter searches. Across mathematical and code generation tasks on diverse model scales, we find that different LoRA methods favor distinct learning rate ranges. Crucially, once learning rates are properly tuned, all methods achieve similar peak performance (within 1-2%), with only subtle rank-dependent behaviors. These results suggest that vanilla LoRA remains a competitive baseline and that improvements reported under single training configuration may not reflect consistent methodological advantages. Finally, a second-order analysis attributes the differing optimal learning rate ranges to variations in the largest Hessian eigenvalue, aligning with classical learning theories.
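The second-order claim, that each variant's usable learning-rate range is governed by the largest Hessian eigenvalue (roughly lr below 2/lambda_max for stability), can be probed with a standard power iteration on Hessian-vector products. The toy model below stands in for a LoRA-adapted LLM:

```python
# Sketch: estimate the largest Hessian eigenvalue of the training loss via
# power iteration on Hessian-vector products (a standard diagnostic; the toy
# model here is a stand-in, not an actual LoRA fine-tune).
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))   # Rayleigh quotient
        v = [hvi.detach() for hvi in hv]
    return eig.item()

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
x, y = torch.randn(128, 16), torch.randn(128, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
lam_max = top_hessian_eigenvalue(loss, model.parameters())
print(f"lambda_max ~ {lam_max:.3f}, rough stable lr < {2 / lam_max:.3f}")
```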
[396] EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models
Atula Tejaswi, Litu Rout, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi
Main category: cs.LG
TL;DR: EntRGi: A novel entropy-aware reward guidance method for discrete diffusion language models that dynamically regulates reward gradients to overcome limitations of existing continuous relaxation and straight-through estimator approaches.
Details
Motivation: Existing reward guidance methods for discrete diffusion language models face fundamental limitations: continuous relaxation approaches degrade gradient feedback because reward models aren't trained on continuous inputs, while straight-through estimators involve incorrect optimization by using gradients evaluated at discrete tokens to update continuous logits.Method: Introduces EntRGi (Entropy-aware Reward Guidance) that dynamically regulates gradients from reward models by modulating continuous relaxation using the model’s confidence. This approach provides reliable inputs to reward models while improving reward guidance.
Result: Empirical validation on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks shows consistent improvements over state-of-the-art methods.
Conclusion: EntRGi successfully addresses the tradeoff between continuous relaxation and straight-through estimator approaches for reward guidance in discrete diffusion language models, providing a more effective mechanism for test-time adaptation.
Abstract: Reward guidance has been applied to great success in the test-time adaptation of continuous diffusion models; it updates each denoising step using the gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the natural outputs of the model because they are discrete tokens. Existing approaches either replace these discrete tokens with continuous relaxations, or employ techniques like the straight-through estimator. In this work, we show the downsides of both these methods. The former degrades gradient feedback because the reward model has never been trained with continuous inputs. The latter involves incorrect optimization because the gradient evaluated at discrete tokens is used to update continuous logits. Our key innovation is to go beyond this tradeoff by introducing a novel mechanism called EntRGi: Entropy aware Reward Guidance that dynamically regulates the gradients from the reward model. By modulating the continuous relaxation using the model’s confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model. We empirically validate our approach on a 7B-parameter diffusion language model across 3 diverse reward models and 3 multi-skill benchmarks, showing consistent improvements over state-of-the-art methods.
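EntRGi's exact modulation rule is not reproduced here, but the general mechanism, scaling how hard the relaxed token inputs to the reward model are by the model's per-position confidence, can be sketched as follows (the blending rule is an assumption for illustration):

```python
# Illustrative sketch of entropy-aware input relaxation for reward guidance:
# confident (low-entropy) positions are pushed toward hard one-hot tokens so the
# reward model sees near-discrete inputs, while uncertain positions keep a soft
# relaxation through which gradients flow. The blending rule below is assumed,
# not EntRGi's published formula.
import torch
import torch.nn.functional as F

def entropy_modulated_relaxation(logits):
    probs = F.softmax(logits, dim=-1)                               # (batch, seq, vocab)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1, keepdim=True)
    confidence = 1.0 - entropy / torch.log(torch.tensor(float(logits.size(-1))))
    hard = F.one_hot(probs.argmax(-1), logits.size(-1)).float()
    hard_st = hard + probs - probs.detach()                          # straight-through hard part
    return confidence * hard_st + (1.0 - confidence) * probs

logits = torch.randn(2, 16, 1000, requires_grad=True)
relaxed = entropy_modulated_relaxation(logits)     # would be fed to the reward model
reward = relaxed.sum()                             # stand-in differentiable reward
reward.backward()
print(logits.grad.abs().mean())
```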
[397] Enhanced QKNorm normalization for neural transformers with the Lp norm
Ezequiel Lopez-Rubio, Javier Montes-Perez, Esteban Jose Palomo
Main category: cs.LG
TL;DR: Proposes a generalization of QKNorm using Lp norms for Transformer query-key normalization, allowing non-Euclidean norms.
Details
Motivation: Normalization of query and key vectors is crucial for stable learning in Transformers, but existing approaches are limited. The paper aims to generalize QKNorm to support non-Euclidean norms for potentially better performance.Method: Extends QKNorm normalization scheme using Lp norm generalization, allowing flexible p-values beyond standard Euclidean (L2) norm. This enables exploration of non-Euclidean normalization approaches for Transformer attention mechanisms.
Result: Experimental results on a simple problem demonstrate the suitability of the proposed method, showing it works effectively with various Lp norms.
Conclusion: The Lp norm generalization of QKNorm provides a flexible normalization approach for Transformers, enabling exploration of non-Euclidean norms with promising initial results.
Abstract: The normalization of query and key vectors is an essential part of the Transformer architecture. It ensures that learning is stable regardless of the scale of these vectors. Some normalization approaches are available. In this preliminary work, a generalization of the QKNorm normalization scheme is proposed. The approach is based on the Lp norm, allowing non-Euclidean norms to be employed. Experimental results demonstrate the suitability of the method for a simple problem.
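Since the change is essentially a one-line generalization, replacing the Euclidean normalization of queries and keys with an Lp norm, a sketch is easy to give. The learnable scale g stands in for QKNorm's trained scalar; the values are arbitrary:

```python
# Sketch of the Lp generalization of QKNorm: queries and keys are normalized by
# their Lp norm (p = 2 recovers the usual QKNorm) before softmax attention.
import torch
import torch.nn.functional as F

def lp_qknorm_attention(q, k, v, p=2.0, g=10.0, eps=1e-6):
    # q, k, v: (batch, heads, seq, head_dim)
    q = q / (q.norm(p=p, dim=-1, keepdim=True) + eps)
    k = k / (k.norm(p=p, dim=-1, keepdim=True) + eps)
    attn = F.softmax(g * (q @ k.transpose(-2, -1)), dim=-1)
    return attn @ v

q = torch.randn(2, 4, 32, 64)
k = torch.randn(2, 4, 32, 64)
v = torch.randn(2, 4, 32, 64)
out_l2 = lp_qknorm_attention(q, k, v, p=2.0)   # standard QKNorm
out_l3 = lp_qknorm_attention(q, k, v, p=3.0)   # a non-Euclidean variant
print(out_l2.shape, (out_l2 - out_l3).abs().mean())
```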
[398] Private PoEtry: Private In-Context Learning via Product of Experts
Rob Romijnders, Mohammad Mahdi Derakhshani, Jonathan Petit, Max Welling, Christos Louizos, Yuki M. Asano
Main category: cs.LG
TL;DR: A new differential privacy framework for in-context learning that improves accuracy by 30+ percentage points over prior methods while maintaining strong privacy guarantees.
Details
Motivation: In-context learning enables LLMs to adapt to new tasks without fine-tuning, but in-context examples may contain privacy-sensitive information. Existing DP approaches to ICL are either computationally expensive or rely on limited heuristics.Method: Reformulates private ICL through the lens of a Product-of-Experts model, providing a theoretically grounded framework that can be trivially parallelized.
Result: Improves accuracy by more than 30 percentage points on average compared to prior DP-ICL methods across five datasets in text classification, math, and vision-language tasks.
Conclusion: The proposed Product-of-Experts framework provides an effective and efficient solution for private in-context learning with strong privacy guarantees and significantly improved accuracy.
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks with only a small set of examples at inference time, thereby avoiding task-specific fine-tuning. However, in-context examples may contain privacy-sensitive information that should not be revealed through model outputs. Existing differential privacy (DP) approaches to ICL are either computationally expensive or rely on heuristics with limited effectiveness, including context oversampling, synthetic data generation, or unnecessary thresholding. We reformulate private ICL through the lens of a Product-of-Experts model. This gives a theoretically grounded framework, and the algorithm can be trivially parallelized. We evaluate our method across five datasets in text classification, math, and vision-language. We find that our method improves accuracy by more than 30 percentage points on average compared to prior DP-ICL methods, while maintaining strong privacy guarantees.
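A heavily simplified sketch of the product-of-experts idea: each disjoint shard of private in-context examples yields an expert next-token distribution, and their log-probabilities are averaged and noised before sampling. This is an illustration only; the paper's actual aggregation and privacy accounting are not reproduced.

```python
# Heavily simplified Product-of-Experts aggregation for private ICL (illustrative;
# the paper's mechanism and its DP accounting are not reproduced here).
import torch
import torch.nn.functional as F

def private_poe_next_token(expert_logits, noise_scale=0.5):
    # expert_logits: (n_experts, vocab), one row per shard-conditioned model call
    log_probs = F.log_softmax(expert_logits, dim=-1)
    poe = log_probs.mean(0)                          # product of experts (geometric mean)
    poe = poe + noise_scale * torch.randn_like(poe)  # noise toward a DP-style guarantee
    return torch.distributions.Categorical(logits=poe).sample()

expert_logits = torch.randn(8, 32_000)   # 8 shards of private in-context examples
token = private_poe_next_token(expert_logits)
print(int(token))
```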
[399] A Simple Reduction Scheme for Constrained Contextual Bandits with Adversarial Contexts via Regression
Dhruv Sarkar, Abhishek Sinha
Main category: cs.LG
TL;DR: A constrained contextual bandits framework with adversarial contexts, using online regression oracles to reduce constrained problems to unconstrained ones with surrogate rewards.
Details
Motivation: To address constrained contextual bandits with adversarial contexts, where actions yield random rewards and costs, and the algorithm must operate even after budget exhaustion, controlling both regret and cumulative constraint violation.Method: Builds on SquareCB framework, proposes modular algorithmic scheme leveraging online regression oracles to reduce constrained problems to unconstrained contextual bandits with adaptively defined surrogate reward functions.
Result: Improved guarantees for adversarial context setting compared to prior work focusing on stochastic contexts, with compact and transparent analysis.
Conclusion: The reduction approach provides a simple and effective method for handling constrained contextual bandits with adversarial contexts, offering better theoretical guarantees and clearer analysis than previous methods.
Abstract: We study constrained contextual bandits (CCB) with adversarially chosen contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes. We consider the continuing setting, in which the algorithm operates over the entire horizon even after the budget is exhausted. In this setting, the objective is to simultaneously control regret and cumulative constraint violation. Building on the seminal SquareCB framework of Foster et al. (2018), we propose a simple and modular algorithmic scheme that leverages online regression oracles to reduce the constrained problem to a standard unconstrained contextual bandit problem with adaptively defined surrogate reward functions. In contrast to most prior work on CCB, which focuses on stochastic contexts, our reduction yields improved guarantees for the more general adversarial context setting, together with a compact and transparent analysis.
[400] Laws of Learning Dynamics and the Core of Learners
Inkee Jung, Siu Cheong Lau
Main category: cs.LG
TL;DR: Introduces entropy-based lifelong ensemble learning with conservation laws, applied to defend against adversarial attacks on CIFAR-10 via an immunization mechanism.
Details
Motivation: To develop fundamental laws governing learning dynamics (conservation law and entropy decrease) and apply them to create robust ensemble learning methods that can defend against adversarial attacks.Method: Formulates conservation laws for learning dynamics, introduces entropy-based lifelong ensemble learning, and constructs an immunization mechanism against transfer-based adversarial attacks on CIFAR-10 dataset.
Result: The proposed logifold ensemble achieves higher accuracy than naive averaging of clean and adversarial models, with particularly large gains under strong perturbations.
Conclusion: Entropy-based lifelong ensemble learning with fundamental learning dynamics laws provides effective defense against adversarial attacks, outperforming simple ensemble methods.
Abstract: We formulate the fundamental laws governing learning dynamics, namely the conservation law and the decrease of total entropy. Within this framework, we introduce an entropy-based lifelong ensemble learning method. We evaluate its effectiveness by constructing an immunization mechanism to defend against transfer-based adversarial attacks on the CIFAR-10 dataset. Compared with a naive ensemble formed by simply averaging models specialized on clean and adversarial samples, the resulting logifold achieves higher accuracy in most test cases, with particularly large gains under strong perturbations.
[401] Laplacian Representations for Decision-Time Planning
Dikshant Shehmar, Matthew Schlegel, Matthew E. Taylor, Marlos C. Machado
Main category: cs.LG
TL;DR: ALPS uses Laplacian representations for hierarchical planning in model-based RL, outperforming baselines on offline goal-conditioned tasks.
Details
Motivation: Planning with learned models in RL is challenging, especially for decision-time planning where state representations must support local cost computation while preserving long-horizon structure.Method: Uses Laplacian representation to capture state-space distances at multiple time scales, creating an effective latent space for planning. Introduces ALPS, a hierarchical planning algorithm that decomposes long-horizon problems into subgoals using this representation.
Result: ALPS outperforms commonly used baselines on offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.
Conclusion: Laplacian representations provide effective latent spaces for planning by preserving meaningful distances and mitigating compounding errors over long prediction horizons.
Abstract: Planning with a learned model remains a key challenge in model-based reinforcement learning (RL). In decision-time planning, state representations are critical as they must support local cost computation while preserving long-horizon structure. In this paper, we show that the Laplacian representation provides an effective latent space for planning by capturing state-space distances at multiple time scales. This representation preserves meaningful distances and naturally decomposes long-horizon problems into subgoals, also mitigating the compounding errors that arise over long prediction horizons. Building on these properties, we introduce ALPS, a hierarchical planning algorithm, and demonstrate that it outperforms commonly used baselines on a selection of offline goal-conditioned RL tasks from OGBench, a benchmark previously dominated by model-free methods.
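A toy example of the underlying representation: on a small grid-world graph, the low-order eigenvectors of the graph Laplacian embed states so that latent Euclidean distance tracks multi-step connectivity, which is what makes subgoal selection convenient. ALPS itself is not reproduced here.

```python
# Laplacian representation on a toy grid-world graph: eigenvectors of the graph
# Laplacian give a latent space where distances reflect connectivity (illustrative).
import numpy as np

n = 8                                                # 8 x 8 grid world
idx = lambda r, c: r * n + c
A = np.zeros((n * n, n * n))
for r in range(n):
    for c in range(n):
        for dr, dc in [(0, 1), (1, 0)]:              # connect right and down neighbors
            if r + dr < n and c + dc < n:
                A[idx(r, c), idx(r + dr, c + dc)] = 1.0
                A[idx(r + dr, c + dc), idx(r, c)] = 1.0

L = np.diag(A.sum(axis=1)) - A                       # combinatorial graph Laplacian
_, eigvecs = np.linalg.eigh(L)
phi = eigvecs[:, 1:9]                                # drop the constant eigenvector, keep 8 dims

start, goal = idx(0, 0), idx(7, 7)
d_goal = np.linalg.norm(phi - phi[goal], axis=1)
d_start = np.linalg.norm(phi - phi[start], axis=1)
print("latent distance start -> goal:", d_goal[start])
print("balanced subgoal state:", np.argmin(np.maximum(d_goal, d_start)))
```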
[402] Causal Representation Meets Stochastic Modeling under Generic Geometry
Jiaxu Ren, Yixin Wang, Biwei Huang
Main category: cs.LG
TL;DR: Identifiable causal representation learning for continuous-time latent stochastic point processes using geometry-based identifiability analysis and MUTATE framework with time-adaptive transitions.
Details
Motivation: Learning causal representations from observations is crucial for scientific discovery in fields like climate science, biology, and physics. Previous work focuses on i.i.d. or discrete-time processes, but many real-world settings require identifying latent variables that are continuous-time stochastic processes (e.g., multivariate point processes).Method: Develops identifiable causal representation learning for continuous-time latent stochastic point processes by analyzing geometry of parameter space for identifiability. Creates MUTATE, an identifiable variational autoencoder framework with time-adaptive transition module to infer stochastic dynamics.
Result: MUTATE effectively answers scientific questions in simulated and empirical studies, including accumulation of mutations in genomics and mechanisms driving neuron spike triggers in response to time-varying dynamics.
Conclusion: The paper presents an identifiable framework for learning causal representations from continuous-time stochastic processes, enabling scientific discovery in domains with complex temporal dynamics.
Abstract: Learning meaningful causal representations from observations has emerged as a crucial task for facilitating machine learning applications and driving scientific discoveries in fields such as climate science, biology, and physics. This process involves disentangling high-level latent variables and their causal relationships from low-level observations. Previous work in this area that achieves identifiability typically focuses on cases where the observations are either i.i.d. or follow a latent discrete-time process. Nevertheless, many real-world settings require identifying latent variables that are continuous-time stochastic processes (e.g., multivariate point processes). To this end, we develop identifiable causal representation learning for continuous-time latent stochastic point processes. We study its identifiability by analyzing the geometry of the parameter space. Furthermore, we develop MUTATE, an identifiable variational autoencoder framework with a time-adaptive transition module to infer stochastic dynamics. Across simulated and empirical studies, we find that MUTATE can effectively answer scientific questions, such as the accumulation of mutations in genomics and the mechanisms driving neuron spike triggers in response to time-varying dynamics.
[403] Feedback Control for Multi-Objective Graph Self-Supervision
Karish Grover, Theodore Vasiloudis, Han Xie, Sixing Lu, Xiang Song, Christos Faloutsos
Main category: cs.LG
TL;DR: ControlG: A control-theoretic framework for coordinating multiple self-supervised learning objectives on graphs using temporal allocation instead of per-update mixing, addressing objective interference through difficulty estimation and PID-controlled scheduling.
Details
Motivation: Current multi-task graph SSL methods suffer from objective interference when combining different pretext objectives (mutual information, reconstruction, contrastive learning), leading to negative transfer, training instability, and hidden starvation of some objectives due to per-update mixing approaches.Method: ControlG uses control theory to treat multi-objective graph SSL as a temporal allocation problem. It estimates per-objective difficulty and pairwise antagonism, plans target budgets via a Pareto-aware log-hypervolume planner, and schedules optimization using a Proportional-Integral-Derivative (PID) controller.
Result: ControlG consistently outperforms state-of-the-art baselines across 9 datasets while producing an auditable schedule that reveals which objectives drove learning at different stages.
Conclusion: Temporal allocation via control theory provides a principled solution to multi-objective coordination in graph SSL, addressing fundamental issues of objective interference and enabling more stable, effective multi-task learning.
Abstract: Can multi-task self-supervised learning on graphs be coordinated without the usual tug-of-war between objectives? Graph self-supervised learning (SSL) offers a growing toolbox of pretext objectives: mutual information, reconstruction, contrastive learning; yet combining them reliably remains a challenge due to objective interference and training instability. Most multi-pretext pipelines use per-update mixing, forcing every parameter update to be a compromise, leading to three failure modes: Disagreement (conflict-induced negative transfer), Drift (nonstationary objective utility), and Drought (hidden starvation of underserved objectives). We argue that coordination is fundamentally a temporal allocation problem: deciding when each objective receives optimization budget, not merely how to weigh them. We introduce ControlG, a control-theoretic framework that recasts multi-objective graph SSL as feedback-controlled temporal allocation by estimating per-objective difficulty and pairwise antagonism, planning target budgets via a Pareto-aware log-hypervolume planner, and scheduling with a Proportional-Integral-Derivative (PID) controller. Across 9 datasets, ControlG consistently outperforms state-of-the-art baselines, while producing an auditable schedule that reveals which objectives drove learning.
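The scheduling primitive is a textbook PID loop over realized versus target per-objective budgets. A minimal sketch follows; the Pareto-aware planner and the difficulty/antagonism estimators are omitted, and the gains are arbitrary:

```python
# Minimal PID scheduler over per-objective optimization budgets (the control
# primitive; ControlG's planner and estimators are not reproduced here).
import numpy as np

class PIDScheduler:
    def __init__(self, n_obj, kp=0.8, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = np.zeros(n_obj)
        self.prev_err = np.zeros(n_obj)

    def step(self, target_share, realized_share):
        err = target_share - realized_share           # under-served objectives get positive error
        self.integral += err
        deriv = err - self.prev_err
        self.prev_err = err
        signal = self.kp * err + self.ki * self.integral + self.kd * deriv
        return np.exp(signal) / np.exp(signal).sum()  # allocation for the next window

sched = PIDScheduler(n_obj=3)
target = np.array([0.5, 0.3, 0.2])                    # e.g. from a budget planner
counts = np.zeros(3)
for t in range(1, 201):
    probs = sched.step(target, counts / max(t - 1, 1))
    counts[np.random.choice(3, p=probs)] += 1         # give this update to one objective
print(counts / counts.sum())                          # converges toward the target shares
```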
[404] ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation
Songyuan Zhang, Oswin So, H. M. Sabbir Ahmad, Eric Yang Yu, Matthew Cleaveland, Mitchell Black, Chuchu Fan
Main category: cs.LG
TL;DR: ReFORM is an offline RL method using flow policies that enforces support constraints to prevent OOD errors while maintaining policy expressiveness for multimodal distributions.
Details
Motivation: Addresses two key challenges in offline RL: 1) OOD errors when policies leave training distribution, and 2) difficulty representing multimodal optimal policy distributions. Existing methods either overly constrain policy improvement or don't fully prevent OOD issues.Method: Uses flow policies with bounded source distributions. First learns a behavior cloning flow policy to capture action distribution support, then optimizes a reflected flow that generates bounded noise for the BC flow while maintaining support constraints.
Result: Dominates all baselines on 40 challenging OGBench tasks across datasets of varying quality, using constant hyperparameters for all tasks while baselines used hand-tuned hyperparameters.
Conclusion: ReFORM effectively addresses OOD errors in offline RL through support-constrained flow policies, achieving superior performance while maintaining policy expressiveness for multimodal distributions.
Abstract: Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset generated by behavior policies without additional environment interactions. One common challenge that arises in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow while keeping the support, to maximize the performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.
[405] Learning, Solving and Optimizing PDEs with TensorGalerkin: an efficient high-performance Galerkin assembly algorithm
Shizheng Wen, Mingyuan Chi, Tianwei Yu, Ben Moseley, Mike Yan Michelis, Pu Ren, Hao Sun, Siddhartha Mishra
Main category: cs.LG
TL;DR: A unified GPU-accelerated TensorGalerkin framework for variational PDE solving, constrained optimization, and physics-informed learning via tensorized element operations and sparse matrix assembly.
Details
Motivation: To create an efficient, unified framework for solving variational PDEs that can handle numerical solution, constrained optimization, and physics-informed learning tasks with high computational efficiency on GPUs.Method: TensorGalerkin framework based on Galerkin discretization with tensorized element-wise operations in Python Map stage, followed by global reduction via sparse matrix multiplication on mesh sparsity graphs.
Result: Demonstrated significant computational efficiency and accuracy gains over baselines for 2D/3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes across all three applications.
Conclusion: The framework provides a versatile, high-performance solution for variational PDE problems that bridges numerical methods, optimization, and machine learning applications.
Abstract: We present a unified algorithmic framework for the numerical solution, constrained optimization, and physics-informed learning of PDEs with a variational structure. Our framework is based on a Galerkin discretization of the underlying variational forms, and its high efficiency stems from a novel highly-optimized and GPU-compliant TensorGalerkin framework for linear system assembly (stiffness matrices and load vectors). TensorGalerkin operates by tensorizing element-wise operations within a Python-level Map stage and then performs global reduction with a sparse matrix multiplication that performs message passing on the mesh-induced sparsity graph. It can be seamlessly employed downstream as i) a highly-efficient numerical PDEs solver, ii) an end-to-end differentiable framework for PDE-constrained optimization, and iii) a physics-informed operator learning algorithm for PDEs. With multiple benchmarks, including 2D and 3D elliptic, parabolic, and hyperbolic PDEs on unstructured meshes, we demonstrate that the proposed framework provides significant computational efficiency and accuracy gains over a variety of baselines in all the targeted downstream applications.
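A tiny 1-D analogue of the map-then-reduce assembly pattern: local element stiffness blocks are computed for all elements at once (the map), then scattered into the global sparse matrix with duplicate entries summed by the COO constructor, the role played by TensorGalerkin's sparse-matmul reduction on GPUs. This is a plain NumPy/SciPy illustration, not the authors' implementation.

```python
# 1-D P1 stiffness assembly for -u'' = f: vectorized per-element "map" followed
# by a sparse scatter "reduce" (illustrative analogue, not TensorGalerkin itself).
import numpy as np
import scipy.sparse as sp

n_el = 1_000_000
nodes = np.linspace(0.0, 1.0, n_el + 1)
h = np.diff(nodes)                                    # element sizes, shape (n_el,)

# Map: local 2x2 stiffness of every linear element, computed in one shot.
local = (1.0 / h)[:, None, None] * np.array([[1.0, -1.0], [-1.0, 1.0]])

# Reduce: scatter local entries onto global node indices; COO sums duplicates.
conn = np.stack([np.arange(n_el), np.arange(n_el) + 1], axis=1)   # element -> node ids
rows = np.repeat(conn, 2, axis=1).ravel()             # global row of each local entry
cols = np.tile(conn, (1, 2)).ravel()                  # global col of each local entry
K = sp.coo_matrix((local.ravel(), (rows, cols)),
                  shape=(n_el + 1, n_el + 1)).tocsr()

print(K.shape, K.nnz)                                 # tridiagonal global stiffness matrix
```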
[406] Quantile-Physics Hybrid Framework for Safe-Speed Recommendation under Diverse Weather Conditions Leveraging Connected Vehicle and Road Weather Information Systems Data
Wen Zhang, Adel W. Sadek, Chunming Qiao
Main category: cs.LG
TL;DR: A hybrid predictive framework that recommends real-time safe speed intervals for freeways using connected vehicle data and weather information, combining machine learning with physics-based safety constraints.
Details
Motivation: Inclement weather reduces driver visibility and tire-road friction, increasing crash risk. Current speed recommendations often don't adapt to real-time weather conditions, creating safety gaps.Method: Uses high-resolution Connected Vehicle and Road Weather Information System data to create spatiotemporally aligned dataset. Employs Quantile Regression Forests to predict speed distributions, then fuses these with physics-based upper speed limits derived from real-time road grip and visibility constraints.
Result: QRF model achieves MAE of 1.55 mph, with 96.43% of median speed predictions within 5 mph error, and PICP(50%) of 48.55%. Model generalizes well across weather types and road segments.
Conclusion: The hybrid framework effectively recommends safe speed intervals that adapt to changing weather conditions, showing promise for real-world deployment to improve traffic safety and reduce weather-related crashes.
Abstract: Inclement weather conditions can significantly impact driver visibility and tire-road surface friction, requiring adjusted safe driving speeds to reduce crash risk. This study proposes a hybrid predictive framework that recommends real-time safe speed intervals for freeway travel under diverse weather conditions. Leveraging high-resolution Connected Vehicle (CV) data and Road Weather Information System (RWIS) data collected in Buffalo, NY, from 2022 to 2023, we construct a spatiotemporally aligned dataset containing over 6.6 million records across 73 days. The core model employs Quantile Regression Forests (QRF) to estimate vehicle speed distributions in 10-minute windows, using 26 input features that capture meteorological, pavement, and temporal conditions. To enforce safety constraints, a physics-based upper speed limit is computed for each interval based on real-time road grip and visibility, ensuring that vehicles can safely stop within their sight distance. The final recommended interval fuses QRF-predicted quantiles with both posted speed limits and the physics-derived upper bound. Experimental results demonstrate strong predictive performance: the QRF model achieves a mean absolute error of 1.55 mph, with 96.43% of median speed predictions within 5 mph, a PICP (50%) of 48.55%, and robust generalization across weather types. The model’s ability to respond to changing weather conditions and generalize across road segments shows promise for real-world deployment, thereby improving traffic safety and reducing weather-related crashes.
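The physics side of the hybrid is a stopping-sight-distance constraint: the vehicle must be able to stop within the visible distance at the current grip, i.e. v*t_react + v^2/(2*g*mu) <= visibility. Below is a sketch of that cap and of clipping a QRF interval against it; the constants are illustrative, not the paper's calibration:

```python
# Sketch of the physics-based speed cap and interval clipping (illustrative
# constants; not the paper's calibrated parameters).
import numpy as np

def physics_speed_cap_mph(grip_mu, visibility_m, t_react_s=2.5, g=9.81):
    """Maximum speed (mph) at which reaction + braking distance fits within visibility."""
    a = g * grip_mu
    v_ms = a * (-t_react_s + np.sqrt(t_react_s**2 + 2.0 * visibility_m / a))
    return v_ms * 2.23694                       # m/s -> mph

def recommend_interval(qrf_q25_mph, qrf_q75_mph, grip_mu, visibility_m, posted_mph=65):
    cap = min(physics_speed_cap_mph(grip_mu, visibility_m), posted_mph)
    return min(qrf_q25_mph, cap), min(qrf_q75_mph, cap)

# Clear day vs. low-grip, low-visibility conditions on the same segment.
print(recommend_interval(58, 66, grip_mu=0.8, visibility_m=300))
print(recommend_interval(58, 66, grip_mu=0.3, visibility_m=90))
```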
[407] StagePilot: A Deep Reinforcement Learning Agent for Stage-Controlled Cybergrooming Simulation
Heajun An, Qi Zhang, Minqian Liu, Xinyi Zhang, Sang Won Lee, Lifu Huang, Pamela J. Wisniewski, Jin-Hee Cho
Main category: cs.LG
TL;DR: StagePilot: An offline RL-based dialogue agent that simulates stage-wise cybergrooming progression for youth prevention training, using composite rewards and constrained stage transitions.
Details
Motivation: Cybergrooming poses a serious threat to youth, requiring proactive educational interventions. Current prevention methods lack realistic simulation of grooming progression stages, limiting training effectiveness.
Method: Offline reinforcement learning agent with stage-wise progression modeling. Uses composite reward balancing user sentiment and goal proximity, with transitions constrained to adjacent stages for realism. Evaluated through LLM-based simulations measuring stage completion, dialogue efficiency, and emotional engagement.
Result: StagePilot generates realistic and coherent conversations aligned with grooming dynamics. IQL+AWAC agent achieves best balance between strategic planning and emotional coherence, reaching final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.
Conclusion: StagePilot effectively simulates grooming progression for prevention training, with the IQL+AWAC agent demonstrating optimal performance in balancing strategic progression with emotional realism.
Abstract: Cybergrooming is an evolving threat to youth, necessitating proactive educational interventions. We propose StagePilot, an offline RL-based dialogue agent that simulates the stage-wise progression of grooming behaviors for prevention training. StagePilot selects conversational stages using a composite reward that balances user sentiment and goal proximity, with transitions constrained to adjacent stages for realism and interpretability. We evaluate StagePilot through LLM-based simulations, measuring stage completion, dialogue efficiency, and emotional engagement. Results show that StagePilot generates realistic and coherent conversations aligned with grooming dynamics. Among tested methods, the IQL+AWAC agent achieves the best balance between strategic planning and emotional coherence, reaching the final stage up to 43% more frequently than baselines while maintaining over 70% sentiment alignment.
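Editor's note: the summary names two concrete mechanisms, a composite reward balancing sentiment and goal proximity, and transitions restricted to adjacent stages. A minimal sketch of just those two pieces follows; the weights and stage indexing are hypothetical, not the paper's.

```python
def valid_next_stages(current_stage: int, n_stages: int) -> list[int]:
    """Transitions are constrained to adjacent stages (or staying put)."""
    return [s for s in (current_stage - 1, current_stage, current_stage + 1)
            if 0 <= s < n_stages]

def composite_reward(sentiment: float, goal_proximity: float,
                     w_sent: float = 0.5, w_goal: float = 0.5) -> float:
    """Balance user sentiment (e.g. in [-1, 1]) against progress toward the final stage."""
    return w_sent * sentiment + w_goal * goal_proximity

# Example: at stage 2 of 5, mildly positive sentiment, halfway to the goal
print(valid_next_stages(2, 5))     # [1, 2, 3]
print(composite_reward(0.3, 0.5))  # 0.4
```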
[408] Does SGD Seek Flatness or Sharpness? An Exactly Solvable Model
Yizhou Xu, Pierfrancesco Beneventano, Isaac Chuang, Liu Ziyin
Main category: cs.LG
TL;DR: SGD’s flatness-seeking behavior is clarified through an analytically solvable model showing it prefers minimal gradient fluctuations, not flatness per se, with data distribution determining sharpness at convergence.
Details
Motivation: To resolve conflicting evidence about when SGD prefers flatter vs. sharper solutions during training by developing a causal understanding of flatness-seeking behavior.
Method: Developed an analytically solvable model that exhibits both flattening and sharpening behavior during training, then validated insights with MLP, RNN, and transformer architectures in controlled settings.
Result: SGD has no inherent preference for flatness but prefers minimal gradient fluctuations; data distribution uniquely determines sharpness at convergence; flat minima are preferred only with isotropic label noise across output dimensions.
Conclusion: Flatness-seeking behavior in SGD is not intrinsic but emerges from preference for minimal gradient fluctuations, with data distribution (specifically label noise isotropy) determining whether flat or sharp solutions are preferred.
Abstract: A large body of theory and empirical work hypothesizes a connection between the flatness of a neural network’s loss landscape during training and its performance. However, there have been conceptually opposite pieces of evidence regarding when SGD prefers flatter or sharper solutions during training. In this work, we partially but causally clarify the flatness-seeking behavior of SGD by identifying and exactly solving an analytically solvable model that exhibits both flattening and sharpening behavior during training. In this model, the SGD training has no \textit{a priori} preference for flatness, but only a preference for minimal gradient fluctuations. This leads to the insight that, at least within this model, it is the data distribution that uniquely determines the sharpness at convergence, and that a flat minimum is preferred if and only if the noise in the labels is isotropic across all output dimensions. When the noise in the labels is anisotropic, the model instead prefers sharpness and can converge to an arbitrarily sharp solution, depending on the imbalance in the spectrum of the label noise. We reproduce this key insight in controlled settings with different model architectures such as MLP, RNN, and transformers.
[409] E-Globe: Scalable $ε$-Global Verification of Neural Networks via Tight Upper Bounds and Pattern-Aware Branching
Wenting Li, Saif R. Kazi, Russell Bent, Duo Zhou, Huan Zhang
Main category: cs.LG
TL;DR: A hybrid neural network verifier using branch-and-bound with exact nonlinear programming for tighter robustness bounds and faster verification.
Details
Motivation: Neural networks lack robustness guarantees for safety-critical applications, and current formal verification methods face scalability-completeness trade-offs.
Method: Proposes a hybrid verifier in a branch-and-bound framework with an exact nonlinear program with complementarity constraints (NLP-CC) for upper bounding, warm-started NLP solves, and pattern-aligned strong branching.
Result: Achieves markedly tighter upper bounds than PGD across perturbation radii, fast per-node solves, and substantial end-to-end speedups over MIP-based verification.
Conclusion: The hybrid verifier effectively addresses scalability-completeness trade-off in neural network verification through tighter bounds and faster solving.
Abstract: Neural networks achieve strong empirical performance, but robustness concerns still hinder deployment in safety-critical applications. Formal verification provides robustness guarantees, but current methods face a scalability-completeness trade-off. We propose a hybrid verifier in a branch-and-bound (BaB) framework that efficiently tightens both upper and lower bounds until an $ε$-global optimum is reached or early stopping is triggered. The key is an exact nonlinear program with complementarity constraints (NLP-CC) for upper bounding that preserves the ReLU input-output graph, so any feasible solution yields a valid counterexample and enables rapid pruning of unsafe subproblems. We further accelerate verification with (i) warm-started NLP solves requiring minimal constraint-matrix updates and (ii) pattern-aligned strong branching that prioritizes splits most effective at tightening relaxations. We also provide conditions under which NLP-CC upper bounds are tight. Experiments on MNIST and CIFAR-10 show markedly tighter upper bounds than PGD across perturbation radii spanning up to three orders of magnitude, fast per-node solves in practice, and substantial end-to-end speedups over MIP-based verification, amplified by warm-starting, GPU batching, and pattern-aligned branching.
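Editor's note: for readers unfamiliar with the NLP-CC idea, the standard exact encoding of a ReLU via complementarity constraints (the general form such programs build on; the paper's full verification program is not reproduced here) is:

```latex
% For a pre-activation x and post-activation z, z = ReLU(x) holds exactly iff
\[
  z \ge 0, \qquad z \ge x, \qquad z\,(z - x) = 0 .
\]
% Either z = 0 (and x <= 0) or z = x (and x >= 0), so any point feasible for
% these constraints reproduces the exact ReLU input-output graph, which is why
% a feasible solution of the full program is a genuine candidate counterexample.
```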
[410] Reliable Explanations or Random Noise? A Reliability Metric for XAI
Poushali Sengupta, Sabita Maharjan, Frank Eliassen, Shashi Raj Pandey, Yan Zhang
Main category: cs.LG
TL;DR: ERI is a family of metrics that quantifies explanation stability under four reliability axioms for XAI methods, with formal guarantees and benchmarks revealing widespread reliability failures in popular methods like SHAP and IG.
Details
Motivation: The reliability of explanations from complex ML models remains largely unmeasured, with popular methods like SHAP and IG showing substantial variability under realistic conditions (small input perturbations, correlated representations, minor model updates), undermining trust in XAI systems.
Method: Introduced Explanation Reliability Index (ERI) family of metrics based on four reliability axioms: robustness to input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to distributional shifts. Developed formal guarantees (Lipschitz-type bounds, temporal stability), ERI-T for sequential models, and ERI-Bench benchmark for systematic testing.
Result: Experimental results reveal widespread reliability failures in popular explanation methods, showing explanations can be unstable under realistic deployment conditions. ERI enables principled assessment of explanation reliability.
Conclusion: ERI exposes and quantifies explanation instabilities, supporting more trustworthy XAI systems by enabling systematic assessment of explanation reliability across various deployment conditions.
Abstract: In recent years, explaining decisions made by complex machine learning models has become essential in high-stakes domains such as energy systems, healthcare, finance, and autonomous systems. However, the reliability of these explanations, namely, whether they remain stable and consistent under realistic, non-adversarial changes, remains largely unmeasured. Widely used methods such as SHAP and Integrated Gradients (IG) are well-motivated by axiomatic notions of attribution, yet their explanations can vary substantially even under system-level conditions, including small input perturbations, correlated representations, and minor model updates. Such variability undermines explanation reliability, as reliable explanations should remain consistent across equivalent input representations and small, performance-preserving model changes. We introduce the Explanation Reliability Index (ERI), a family of metrics that quantifies explanation stability under four reliability axioms: robustness to small input perturbations, consistency under feature redundancy, smoothness across model evolution, and resilience to mild distributional shifts. For each axiom, we derive formal guarantees, including Lipschitz-type bounds and temporal stability results. We further propose ERI-T, a dedicated measure of temporal reliability for sequential models, and introduce ERI-Bench, a benchmark designed to systematically stress-test explanation reliability across synthetic and real-world datasets. Experimental results reveal widespread reliability failures in popular explanation methods, showing that explanations can be unstable under realistic deployment conditions. By exposing and quantifying these instabilities, ERI enables principled assessment of explanation reliability and supports more trustworthy explainable AI (XAI) systems.
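Editor's note: the formal ERI definitions are in the paper. As a rough illustration of the first axiom only (robustness to small input perturbations), one could score an explainer by how similar its attributions stay under Gaussian perturbations; the cosine-similarity score, noise scale, and toy linear explainer below are assumptions for illustration, not the paper's metric.

```python
import numpy as np

def perturbation_stability(explain_fn, x: np.ndarray, sigma: float = 0.01,
                           n_samples: int = 20, seed: int = 0) -> float:
    """Average cosine similarity between the explanation of x and explanations
    of small Gaussian perturbations of x. Values near 1 indicate stability.
    explain_fn maps an input vector to an attribution vector (e.g. SHAP or IG)."""
    rng = np.random.default_rng(seed)
    base = explain_fn(x)
    scores = []
    for _ in range(n_samples):
        attr = explain_fn(x + sigma * rng.standard_normal(x.shape))
        cos = np.dot(base, attr) / (np.linalg.norm(base) * np.linalg.norm(attr) + 1e-12)
        scores.append(cos)
    return float(np.mean(scores))

# Toy linear "explainer": attribution = feature value times weight
w = np.array([1.0, -2.0, 0.5])
print(perturbation_stability(lambda x: x * w, np.array([0.3, 1.2, -0.7])))
```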
[411] Individual Fairness In Strategic Classification
Zhiqun Zuo, Mohammad Mahdi Khalili
Main category: cs.LG
TL;DR: Strategic classification fairness analysis showing deterministic thresholds violate individual fairness, proposing randomized classifiers with linear programming for optimal individual fairness.
Details
Motivation: Strategic classification presents fairness challenges when individuals modify features to influence ML decisions. While group fairness has been studied, individual fairness remains underexplored in this setting.
Method: Analyze threshold-based classifiers, prove deterministic thresholds violate individual fairness, investigate randomized classifiers. Introduce conditions for individual fairness with randomized classifiers, formulate optimal individually fair randomized classifier as linear programming problem. Extend approach to group fairness notions.
Result: Experiments on real-world datasets confirm the method effectively mitigates unfairness and improves the fairness-accuracy trade-off.
Conclusion: Randomized classifiers can achieve individual fairness in strategic classification settings where deterministic thresholds fail, with linear programming providing optimal solutions that balance fairness and accuracy.
Abstract: Strategic classification, where individuals modify their features to influence machine learning (ML) decisions, presents critical fairness challenges. While group fairness in this setting has been widely studied, individual fairness remains underexplored. We analyze threshold-based classifiers and prove that deterministic thresholds violate individual fairness. Then, we investigate the possibility of using a randomized classifier to achieve individual fairness. We introduce conditions under which a randomized classifier ensures individual fairness and leverage these conditions to find an optimal and individually fair randomized classifier through a linear programming problem. Additionally, we demonstrate that our approach can be extended to group fairness notions. Experiments on real-world datasets confirm that our method effectively mitigates unfairness and improves the fairness-accuracy trade-off.
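Editor's note: the paper's exact linear program is not given in the abstract. The sketch below only illustrates the general shape of such a program with scipy: acceptance probabilities as decision variables, a linear accuracy objective, and pairwise closeness constraints standing in for individual fairness. The probabilities, counts, similarity pairs, and tolerance are all hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative problem: 3 feature values; P(y=1 | value) and counts are made up.
p_y1 = np.array([0.2, 0.5, 0.8])
counts = np.array([100, 80, 120])
similar_pairs = [(0, 1), (1, 2)]   # pairs that individual fairness ties together
eps = 0.1                          # hypothetical similarity tolerance

# Decision variables: acceptance probability q_i for each feature value.
# Expected accuracy = sum_i n_i * (q_i * p_y1_i + (1 - q_i) * (1 - p_y1_i)),
# which is linear in q; linprog minimizes, so negate the q-coefficients.
c = -(counts * (2 * p_y1 - 1))

# |q_i - q_j| <= eps for each similar pair -> two linear inequalities per pair.
A_ub, b_ub = [], []
for i, j in similar_pairs:
    row = np.zeros(3); row[i], row[j] = 1, -1
    A_ub.append(row);  b_ub.append(eps)
    A_ub.append(-row); b_ub.append(eps)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(0, 1)] * 3)
print("acceptance probabilities:", np.round(res.x, 3))
```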
[412] AutoDiscover: A Reinforcement Learning Recommendation System for the Cold-Start Imbalance Challenge in Active Learning, Powered by Graph-Aware Thompson Sampling
Parsa Vares
Main category: cs.LG
TL;DR: AutoDiscover: An adaptive active learning framework for systematic literature review screening that models literature as a heterogeneous graph and uses a reinforcement learning agent to dynamically manage query strategies, outperforming static approaches.
Details
Motivation: Manual screening for systematic literature reviews is a bottleneck due to growing scientific output, low prevalence of relevant studies, and scarce expert decisions. Traditional active learning systems use fixed query strategies that don't adapt over time and ignore relational structure in scientific literature.
Method: Models literature as a heterogeneous graph capturing relationships among documents, authors, and metadata. Uses a Heterogeneous Graph Attention Network (HAN) to learn node representations, and a Discounted Thompson Sampling (DTS) agent to dynamically manage a portfolio of query strategies in real-time with human-in-the-loop labels.
Result: On the 26-dataset SYNERGY benchmark, AutoDiscover achieves higher screening efficiency than static AL baselines. The agent mitigates cold start by bootstrapping discovery from minimal initial labels where static approaches fail.
Conclusion: AutoDiscover accelerates systematic literature review screening under scarce expert labels and low prevalence of relevant studies. The framework includes TS-Insight, an open-source visual analytics dashboard for interpreting and diagnosing the agent’s decisions.
Abstract: Systematic literature reviews (SLRs) are fundamental to evidence-based research, but manual screening is an increasing bottleneck as scientific output grows. Screening features low prevalence of relevant studies and scarce, costly expert decisions. Traditional active learning (AL) systems help, yet typically rely on fixed query strategies for selecting the next unlabeled documents. These static strategies do not adapt over time and ignore the relational structure of scientific literature networks. This thesis introduces AutoDiscover, a framework that reframes AL as an online decision-making problem driven by an adaptive agent. Literature is modeled as a heterogeneous graph capturing relationships among documents, authors, and metadata. A Heterogeneous Graph Attention Network (HAN) learns node representations, which a Discounted Thompson Sampling (DTS) agent uses to dynamically manage a portfolio of query strategies. With real-time human-in-the-loop labels, the agent balances exploration and exploitation under non-stationary review dynamics, where strategy utility changes over time. On the 26-dataset SYNERGY benchmark, AutoDiscover achieves higher screening efficiency than static AL baselines. Crucially, the agent mitigates cold start by bootstrapping discovery from minimal initial labels where static approaches fail. We also introduce TS-Insight, an open-source visual analytics dashboard to interpret, verify, and diagnose the agent’s decisions. Together, these contributions accelerate SLR screening under scarce expert labels and low prevalence of relevant studies.
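Editor's note: a minimal sketch of the Discounted Thompson Sampling component alone, which keeps discounted Beta pseudo-counts per query strategy so that older relevance feedback decays under non-stationarity. The discount factor and 0/1 reward definition are illustrative, and the graph-based representation learning is omitted.

```python
import numpy as np

class DiscountedThompsonSampling:
    """Bernoulli Thompson sampling over query strategies with discounting."""

    def __init__(self, n_strategies: int, gamma: float = 0.95, seed: int = 0):
        self.alpha = np.ones(n_strategies)   # pseudo-counts of "relevant" hits
        self.beta = np.ones(n_strategies)    # pseudo-counts of misses
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def select(self) -> int:
        samples = self.rng.beta(self.alpha, self.beta)  # one draw per strategy
        return int(np.argmax(samples))

    def update(self, strategy: int, reward: float) -> None:
        # Discount all arms toward the uniform prior, then credit the played one.
        self.alpha = 1.0 + self.gamma * (self.alpha - 1.0)
        self.beta = 1.0 + self.gamma * (self.beta - 1.0)
        self.alpha[strategy] += reward
        self.beta[strategy] += 1.0 - reward

agent = DiscountedThompsonSampling(n_strategies=3)
arm = agent.select()
agent.update(arm, reward=1.0)   # e.g. the queried document was labeled relevant
```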
[413] Unbiased Single-Queried Gradient for Combinatorial Objective
Thanawat Sornwanee
Main category: cs.LG
TL;DR: Proposes a stochastic gradient method for combinatorial optimization problems that requires only a single query of the combinatorial function, encompassing REINFORCE and introducing new gradient estimators.
Details
Motivation: Combinatorial optimization problems often involve optimization over hypercubes corresponding to Bernoulli probability parameters, where exact gradient computation requires multiple function queries, which is computationally expensive.
Method: Develops a stochastic gradient estimator that is unbiased and requires only a single query of the combinatorial function, generalizing REINFORCE through importance sampling and introducing new classes of stochastic gradients.
Result: The proposed method provides efficient gradient estimation for combinatorial optimization with theoretical guarantees of unbiasedness and reduced computational cost compared to exact gradient computation.
Conclusion: The single-query stochastic gradient method offers an efficient alternative to exact gradient computation for combinatorial optimization problems, with REINFORCE as a special case and extensions to new gradient estimators.
Abstract: In a probabilistic reformulation of a combinatorial problem, we often face an optimization over a hypercube, which corresponds to the Bernoulli probability parameter for each binary variable in the primal problem. The combinatorial nature suggests that an exact gradient computation requires multiple queries. We propose a stochastic gradient that is unbiased and requires only a single query of the combinatorial function. This method encompasses the well-established REINFORCE estimator (through importance sampling), as well as a class of new stochastic gradients.
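Editor's note: a worked instance of the single-query idea in the Bernoulli parameterization. One sample x ~ Bern(p) and the score-function (REINFORCE) identity give an unbiased estimate of the gradient of E[f(x)] from a single evaluation of f; the baseline argument below is an illustrative extension, not the paper's full estimator family.

```python
import numpy as np

def single_query_grad(f, p: np.ndarray, rng, baseline: float = 0.0) -> np.ndarray:
    """Unbiased one-sample estimate of d/dp E_{x ~ Bern(p)}[f(x)].

    grad_p log P(x; p) = x/p - (1 - x)/(1 - p), so one query of f suffices.
    """
    x = (rng.random(p.shape) < p).astype(float)   # single combinatorial sample
    score = x / p - (1.0 - x) / (1.0 - p)
    return (f(x) - baseline) * score

# Toy objective: number of ones in the binary vector, so dE/dp_i = 1 for all i.
rng = np.random.default_rng(0)
p = np.full(4, 0.5)
grads = np.mean([single_query_grad(np.sum, p, rng) for _ in range(20000)], axis=0)
print(np.round(grads, 2))   # should be close to [1, 1, 1, 1]
```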
[414] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
William F. Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, Ilias Leontiadis
Main category: cs.LG
TL;DR: RRD is a recursive rubric refinement framework that improves LLM judging and reward modeling by decomposing coarse rubrics into fine-grained criteria and filtering misaligned/redundant ones.
Details
Motivation: Existing rubric generation for LLM judges lacks control, leading to issues like poor coverage, conflated dimensions, misaligned preferences, and redundant criteria, which degrade judge accuracy and produce suboptimal rewards during reinforcement fine-tuning.
Method: RRD uses a recursive decompose-filter cycle: decomposes coarse rubrics into fine-grained discriminative criteria, filters misaligned/redundant rubrics, and employs correlation-aware weighting to prevent over-representation of correlated criteria.
Result: RRD improves preference-judgment accuracy on JudgeBench and PPE benchmarks for GPT-4o and Llama3.1-405B judges, achieving up to +17.7 points on JudgeBench. As reward source for RFT, it boosts rewards by up to 160% for Qwen3-4B and 60% for Llama3.1-8B.
Conclusion: RRD establishes recursive rubric refinement as a scalable and interpretable foundation for LLM judging and reward modeling in open-ended domains, delivering consistent gains across both evaluation and training tasks.
Abstract: Recently, rubrics have been used to guide LLM judges in capturing subjective, nuanced, multi-dimensional human preferences, and have been extended from evaluation to reward signals for reinforcement fine-tuning (RFT). However, rubric generation remains hard to control: rubrics often lack coverage, conflate dimensions, misalign preference direction, and contain redundant or highly correlated criteria, degrading judge accuracy and producing suboptimal rewards during RFT. We propose RRD, a principled framework for rubric refinement built on a recursive decompose-filter cycle. RRD decomposes coarse rubrics into fine-grained, discriminative criteria, expanding coverage while sharpening separation between responses. A complementary filtering mechanism removes misaligned and redundant rubrics, and a correlation-aware weighting scheme prevents over-representing highly correlated criteria, yielding rubric sets that are informative, comprehensive, and non-redundant. Empirically, RRD delivers large, consistent gains across both evaluation and training: it improves preference-judgment accuracy on JudgeBench and PPE for both GPT-4o and Llama3.1-405B judges, achieving top performance in all settings with up to +17.7 points on JudgeBench. When used as the reward source for RFT on WildChat, it yields substantially stronger and more stable learning signals, boosting reward by up to 160% (Qwen3-4B) and 60% (Llama3.1-8B) versus 10-20% for prior rubric baselines, with gains that transfer to HealthBench-Hard and BiGGen Bench. Overall, RRD establishes recursive rubric refinement as a scalable and interpretable foundation for LLM judging and reward modeling in open-ended domains.
[415] SemPipes – Optimizable Semantic Data Operators for Tabular Machine Learning Pipelines
Olga Ovcharenko, Matthias Boehm, Sebastian Schelter
Main category: cs.LG
TL;DR: SemPipes introduces a declarative programming model that uses LLMs to synthesize code for tabular ML pipelines through semantic operators specified in natural language, enabling automatic optimization of data operations.
Details
Motivation: Real-world tabular ML requires complex data preparation pipelines that demand substantial domain expertise and engineering effort. The paper aims to leverage LLMs to support tabular ML through code synthesis, reducing the complexity and expertise needed for pipeline design.
Method: SemPipes introduces semantic operators that specify data transformations in natural language while delegating execution to a runtime system. During training, it synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. The system uses LLM-based code synthesis guided by evolutionary search to automatically optimize data operations in pipelines.
Result: Evaluation across diverse tabular ML tasks shows that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity.
Conclusion: SemPipes demonstrates that LLM-powered semantic operators can effectively support tabular ML through code synthesis, improving performance while reducing the engineering burden of pipeline design.
Abstract: Real-world machine learning on tabular data relies on complex data preparation pipelines for prediction, data integration, augmentation, and debugging. Designing these pipelines requires substantial domain expertise and engineering effort, motivating the question of how large language models (LLMs) can support tabular ML through code synthesis. We introduce SemPipes, a novel declarative programming model that integrates LLM-powered semantic data operators into tabular ML pipelines. Semantic operators specify data transformations in natural language while delegating execution to a runtime system. During training, SemPipes synthesizes custom operator implementations based on data characteristics, operator instructions, and pipeline context. This design enables the automatic optimization of data operations in a pipeline via LLM-based code synthesis guided by evolutionary search. We evaluate SemPipes across diverse tabular ML tasks and show that semantic operators substantially improve end-to-end predictive performance for both expert-designed and agent-generated pipelines, while reducing pipeline complexity. We implement SemPipes in Python and release it at https://github.com/deem-data/sempipes/tree/v1.
[416] Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers
Hao Chen, Jinghui Yuan, Hanmin Zhang
Main category: cs.LG
TL;DR: AdamO: A new optimizer that decouples radial (norm) and tangential (direction) dynamics in AdamW to address the Radial Tug-of-War problem, improving generalization and stability.
Details
Motivation: AdamW's weight decay creates a fundamental conflict called the "Radial Tug-of-War" where gradients push parameter norms to expand capacity while weight decay suppresses norm growth, causing radial oscillations that inject noise into Adam's second-moment estimates and degrade feature learning.
Method: Proposes Orthogonal Dynamics Decoupling: uses SGD-style update for one-dimensional norm control while confining Adam's adaptive preconditioning to the tangential subspace. Incorporates curvature-adaptive radial step sizing and architecture-aware rules/projections for scale-invariant layers and low-dimensional parameters.
Result: Experiments on vision and language tasks show AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
Conclusion: Magnitude and direction should be decoupled in optimizer dynamics, and AdamO provides an effective implementation of this principle that outperforms AdamW.
Abstract: Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms to expand effective capacity while steering directions to learn features, whereas weight decay indiscriminately suppresses norm growth. This push–pull interaction induces radial oscillations, injecting noise into Adam’s second-moment estimates and potentially degrading delicate tangential feature learning. We argue that magnitude and direction play distinct roles and should be decoupled in optimizer dynamics. We propose Orthogonal Dynamics Decoupling and instantiate it as AdamO: an SGD-style update handles the one-dimensional norm control, while Adam’s adaptive preconditioning is confined to the tangential subspace. AdamO further incorporates curvature-adaptive radial step sizing and architecture-aware rules and projections for scale-invariant layers and low-dimensional parameters. Experiments on vision and language tasks show that AdamO improves generalization and stability over AdamW without introducing additional complex constraints.
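Editor's note: a minimal sketch of the decoupling idea for a single weight tensor, under simplifying assumptions: project the gradient onto the radial (norm) direction, update the norm with a plain SGD-plus-decay step, and run Adam-style moments only on the tangential component. The hyperparameters, the simple radial rule, and the omission of the curvature-adaptive and architecture-aware parts are all illustrative choices, not the paper's exact optimizer.

```python
import torch

def decoupled_step(w, grad, exp_avg, exp_avg_sq, step, *, lr=1e-3, radial_lr=1e-3,
                   betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """One illustrative update: SGD on the norm, Adam on the direction."""
    w_flat, g_flat = w.reshape(-1), grad.reshape(-1)
    norm = w_flat.norm() + 1e-12
    unit = w_flat / norm

    # Radial (norm) dynamics: plain SGD with decay on the 1-D norm variable.
    radial_grad = torch.dot(g_flat, unit)
    new_norm = norm - radial_lr * (radial_grad + weight_decay * norm)

    # Tangential (direction) dynamics: Adam moments on the orthogonal complement.
    tangential = g_flat - radial_grad * unit
    exp_avg.mul_(betas[0]).add_(tangential, alpha=1 - betas[0])
    exp_avg_sq.mul_(betas[1]).addcmul_(tangential, tangential, value=1 - betas[1])
    bc1, bc2 = 1 - betas[0] ** step, 1 - betas[1] ** step
    direction = unit - lr * (exp_avg / bc1) / ((exp_avg_sq / bc2).sqrt() + eps)
    direction = direction / (direction.norm() + 1e-12)

    w.copy_((new_norm * direction).reshape(w.shape))

# Example usage on a toy parameter tensor
w = torch.randn(4, 4)
state = dict(exp_avg=torch.zeros(16), exp_avg_sq=torch.zeros(16))
decoupled_step(w, torch.randn(4, 4), state["exp_avg"], state["exp_avg_sq"], step=1)
```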
[417] Adaptive Exploration for Latent-State Bandits
Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang
Main category: cs.LG
TL;DR: State-model-free bandit algorithms that use lagged contexts and coordinated probing to handle hidden time-varying states without explicit state modeling
Details
Motivation: Classical multi-armed bandit algorithms fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection due to unobserved confounders causing biased reward estimates and limited state information.
Method: Introduces a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies to implicitly track latent states and disambiguate state-dependent reward patterns without explicit state modeling.
Result: Empirical results across diverse settings demonstrate superior performance over classical approaches, with adaptive variants combining computational efficiency with robust adaptation to non-stationary rewards
Conclusion: The state-model-free approach effectively handles hidden time-varying states in bandit problems, with practical recommendations provided for algorithm selection in real-world applications
Abstract: The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.
[418] Fairness Under Group-Conditional Prior Probability Shift: Invariance, Drift, and Target-Aware Post-Processing
Amir Asiaee, Kaveh Aryan
Main category: cs.LG
TL;DR: The paper analyzes fairness in machine learning under group-conditional prior probability shift (GPPS), showing that error-rate fairness criteria are invariant while acceptance-rate criteria can drift, and proposes methods to estimate target fairness without labels.
Details
Motivation: Machine learning fairness systems are typically trained on historical data but deployed in shifting environments where label prevalence changes differently across demographic groups, creating challenges for maintaining fairness guarantees.
Method: Theoretical analysis of GPPS, proving structural invariance properties, developing identification results for target-domain metrics without labels using ROC invariance, and proposing TAP-GPPS algorithm for label-free post-processing to achieve demographic parity.
Result: Proved dichotomy: equalized odds is invariant under GPPS while demographic parity can drift unavoidably; showed target metrics are identifiable without target labels; TAP-GPPS achieves target fairness with minimal utility loss in experiments.
Conclusion: Fairness criteria behave differently under distribution shift, with error-rate metrics being robust but acceptance-rate metrics requiring correction; label-free estimation and post-processing can maintain fairness in shifting environments.
Abstract: Machine learning systems are often trained and evaluated for fairness on historical data, yet deployed in environments where conditions have shifted. A particularly common form of shift occurs when the prevalence of positive outcomes changes differently across demographic groups–for example, when disease rates rise faster in one population than another, or when economic conditions affect loan default rates unequally. We study group-conditional prior probability shift (GPPS), where the label prevalence $P(Y=1\mid A=a)$ may change between training and deployment while the feature-generation process $P(X\mid Y,A)$ remains stable. Our analysis yields three main contributions. First, we prove a fundamental dichotomy: fairness criteria based on error rates (equalized odds) are structurally invariant under GPPS, while acceptance-rate criteria (demographic parity) can drift–and we prove this drift is unavoidable for non-trivial classifiers (shift-robust impossibility). Second, we show that target-domain risk and fairness metrics are identifiable without target labels: the invariance of ROC quantities under GPPS enables consistent estimation from source labels and unlabeled target data alone, with finite-sample guarantees. Third, we propose TAP-GPPS, a label-free post-processing algorithm that estimates prevalences from unlabeled data, corrects posteriors, and selects thresholds to satisfy demographic parity in the target domain. Experiments validate our theoretical predictions and demonstrate that TAP-GPPS achieves target fairness with minimal utility loss.
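Editor's note: the label-free prevalence estimation and threshold selection of TAP-GPPS are not reproduced here, but the posterior-correction step under prior probability shift is standard and can be written directly, assuming P(X | Y, A) is unchanged between source and target as in GPPS. The numbers in the example are illustrative.

```python
import numpy as np

def correct_posterior(p_src: np.ndarray, pi_src: float, pi_tgt: float) -> np.ndarray:
    """Adjust P(Y=1 | X, A=a) from source to target prevalence under GPPS.

    Standard prior-shift correction: rescale the class-1 and class-0 posteriors
    by the ratio of target to source prevalences and renormalize.
    """
    num = (pi_tgt / pi_src) * p_src
    den = num + ((1 - pi_tgt) / (1 - pi_src)) * (1 - p_src)
    return num / den

# Example: a score of 0.30 trained where prevalence was 10%, deployed where it is 25%
print(round(float(correct_posterior(np.array(0.30), 0.10, 0.25)), 3))  # ~0.563
```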
[419] TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
Jiyoung Park, Hankyu Jang, Changseok Song, Wookeun Jung
Main category: cs.LG
TL;DR: TIDE is a serving-engine-native framework for LLM inference that integrates online draft adaptation with speculative decoding, using hidden states as training signals and adaptive runtime control to improve throughput.
Details
Motivation: Speculative decoding can accelerate LLM inference but faces practical challenges due to evolving workloads and system constraints; existing approaches lack efficient online adaptation and system-level integration.
Method: TIDE integrates draft adaptation directly into LLM serving engines, reuses target model hidden states as training signals for zero-overhead adaptation, employs adaptive runtime control to activate speculation/training when beneficial, and exploits heterogeneous clusters by mapping inference and training to appropriate GPU classes.
Result: TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals, across diverse real-world workloads.
Conclusion: TIDE demonstrates that serving-engine-native integration of online draft adaptation with speculative decoding can significantly improve LLM inference efficiency in practical deployment scenarios.
Abstract: Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
[420] Cross-talk based multi-task learning for fault classification of physically coupled machine system
Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park
Main category: cs.LG
TL;DR: Multi-task learning with cross-talk architecture improves fault classification by leveraging physical coupling between fault conditions and related physical variables in machine systems.
Details
Motivation: Machine signals naturally embed coupled information about fault conditions and physical variables, but most fault classification studies only use direct fault labels, missing valuable coupled information that could improve classification accuracy.
Method: Proposes a multi-task learning framework with cross-talk architecture that allows controlled information exchange between tasks (fault classification and physical variable prediction) while preventing negative transfer, building on a residual neural dimension reductor model.
Result: The cross-talk architecture consistently outperformed single-task models, multi-class models merging all label combinations, and shared trunk multi-task models across two benchmarks: drone fault dataset and motor compound fault dataset.
Conclusion: Leveraging physical coupling through multi-task learning with cross-talk architecture significantly improves fault classification performance by enabling better feature learning through controlled information exchange between related tasks.
Abstract: Machine systems inherently generate signals in which fault conditions and various physical variables are physically coupled. Although many existing fault classification studies rely solely on direct fault labels, these signals naturally embed additional information shaped by other physically coupled variables. Herein, we leverage this coupling through a multi-task learning (MTL) framework that jointly learns fault conditions and the related physical variables. Among MTL architectures, cross-talk structures have distinct advantages because they allow for controlled information exchange between tasks through the cross-talk layer while preventing negative transfer, in contrast to shared trunk architectures that often mix incompatible features. We build on our previously introduced residual neural dimension reductor model and extend its application to two benchmarks where physical coupling is prominent. The first benchmark is a drone fault dataset, in which machine type and maneuvering direction significantly alter the frequency components of measured signals even under the same nominal condition. By learning fault classification together with these physical attributes, the cross-talk architecture can better classify faults. The second benchmark dataset is the motor compound fault dataset. In this system, each fault component (inner race fault, outer race fault, misalignment, and unbalance) is coupled to the others. For the motor compound fault dataset, we also test classification performance when we use single-channel data or multi-channel data as input to the classifier. Across both benchmarks, our residual neural dimension reductor consistently outperformed single-task models, multi-class models that merge all label combinations, and shared trunk multi-task models.
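Editor's note: the authors' residual neural dimension reductor is not reproduced here; the sketch below only shows the generic cross-talk pattern the abstract contrasts with shared trunks, where each task keeps its own branch and learned projections carry controlled exchange between them. Layer sizes and the two-task setup are illustrative.

```python
import torch
import torch.nn as nn

class CrossTalkLayer(nn.Module):
    """Two task-specific branches with learned cross-talk projections,
    instead of a single shared trunk."""

    def __init__(self, dim: int):
        super().__init__()
        self.branch_a = nn.Linear(dim, dim)   # e.g. fault-classification branch
        self.branch_b = nn.Linear(dim, dim)   # e.g. physical-variable branch
        self.a_to_b = nn.Linear(dim, dim)     # controlled information exchange
        self.b_to_a = nn.Linear(dim, dim)

    def forward(self, h_a: torch.Tensor, h_b: torch.Tensor):
        out_a = torch.relu(self.branch_a(h_a) + self.b_to_a(h_b))
        out_b = torch.relu(self.branch_b(h_b) + self.a_to_b(h_a))
        return out_a, out_b

layer = CrossTalkLayer(dim=64)
h_fault, h_phys = torch.randn(8, 64), torch.randn(8, 64)
h_fault, h_phys = layer(h_fault, h_phys)   # each task keeps its own stream
```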
[421] CoSA: Compressed Sensing-Based Adaptation of Large Language Models
Songtao Wei, Yi Li, Bohan Zhang, Zhichun Guo, Ying Huang, Yuede Ji, Miao Yin, Guanpeng Li, Bingzhe Li
Main category: cs.LG
TL;DR: CoSA: A new Parameter-Efficient Fine-Tuning method using compressed sensing theory to express weight updates through random projections and a compact learnable core, overcoming low-rank limitations of existing PEFT methods.
Details
Motivation: Existing PEFT methods like LoRA and PiSSA rely on low-rank decompositions which may restrict expressivity, especially when task-specific adaptations have uniformly distributed singular values. There's a need for more expressive yet parameter-efficient adaptation methods.
Method: CoSA extends compressed sensing theory to PEFT. Instead of low-rank constraints, it expresses weight updates through fixed random projection matrices and a compact learnable core. The method provides formal theoretical analysis as a synthesis process, proving weight updates can be compactly encoded into low-dimensional space and mapped back through random projections.
Result: Extensive experiments on 10 diverse tasks (natural language understanding and generation) using 5 models from RoBERTa, Llama, and Qwen families show CoSA consistently matches or outperforms state-of-the-art PEFT methods across different model scales.
Conclusion: CoSA provides a principled perspective for efficient and expressive multi-scale model adaptation, offering theoretical foundations from compressed sensing while achieving practical performance improvements over existing PEFT approaches.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a practical paradigm for adapting large language models (LLMs) without updating all parameters. Most existing approaches, such as LoRA and PiSSA, rely on low-rank decompositions of weight updates. However, the low-rank assumption may restrict expressivity, particularly in task-specific adaptation scenarios where singular values are distributed relatively uniformly. To address this limitation, we propose CoSA (Compressed Sensing-Based Adaptation), a new PEFT method extended from compressed sensing theory. Instead of constraining weight updates to a low-rank subspace, CoSA expresses them through fixed random projection matrices and a compact learnable core. We provide a formal theoretical analysis of CoSA as a synthesis process, proving that weight updates can be compactly encoded into a low-dimensional space and mapped back through random projections. Extensive experimental results show that CoSA provides a principled perspective for efficient and expressive multi-scale model adaptation. Specifically, we evaluate CoSA on 10 diverse tasks, including natural language understanding and generation, employing 5 models of different scales from RoBERTa, Llama, and Qwen families. Across these settings, CoSA consistently matches or outperforms state-of-the-art PEFT methods.
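Editor's note: the exact CoSA parameterization and scaling are in the paper; a minimal sketch of the general pattern (a frozen base layer whose update is synthesized from a small learnable core through fixed random projections) might look like this, with all shapes, initializations, and scalings chosen for illustration only.

```python
import torch
import torch.nn as nn

class CompressedAdapterLinear(nn.Module):
    """Frozen base weight plus an update expressed through fixed random
    projections and a small learnable core (a compressed-sensing-style sketch)."""

    def __init__(self, base: nn.Linear, core_dim: int = 32):
        super().__init__()
        out_f, in_f = base.weight.shape
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Fixed random measurement/synthesis matrices (buffers, never trained).
        self.register_buffer("proj_out", torch.randn(out_f, core_dim) / core_dim ** 0.5)
        self.register_buffer("proj_in", torch.randn(core_dim, in_f) / in_f ** 0.5)
        # Compact learnable core; zero init so the adapted layer starts at the base.
        self.core = nn.Parameter(torch.zeros(core_dim, core_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.proj_out @ self.core @ self.proj_in   # synthesized update
        return self.base(x) + x @ delta_w.t()

layer = CompressedAdapterLinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 1024 trainable
```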
[422] Position: Capability Control Should be a Separate Goal From Alignment
Shoaib Ahmed Siddiqui, Eleni Triantafillou, David Krueger, Adrian Weller
Main category: cs.LG
TL;DR: This position paper argues for treating capability control as distinct from alignment, proposing a defense-in-depth approach with three layers of control mechanisms across the model lifecycle to impose hard operational limits on model behavior.
Details
Motivation: Foundation models have broad capabilities that enable many applications but also expand potential misuse and failure modes. Current alignment approaches are often context-dependent, while capability control aims to impose hard operational limits on permissible behaviors, especially under adversarial conditions.
Method: Proposes a three-layer framework for capability control: (1) data-based control of training distribution, (2) learning-based control via weight- or representation-level interventions, and (3) system-based control via post-deployment guardrails over inputs, outputs, and actions. Advocates for defense-in-depth by composing complementary controls across all layers.
Result: The paper organizes capability control mechanisms systematically and identifies that each layer has characteristic failure modes when used in isolation, necessitating a comprehensive, multi-layered approach to achieve robust control.
Conclusion: Capability control should be treated as distinct from alignment, requiring a defense-in-depth strategy across the full model stack. Key challenges include the dual-use nature of knowledge and compositional generalization that complicate control efforts.
Abstract: Foundation models are trained on broad data distributions, yielding generalist capabilities that enable many downstream applications but also expand the space of potential misuse and failures. This position paper argues that capability control – imposing restrictions on permissible model behavior – should be treated as a distinct goal from alignment. While alignment is often context and preference-driven, capability control aims to impose hard operational limits on permissible behaviors, including under adversarial elicitation. We organize capability control mechanisms across the model lifecycle into three layers: (i) data-based control of the training distribution, (ii) learning-based control via weight- or representation-level interventions, and (iii) system-based control via post-deployment guardrails over inputs, outputs, and actions. Because each layer has characteristic failure modes when used in isolation, we advocate for a defense-in-depth approach that composes complementary controls across the full stack. We further outline key open challenges in achieving such control, including the dual-use nature of knowledge and compositional generalization.
[423] EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li, Abhishek Kumar, Xiangjun Fan, Weiwei Li, Lizhu Zhang
Main category: cs.LG
TL;DR: EBPO improves RLVR for LLMs by using empirical Bayes shrinkage to stabilize policy optimization, reducing variance and preventing vanishing gradients in failure regimes.
Details
Motivation: Current RLVR methods like GRPO suffer from stability issues: high variance with small group sizes and vanishing gradients when all responses get zero rewards (saturated failure regimes).
Method: Empirical Bayes Policy Optimization (EBPO) regularizes local group baselines by borrowing strength from global policy statistics using shrinkage estimators, with global prior updated via Welford's online algorithm.
Result: EBPO outperforms GRPO and other baselines across benchmarks (AIME, OlympiadBench), shows superior training stability, works well with small group sizes, and benefits from difficulty-stratified curriculum learning.
Conclusion: EBPO provides a more stable and effective RLVR framework for LLMs by addressing critical stability challenges in existing approaches through empirical Bayes regularization.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for enhancing the reasoning capabilities of Large Language Models (LLMs). However, dominant approaches like Group Relative Policy Optimization (GRPO) face critical stability challenges: they suffer from high estimator variance under computational constraints (small group sizes) and vanishing gradient signals in saturated failure regimes where all responses yield identical zero rewards. To address this, we propose Empirical Bayes Policy Optimization (EBPO), a novel framework that regularizes local group-based baselines by borrowing strength from the policy’s accumulated global statistics. Instead of estimating baselines in isolation, EBPO employs a shrinkage estimator that dynamically balances local group statistics with a global prior updated via Welford’s online algorithm. Theoretically, we demonstrate that EBPO guarantees strictly lower Mean Squared Error (MSE), bounded entropy decay, and non-vanishing penalty signals in failure scenarios compared to GRPO. Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench. Notably, EBPO exhibits superior training stability, achieving high-performance gains even with small group sizes, and benefits significantly from difficulty-stratified curriculum learning.
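Editor's note: a minimal sketch of the two mechanisms named in the abstract, a Welford-style running global baseline and a group baseline shrunk toward it. The shrinkage weight schedule here is an illustrative empirical-Bayes-style choice, not the paper's exact estimator.

```python
class GlobalBaseline:
    """Welford-style running mean/variance of rewards seen so far."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, reward: float) -> None:
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / self.count if self.count > 1 else 0.0


def shrunk_baseline(group_rewards: list[float], glob: GlobalBaseline,
                    prior_strength: float = 8.0) -> float:
    """Blend the local group mean with the global running mean.

    Small or degenerate all-zero groups lean on the global statistic, which
    keeps the advantage signal from collapsing; the weighting rule is an
    illustrative choice, not the paper's exact one.
    """
    n = len(group_rewards)
    local_mean = sum(group_rewards) / n
    lam = n / (n + prior_strength)
    return lam * local_mean + (1.0 - lam) * glob.mean


glob = GlobalBaseline()
for r in [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]:
    glob.update(r)
print(shrunk_baseline([0.0, 0.0, 0.0, 0.0], glob))   # nonzero despite all-zero group
```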
[424] Benchmarking Artificial Intelligence Models for Daily Coastal Hypoxia Forecasting
Magesh Rajasekaran, Md Saiful Sajol, Chris Alvin, Supratik Mukhopadhyay, Yanda Ou, Z. George Xue
Main category: cs.LG
TL;DR: Deep learning models for daily hypoxia classification in Gulf of Mexico using BiLSTM, Medformer, ST-Transformer, and TCN architectures, with ST-Transformer achieving best performance for operational real-time prediction.
Details
Motivation: Coastal hypoxia in the Gulf of Mexico requires fine-scale daily forecasts for responsive ecosystem management, but existing seasonal models are too coarse. Need for operational real-time hypoxia prediction to support environmental monitoring and ecosystem resilience.
Method: Compare four deep learning architectures (BiLSTM, Medformer, ST-Transformer, TCN) for daily hypoxia classification using 12 years of hindcast data (2009-2020). Models incorporate water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates. Use the same preprocessing, input/output formulation, and validation protocols. Evaluate with McNemar's test for statistical significance.
Result: All models achieved high classification accuracy and strong discriminative ability. ST-Transformer achieved highest performance across all metrics and test periods (AUC-ROC: 0.982-0.992). McNemar’s test identified statistically significant differences in model predictions.
Conclusion: Developed reproducible framework for operational real-time hypoxia prediction that can support environmental and ocean modeling systems. ST-Transformer outperforms other architectures for this spatio-temporal classification task.
Abstract: Coastal hypoxia, especially in the northern Gulf of Mexico, presents a persistent ecological and economic concern. Seasonal models offer coarse forecasts that miss the fine-scale variability needed for daily, responsive ecosystem management. We present a study that compares four deep learning architectures for daily hypoxia classification: Bidirectional Long Short-Term Memory (BiLSTM), Medformer (Medical Transformer), Spatio-Temporal Transformer (ST-Transformer), and Temporal Convolutional Network (TCN). We trained our models on twelve years (2009-2020) of daily hindcast data from a coupled hydrodynamic-biogeochemical model, and we use hindcast data from 2020 through 2024 as test data. We constructed classification models incorporating water column stratification, sediment oxygen consumption, and temperature-dependent decomposition rates. We evaluated each architecture using the same data preprocessing, input/output formulation, and validation protocols. Each model achieved high classification accuracy and strong discriminative ability, with ST-Transformer achieving the highest performance across all metrics and test periods (AUC-ROC: 0.982-0.992). We also employed McNemar’s method to identify statistically significant differences in model predictions. Our contribution is a reproducible framework for operational real-time hypoxia prediction that can support broader efforts in the environmental and ocean modeling systems community and in ecosystem resilience. The source code is available at https://github.com/rmagesh148/hypoxia-ai/
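Editor's note: McNemar's test, used above to compare the classifiers, only needs the counts of discordant predictions on a shared test set. A continuity-corrected version can be computed directly; the toy models in the example are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true: np.ndarray, pred_a: np.ndarray, pred_b: np.ndarray) -> float:
    """p-value for H0: models A and B have the same error rate.

    Only discordant cases matter: b = A right & B wrong, c = A wrong & B right.
    """
    a_correct, b_correct = pred_a == y_true, pred_b == y_true
    b = int(np.sum(a_correct & ~b_correct))
    c = int(np.sum(~a_correct & b_correct))
    stat = (abs(b - c) - 1) ** 2 / (b + c) if (b + c) > 0 else 0.0
    return float(chi2.sf(stat, df=1))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
pa = np.where(rng.random(500) < 0.9, y, 1 - y)   # ~90% accurate model
pb = np.where(rng.random(500) < 0.8, y, 1 - y)   # ~80% accurate model
print(mcnemar_test(y, pa, pb))                   # small p-value expected
```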
[425] Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning
John Yan, Michael Yu, Yuqi Sun, Alexander Duffy, Tyler Marques, Matthew Lyle Olson
Main category: cs.LG
TL;DR: SAEs and LLM-summarizers analyze RL training dynamics in complex environments like Diplomacy, revealing fine-grained behaviors and strategic patterns, though human interpretability remains challenging.
Details
Motivation: To understand how LLM behavior changes during complex RL training in multi-agent environments like Diplomacy, using interpretability tools to analyze training dynamics.
Method: Apply pretrained Sparse Autoencoders (SAEs) and LLM-summarizer methods to analyze large-scale RL training runs, introduce Meta-Autointerp for grouping SAE features into interpretable hypotheses about training dynamics.
Result: Discovered fine-grained behaviors (role-playing, degenerate outputs, language switching) and high-level strategic behaviors; validated 90% of SAE Meta-Features as significant, found reward hacking; SAE-derived hypotheses can be predictively useful for downstream tasks (+14.2% score improvement).
Conclusion: SAEs and LLM-summarizers provide complementary views into agent behavior, forming a practical starting point for data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
Abstract: Large language models (LLMs) are increasingly trained in complex, multi-agent reinforcement learning environments, making it difficult to understand how behavior changes over training. Sparse Autoencoders (SAEs) have recently been shown to be useful for data-centric interpretability. In this work, we analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy by applying pretrained SAEs, alongside LLM-summarizer methods. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We discover fine-grained behaviors including role-playing patterns, degenerate outputs, and language switching, alongside high-level strategic behaviors and environment-specific bugs. Through automated evaluation, we validate that 90% of discovered SAE Meta-Features are significant, and find a surprising reward hacking behavior. However, through two user studies, we find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans, along with most LLM-generated hypotheses. Nevertheless, a subset of SAE-derived hypotheses are predictively useful for downstream tasks. We further provide validation by augmenting an untrained agent’s system prompt, improving the score by +14.2%. Overall, we show that SAEs and LLM-summarizers provide complementary views into agent behavior, and together our framework forms a practical starting point for future data-centric interpretability work on ensuring trustworthy LLM behavior throughout training.
[426] SpectraKAN: Conditioning Spectral Operators
Chun-Wun Cheng, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero
Main category: cs.LG
TL;DR: SpectraKAN introduces an input-conditioned spectral neural operator that adapts Fourier kernels based on global system state, improving multi-scale PDE solution learning.
Details
Motivation: Existing spectral neural operators like FNO use static Fourier kernels applied uniformly across inputs, limiting their ability to capture multi-scale, regime-dependent, and anisotropic dynamics that depend on the global state of the system.
Method: SpectraKAN conditions spectral operators on input by extracting compact global representations from spatio-temporal history and using them to modulate multi-scale Fourier trunks via single-query cross-attention, creating input-conditioned integral operators while retaining spectral mixing efficiency.
Result: Achieves state-of-the-art performance across diverse PDE benchmarks, reducing RMSE by up to 49% over strong baselines, with particularly large gains on challenging spatio-temporal prediction tasks.
Conclusion: SpectraKAN provides a theoretically justified framework for adaptive spectral operators that can capture complex, multi-scale dynamics while maintaining computational efficiency, representing a significant advance in neural operator architectures for PDE solving.
Abstract: Spectral neural operators, particularly Fourier Neural Operators (FNO), are a powerful framework for learning solution operators of partial differential equations (PDEs) due to their efficient global mixing in the frequency domain. However, existing spectral operators rely on static Fourier kernels applied uniformly across inputs, limiting their ability to capture multi-scale, regime-dependent, and anisotropic dynamics governed by the global state of the system. We introduce SpectraKAN, a neural operator that conditions the spectral operator on the input itself, turning static spectral convolution into an input-conditioned integral operator. This is achieved by extracting a compact global representation from spatio-temporal history and using it to modulate a multi-scale Fourier trunk via single-query cross-attention, enabling the operator to adapt its behaviour while retaining the efficiency of spectral mixing. We provide theoretical justification showing that this modulation converges to a resolution-independent continuous operator under mesh refinement, and that the KAN component gives smooth, Lipschitz-controlled global modulation. Across diverse PDE benchmarks, SpectraKAN achieves state-of-the-art performance, reducing RMSE by up to 49% over strong baselines, with particularly large gains on challenging spatio-temporal prediction tasks.
[427] Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs
Wentao Ni, Kangqi Zhang, Zhongming Yu, Oren Nelson, Mingu Lee, Hong Cai, Fatih Porikli, Jongryool Kim, Zhijian Liu, Jishen Zhao
Main category: cs.LG
TL;DR: Double-P: Hierarchical sparse attention framework for efficient long-context LLM inference using two-stage top-p selection to optimize accuracy, selection overhead, and sparse attention cost.
Details
Motivation: As long-context inference becomes central to LLMs, attention over growing key-value caches becomes a dominant decoding bottleneck. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, while existing top-p methods fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost.
Method: Double-P uses hierarchical sparse attention with two stages: first performs coarse-grained top-p estimation at cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed.
Result: Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reduces attention computation overhead by up to 1.8x, and delivers up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
Conclusion: Double-P provides an efficient hierarchical sparse attention framework that optimizes all three critical aspects of sparse attention (accuracy, selection overhead, and computation cost) for scalable long-context LLM inference.
Abstract: As long-context inference becomes central to large language models (LLMs), attention over growing key-value caches emerges as a dominant decoding bottleneck, motivating sparse attention for scalable inference. Fixed-budget top-k sparse attention cannot adapt to heterogeneous attention distributions across heads and layers, whereas top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost, which limits their overall efficiency. We present Double-P, a hierarchical sparse attention framework that optimizes all three stages. Double-P first performs coarse-grained top-p estimation at the cluster level using size-weighted centroids, then adaptively refines computation through a second top-p stage that allocates token-level attention only when needed. Across long-context benchmarks, Double-P consistently achieves near-zero accuracy drop, reducing attention computation overhead by up to 1.8x and delivering up to 1.3x end-to-end decoding speedup over state-of-the-art fixed-budget sparse attention methods.
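Editor's note: a rough sketch of the two-stage idea, screening clusters with a top-p rule over size-weighted centroid scores and then running token-level top-p only inside the surviving clusters. The size weighting via a log-count bonus, the thresholds, and the data layout are assumptions for illustration, not the paper's kernel.

```python
import numpy as np

def top_p_mask(scores: np.ndarray, p: float) -> np.ndarray:
    """Boolean mask of the smallest set of entries whose softmax mass reaches p."""
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)
    keep = np.cumsum(probs[order]) - probs[order] < p   # keep until mass reaches p
    mask = np.zeros_like(scores, dtype=bool)
    mask[order[keep]] = True
    return mask

def hierarchical_top_p(query, keys, cluster_ids, cluster_sizes, p1=0.9, p2=0.95):
    """Stage 1: score size-weighted centroids and keep a top-p set of clusters.
    Stage 2: run token-level top-p only over keys in the surviving clusters."""
    n_clusters = cluster_sizes.shape[0]
    centroids = np.stack([keys[cluster_ids == c].mean(axis=0) for c in range(n_clusters)])
    cluster_scores = centroids @ query + np.log(cluster_sizes)   # size-weighted
    kept_clusters = np.flatnonzero(top_p_mask(cluster_scores, p1))

    candidate = np.flatnonzero(np.isin(cluster_ids, kept_clusters))
    token_scores = keys[candidate] @ query
    return candidate[top_p_mask(token_scores, p2)]   # token indices to attend to

rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 16))
cluster_ids = rng.integers(0, 8, 256)
sizes = np.bincount(cluster_ids, minlength=8).astype(float)
print(len(hierarchical_top_p(rng.standard_normal(16), keys, cluster_ids, sizes)))
```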
[428] Extreme Weather Nowcasting via Local Precipitation Pattern Prediction
Changhoon Song, Teng Yuan Chang, Youngjoon Hong
Main category: cs.LG
TL;DR: exPreCast: Efficient deterministic framework for precipitation nowcasting with balanced dataset handling both normal and extreme rainfall events
Details
Motivation: Current precipitation nowcasting models face challenges: diffusion-based models are computationally expensive for real-time use, deterministic models are biased toward normal rainfall, and existing datasets are skewed toward either ordinary or extreme events, limiting real-world applicability.
Method: Proposes exPreCast framework with local spatiotemporal attention, texture-preserving cubic dual upsampling decoder, and temporal extractor for flexible forecasting horizons. Introduces balanced KMA radar dataset covering both ordinary precipitation and extreme events.
Result: Achieves state-of-the-art performance on established benchmarks (SEVIR and MeteoNet) and the new KMA dataset, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
Conclusion: exPreCast provides an efficient deterministic solution for precipitation nowcasting that handles both normal and extreme rainfall events effectively, addressing computational and dataset bias limitations of previous approaches.
Abstract: Accurate forecasting of extreme weather events such as heavy rainfall or storms is critical for risk management and disaster mitigation. Although high-resolution radar observations have spurred extensive research on nowcasting models, precipitation nowcasting remains particularly challenging due to pronounced spatial locality, intricate fine-scale rainfall structures, and variability in forecasting horizons. While recent diffusion-based generative ensembles show promising results, they are computationally expensive and unsuitable for real-time applications. In contrast, deterministic models are computationally efficient but remain biased toward normal rainfall. Furthermore, the benchmark datasets commonly used in prior studies are themselves skewed–either dominated by ordinary rainfall events or restricted to extreme rainfall episodes–thereby hindering general applicability in real-world settings. In this paper, we propose exPreCast, an efficient deterministic framework for generating finely detailed radar forecasts, and introduce a newly constructed balanced radar dataset from the Korea Meteorological Administration (KMA), which encompasses both ordinary precipitation and extreme events. Our model integrates local spatiotemporal attention, a texture-preserving cubic dual upsampling decoder, and a temporal extractor to flexibly adjust forecasting horizons. Experiments on established benchmarks (SEVIR and MeteoNet) as well as on the balanced KMA dataset demonstrate that our approach achieves state-of-the-art performance, delivering accurate and reliable nowcasts across both normal and extreme rainfall regimes.
[429] Disentangled Representation Learning via Flow Matching
Jinjin Chi, Taoping Liu, Mengtao Yin, Ximing Li, Yongcheng Jing, Dacheng Tao
Main category: cs.LG
TL;DR: A flow matching framework for disentangled representation learning that uses factor-conditioned flows and orthogonality regularization to achieve better semantic alignment and disentanglement.
Details
Motivation: Existing diffusion-based disentanglement methods often lack strong semantic alignment despite encouraging factor independence through inductive biases. There's a need for approaches that better align learned factors with underlying semantic concepts while maintaining disentanglement.
Method: Proposes a flow matching-based framework that casts disentanglement as learning factor-conditioned flows in a compact latent space. Introduces a non-overlap (orthogonality) regularizer to suppress cross-factor interference and reduce information leakage between factors.
Result: Extensive experiments across multiple datasets show consistent improvements over representative baselines, yielding higher disentanglement scores as well as improved controllability and sample fidelity.
Conclusion: The flow matching approach with orthogonality regularization provides an effective framework for learning semantically aligned disentangled representations with better factor separation and generation quality.
Abstract: Disentangled representation learning aims to capture the underlying explanatory factors of observed data, enabling a principled understanding of the data-generating process. Recent advances in generative modeling have introduced new paradigms for learning such representations. However, existing diffusion-based methods encourage factor independence via inductive biases, yet frequently lack strong semantic alignment. In this work, we propose a flow matching-based framework for disentangled representation learning, which casts disentanglement as learning factor-conditioned flows in a compact latent space. To enforce explicit semantic alignment, we introduce a non-overlap (orthogonality) regularizer that suppresses cross-factor interference and reduces information leakage between factors. Extensive experiments across multiple datasets demonstrate consistent improvements over representative baselines, yielding higher disentanglement scores as well as improved controllability and sample fidelity.
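A compact sketch of the two ingredients described above, factor-conditioned flow matching plus a non-overlap penalty, is given below. The straight-line (rectified-flow) interpolation, the Gaussian source, and the exact form of the orthogonality penalty are assumptions; `vel_net` is a hypothetical velocity network.

```python
import torch
import torch.nn.functional as F

def non_overlap_penalty(factor_embs: torch.Tensor) -> torch.Tensor:
    """Penalise off-diagonal entries of the Gram matrix of per-factor embeddings
    so that factors occupy (approximately) orthogonal directions."""
    z = F.normalize(factor_embs, dim=-1)
    gram = z @ z.t()
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.pow(2).sum()

def factor_conditioned_fm_loss(vel_net, x1, factor_embs, factor_id):
    """Conditional flow-matching loss: regress the straight-line velocity x1 - x0
    at an interpolated point, conditioned on the active factor's embedding."""
    x0 = torch.randn_like(x1)                      # simple Gaussian source
    t = torch.rand(x1.size(0), 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1
    cond = factor_embs[factor_id]                  # (batch, dim)
    return F.mse_loss(vel_net(x_t, t, cond), x1 - x0)
```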
[430] Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin
Main category: cs.LG
TL;DR: CDAS is a new model steering method using distributed interchange interventions with distribution matching for more faithful and stable control compared to preference optimization approaches.
Details
Motivation: Current intervention-based steering methods often underperform and generate unnatural outputs because they adapt strong optimization objectives from fine-tuning, leading to overfitting. Effective steering requires faithful identification of internal model mechanisms rather than enforcement of external preferences.
Method: Builds on distributed alignment search (DAS) principles, using distributed interchange interventions (DII) with a novel distribution matching objective that aligns intervened output distributions with counterfactual distributions. Uses weak-supervised distribution matching instead of probability maximization, enabling bi-directional steering and data-derived steering factors.
Result: On AxBench benchmark, CDAS doesn’t always outperform preference-optimization methods but benefits more from increased model scale. In safety case studies (overriding refusal behaviors and neutralizing chain-of-thought backdoors), achieves systematic steering while maintaining general model utility.
Conclusion: CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering, offering more faithful and stable control through distribution matching and DII mechanisms.
Abstract: Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak-supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering. Our code is available at https://github.com/colored-dye/concept_das.
[431] Private Prediction via Shrinkage
Chao Yan
Main category: cs.LG
TL;DR: Improved differentially private prediction with polylogarithmic dependence on number of queries instead of square root dependence
Details
Motivation: Standard differentially private prediction suffers from √T dependence on number of queries T, which is inefficient for large query streams. The paper aims to reduce this to polylogarithmic dependence.
Method: Develops private predictors for streaming settings: 1) for oblivious online adversaries and any concept class using VC dimension analysis, 2) for adaptive online adversaries and halfspaces using geometric techniques
Result: Achieves polylogarithmic dependence on T: Õ(VC(C)^{3.5}·log^{3.5}T) samples for oblivious adversaries, and Õ(d^{5.5}·log T) for adaptive adversaries with halfspaces
Conclusion: Significant improvement over standard composition for differentially private prediction in streaming settings, enabling efficient handling of large query streams
Abstract: We study differentially private prediction introduced by Dwork and Feldman (COLT 2018): an algorithm receives one labeled sample set $S$ and then answers a stream of unlabeled queries while the output transcript remains $(\varepsilon,\delta)$-differentially private with respect to $S$. Standard composition yields a $\sqrt{T}$ dependence for $T$ queries. We show that this dependence can be reduced to polylogarithmic in $T$ in streaming settings. For an oblivious online adversary and any concept class $\mathcal{C}$, we give a private predictor that answers $T$ queries with $|S|= \tilde{O}(VC(\mathcal{C})^{3.5}\log^{3.5}T)$ labeled examples. For an adaptive online adversary and halfspaces over $\mathbb{R}^d$, we obtain $|S|=\tilde{O}\left(d^{5.5}\log T\right)$.
[432] Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction
David Alejandro Trejo Pizzo
Main category: cs.LG
TL;DR: HGF introduces a dual-stream architecture combining 1.58-bit ternary backbone with FP16 correction path to address memory wall in edge LLMs, recovering 55% of quality gap with minimal memory overhead.
Details
Motivation: Address the "Memory Wall" bottleneck in deploying LLMs on edge devices by improving upon existing 1.58-bit quantization techniques that suffer 20-25% perplexity degradation compared to FP16 baselines.
Method: Hybrid Gated Flow (HGF) - dual-stream architecture coupling a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates.
Result: Achieved validation loss of 0.9306 vs BitNet’s 1.0294, recovering ~55% of quality gap between ternary quantization and FP16 baseline with only ~12-15% memory overhead; demonstrated emergent phenomenon of quantization as structural regularization.
Conclusion: HGF effectively balances memory efficiency and model quality for edge deployment, with architectural stability and quality recovery scaling linearly to production-grade models up to 3B parameters.
Abstract: The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the “Memory Wall” – a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet’s 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.
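A minimal sketch of such a dual-stream layer is shown below, assuming a BitNet-style absmean ternarisation, a LoRA-like rank-16 correction, and a per-channel sigmoid gate; the gate parameterisation and the straight-through training details of HGF are not reproduced here.

```python
import torch
import torch.nn as nn

def ternarize(w: torch.Tensor) -> torch.Tensor:
    """Absmean ternarisation to {-1, 0, +1} * scale (the straight-through
    estimator needed for training is omitted in this sketch)."""
    scale = w.abs().mean()
    return torch.clamp((w / (scale + 1e-8)).round(), -1, 1) * scale

class HybridGatedLinear(nn.Module):
    """Dual-stream linear layer: a ternary backbone path plus a low-rank FP16
    correction path, mixed by an adaptive per-channel gate."""
    def __init__(self, d_in: int, d_out: int, rank: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d_out, d_in))
        nn.init.kaiming_uniform_(self.weight)
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)              # start as a pure ternary layer
        self.gate = nn.Parameter(torch.zeros(d_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main = x @ ternarize(self.weight).t()       # 1.58-bit backbone stream
        corr = self.up(self.down(x))                # FP16 correction stream
        return main + torch.sigmoid(self.gate) * corr
```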
[433] ZeroS: Zero-Sum Linear Attention for Efficient Transformers
Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang
Main category: cs.LG
TL;DR: ZeroS is a linear attention method that addresses limitations of existing approaches by enabling both positive/negative weights and contrastive operations while maintaining O(N) complexity, matching or exceeding softmax attention performance.
Details
Motivation: Current linear attention methods offer O(N) complexity but underperform standard softmax attention due to two fundamental limitations: restriction to convex combinations (only additive information blending) and uniform accumulated weight bias that dilutes attention in long contexts.
Method: Proposes Zero-Sum Linear Attention (ZeroS) which removes the constant zero-order term 1/t and reweights the remaining zero-sum softmax residuals. This creates mathematically stable weights that can be both positive and negative, enabling a single attention layer to perform contrastive operations while maintaining O(N) complexity.
Result: ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks while maintaining O(N) complexity.
Conclusion: ZeroS addresses fundamental limitations of linear attention methods, enabling both positive and negative attention weights and contrastive operations while maintaining efficiency, making it a competitive alternative to standard softmax attention.
Abstract: Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only permits additive information blending, and uniform accumulated weight bias that dilutes attention in long contexts. We propose Zero-Sum Linear Attention (ZeroS), which addresses these limitations by removing the constant zero-order term $1/t$ and reweighting the remaining zero-sum softmax residuals. This modification creates mathematically stable weights, enabling both positive and negative values and allowing a single attention layer to perform contrastive operations. While maintaining $O(N)$ complexity, ZeroS theoretically expands the set of representable functions compared to convex combinations. Empirically, it matches or exceeds standard softmax attention across various sequence modeling benchmarks.
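The zero-sum idea itself is easy to see in isolation: softmax weights over t keys decompose into a uniform 1/t term plus a residual that sums to zero, and dropping the uniform term leaves signed weights. The snippet below only illustrates that decomposition; it is not the paper's O(N) linear-attention formulation.

```python
import torch

def zero_sum_reweighting(scores: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Decompose softmax weights into uniform 1/t plus a zero-sum residual and
    keep only the (rescaled) residual, which can be negative."""
    t = scores.shape[-1]
    w = torch.softmax(scores, dim=-1)   # convex weights, sum to 1
    residual = w - 1.0 / t              # zero-sum component, sums to 0
    return alpha * residual             # signed weights, no uniform dilution

scores = torch.tensor([2.0, 0.5, -1.0, 0.1])
print(torch.softmax(scores, -1))        # all positive
print(zero_sum_reweighting(scores))     # mixed signs, allows subtraction
```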
[434] Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities
Pengyi Li, Elizaveta Goncharova, Andrey Kuznetsov, Ivan Oseledets
Main category: cs.LG
TL;DR: ProGRPO introduces Advantage Re-weighting Mechanism (ARM) to address mode collapse in RLVR by equilibrating confidence across correct responses, enhancing diversity while maintaining accuracy.
Details
Motivation: Standard RLVR methods like GRPO converge to low-entropy policies causing mode collapse and limited output diversity in LLM reasoning tasks, disproportionately reinforcing highest-likelihood paths while suppressing valid alternatives.
Method: Proposes Advantage Re-weighting Mechanism (ARM) incorporating Prompt Perplexity and Answer Confidence into advantage estimation to dynamically reshape reward signals, attenuating gradient updates of over-confident paths while redistributing probability mass toward under-explored correct solutions.
Result: Significantly enhances generative diversity and response entropy while maintaining competitive accuracy; on Qwen2.5-7B, outperforms GRPO by 5.7% in Pass@1 and 13.9% in Pass@32, showing superior capability in generating diverse correct reasoning paths.
Conclusion: ProGRPO effectively addresses entropy collapse in RLVR, achieving better exploration-exploitation trade-off in reasoning tasks by equilibrating confidence across correct responses through novel advantage re-weighting.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). However, standard policy optimization methods, such as Group Relative Policy Optimization (GRPO), often converge to low-entropy policies, leading to severe mode collapse and limited output diversity. We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths, thereby suppressing valid alternative reasoning chains. To address this, we propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses. By incorporating Prompt Perplexity and Answer Confidence into the advantage estimation, our method dynamically reshapes the reward signal to attenuate the gradient updates of over-confident reasoning paths, while redistributing probability mass toward under-explored correct solutions. Empirical results demonstrate that our approach significantly enhances generative diversity and response entropy while maintaining competitive accuracy, effectively achieving a superior trade-off between exploration and exploitation in reasoning tasks. Empirical results on Qwen2.5 and DeepSeek models across mathematical and coding benchmarks show that ProGRPO significantly mitigates entropy collapse. Specifically, on Qwen2.5-7B, our method outperforms GRPO by 5.7% in Pass@1 and, notably, by 13.9% in Pass@32, highlighting its superior capability in generating diverse correct reasoning paths.
[435] Balanced Anomaly-guided Ego-graph Diffusion Model for Inductive Graph Anomaly Detection
Chunyu Wei, Siyuan He, Yu Wang, Yueguo Chen, Yunhai Wang, Bing Bai, Yidong Zhang, Yong Xie, Shunming Zhang, Fei Wang
Main category: cs.LG
TL;DR: Novel data-centric framework for graph anomaly detection using discrete ego-graph diffusion and curriculum anomaly augmentation to address dynamic graph challenges and class imbalance.
Details
Motivation: Address two major challenges in graph anomaly detection: 1) transductive learning limitations for dynamic graphs, and 2) extreme class imbalance causing biased models that fail to generalize to unseen anomalies.
Method: Proposes a data-centric framework with: (1) discrete ego-graph diffusion model capturing local anomaly topology to generate anomalous structural distributions, and (2) curriculum anomaly augmentation mechanism that dynamically adjusts synthetic data generation during training to focus on underrepresented patterns.
Result: Experiments on five datasets demonstrate the effectiveness of the proposed framework in improving anomaly detection and generalization.
Conclusion: The framework successfully addresses both dynamic graph modeling and class imbalance challenges through integrated data-centric approaches, showing improved detection performance across multiple datasets.
Abstract: Graph anomaly detection (GAD) is crucial in applications like fraud detection and cybersecurity. Despite recent advancements using graph neural networks (GNNs), two major challenges persist. At the model level, most methods adopt a transductive learning paradigm, which assumes static graph structures, making them unsuitable for dynamic, evolving networks. At the data level, the extreme class imbalance, where anomalous nodes are rare, leads to biased models that fail to generalize to unseen anomalies. These challenges are interdependent: static transductive frameworks limit effective data augmentation, while imbalance exacerbates model distortion in inductive learning settings. To address these challenges, we propose a novel data-centric framework that integrates dynamic graph modeling with balanced anomaly synthesis. Our framework features: (1) a discrete ego-graph diffusion model, which captures the local topology of anomalies to generate ego-graphs aligned with anomalous structural distribution, and (2) a curriculum anomaly augmentation mechanism, which dynamically adjusts synthetic data generation during training, focusing on underrepresented anomaly patterns to improve detection and generalization. Experiments on five datasets demonstrate the effectiveness of our framework.
[436] CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers
Boxiang Zhang, Baijian Yang
Main category: cs.LG
TL;DR: CORP is a closed-form one-shot structured pruning framework for Vision Transformers that removes MLP dimensions and attention structures without retraining, using only a small unlabeled calibration set.
Details
Motivation: Vision Transformers have high compute/memory costs, and existing pruning methods require retraining or multi-stage optimization, limiting post-training deployment. There's a need for efficient pruning that works under strict post-training constraints.
Method: CORP formulates structured pruning as a representation recovery problem, modeling removed activations/attention logits as affine functions of retained components. It derives closed-form ridge regression solutions that fold compensation into model weights to minimize expected representation error.
Result: On ImageNet with DeiT models, CORP preserves accuracy under aggressive sparsity. For DeiT-Huge, it retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures, completing pruning in under 20 minutes on a single GPU.
Conclusion: CORP enables efficient post-training structured pruning of Vision Transformers without labels, gradients, or fine-tuning, demonstrating strong redundancy in MLP and attention representations while delivering real-world efficiency gains.
Abstract: Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
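For the MLP branch, the closed-form compensation can be sketched as an ordinary ridge regression on calibration activations. The snippet below is an illustration under simplifying assumptions (a linear rather than affine map and a single global regulariser lam) and ignores the attention-logit case.

```python
import torch

def ridge_compensate_mlp(w2, acts, keep, lam: float = 1e-3):
    """Fold a ridge-regression reconstruction of pruned hidden units into the
    MLP output projection.
      w2:   (d_out, d_hidden) output projection
      acts: (n_calib, d_hidden) hidden activations on calibration inputs
      keep: boolean mask over hidden units to retain"""
    a_keep, a_drop = acts[:, keep], acts[:, ~keep]
    # Closed-form ridge solution for a_drop ≈ a_keep @ M.
    gram = a_keep.t() @ a_keep + lam * torch.eye(a_keep.shape[1])
    M = torch.linalg.solve(gram, a_keep.t() @ a_drop)     # (d_keep, d_drop)
    # y = w2 @ a  ≈  (w2[:, keep] + w2[:, ~keep] @ M.T) @ a_keep
    return w2[:, keep] + w2[:, ~keep] @ M.t()             # (d_out, d_keep)
```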
[437] TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training
Guanjie Cheng, Boyi Li, Lingyu Sun, Mengying Zhu, Yangyang Wu, Xinkui Zhao, Shuiguang Deng
Main category: cs.LG
TL;DR: TADS is a task-aware data selection framework for multimodal pre-training that optimizes data quality, task relevance, and diversity to improve training efficiency and downstream performance.
Details
Motivation: Raw web-crawled datasets for multimodal pre-training are noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based (biased) or task-agnostic, failing to optimize for multi-task scenarios.
Method: TADS integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. It uses unimodal/cross-modal quality assessment, quantifies task relevance via similarity vectors, optimizes diversity through cluster-based weighting, and employs feedback-driven meta-learning to refine selection based on proxy model performance across multiple downstream tasks.
Result: On CC12M dataset, TADS achieves superior zero-shot performance on ImageNet, CIFAR-100, MS-COCO, and Flickr30K benchmarks, using only 36% of the data while outperforming baselines by an average of 1.0%.
Conclusion: TADS significantly enhances data efficiency by curating high-utility subsets that yield higher performance ceilings within the same computational constraints, demonstrating that intelligent data selection is crucial for effective multimodal pre-training.
Abstract: Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based, suffering from bias and limited diversity, or data-driven but task-agnostic, failing to optimize for multi-task scenarios. To address these gaps, we introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training that integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. TADS employs a comprehensive quality assessment system with unimodal and cross-modal operators, quantifies task relevance via interpretable similarity vectors, and optimizes diversity through cluster-based weighting. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance across multiple downstream tasks. Experiments on CC12M demonstrate that TADS achieves superior zero-shot performance on benchmarks like ImageNet, CIFAR-100, MS-COCO, and Flickr30K, using only 36% of the data while outperforming baselines by an average of 1.0%. This highlights that TADS significantly enhances data efficiency by curating a high-utility subset that yields a much higher performance ceiling within the same computational constraints.
[438] When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging
Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi
Main category: cs.LG
TL;DR: SVC is a training-free, data-free post-processing method that addresses over-counting of shared knowledge in model merging by calibrating inflated singular values in overlapping spectral directions.
Details
Motivation: Existing model merging methods focus on resolving task conflicts but fail to address the problem of over-counting shared knowledge when tasks share aligned spectral directions, leading to biased merged models.
Method: Singular Value Calibration (SVC) quantifies subspace overlap between tasks and rescales inflated singular values to restore balanced spectrum, modifying only singular values without additional training or data.
Result: SVC consistently improves strong merging baselines across vision and language benchmarks, achieving state-of-the-art performance and improving Task Arithmetic by 13.0%.
Conclusion: SVC effectively addresses the over-counting problem in model merging through spectral calibration, providing a simple yet powerful post-processing solution that works across modalities.
Abstract: Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at: https://github.com/lyymuwu/SVC.
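The failure mode and a calibration in its spirit can be sketched in a few lines. The overlap measure and rescaling rule below (counting tasks with substantial energy along each merged singular direction and dividing by that count) are hypothetical stand-ins; SVC's actual calibration differs.

```python
import torch

def calibrate_singular_values(task_updates, overlap_thresh: float = 0.5):
    """Sum task weight updates, take an SVD of the merged update, and deflate
    each singular value by the number of tasks that contribute substantially
    to that direction (illustrative, not the paper's rule)."""
    merged = sum(task_updates)
    U, S, Vh = torch.linalg.svd(merged, full_matrices=False)
    counts = torch.zeros_like(S)
    for dW in task_updates:
        energy = (U.t() @ dW @ Vh.t()).diagonal().abs()   # energy per direction
        counts += (energy > overlap_thresh * energy.max()).float()
    return U @ torch.diag(S / counts.clamp(min=1.0)) @ Vh
```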
[439] Robust Inference-Time Steering of Protein Diffusion Models via Embedding Optimization
Minhuan Li, Jiequn Han, Pilar Cossio, Luhuan Wu
Main category: cs.LG
TL;DR: EmbedOpt is a new inference-time method for steering diffusion models to optimize experimental likelihoods in conditional embedding space rather than coordinate space, improving robustness and efficiency for biomolecular conformation generation.
Details
Motivation: Current posterior sampling methods for biomolecular conformation generation using diffusion models face limitations when target conformations lie in low-density regions of the prior, requiring aggressive and brittle likelihood weighting that reduces robustness.
Method: EmbedOpt optimizes experimental likelihoods in the conditional embedding space of diffusion models rather than directly in coordinate space. This embedding space encodes rich sequence and coevolutionary signals, allowing effective shifting of the diffusion prior to align with experimental constraints.
Result: EmbedOpt outperforms coordinate-based posterior sampling in cryo-EM map fitting tasks, matches performance on distance constraint tasks, shows superior robustness across hyperparameters spanning two orders of magnitude, and enables significant reduction in diffusion steps for better efficiency.
Conclusion: Optimizing in embedding space rather than coordinate space provides a more robust and efficient approach for steering diffusion models to generate biomolecular conformations consistent with experimental measurements.
Abstract: In many biophysical inverse problems, the goal is to generate biomolecular conformations that are both physically plausible and consistent with experimental measurements. As recent sequence-to-structure diffusion models provide powerful data-driven priors, posterior sampling has emerged as a popular framework by guiding atomic coordinates to target conformations using experimental likelihoods. However, when the target lies in a low-density region of the prior, posterior sampling requires aggressive and brittle weighting of the likelihood guidance. Motivated by this limitation, we propose EmbedOpt, an alternative inference-time approach for steering diffusion models to optimize experimental likelihoods in the conditional embedding space. As this space encodes rich sequence and coevolutionary signals, optimizing over it effectively shifts the diffusion prior to align with experimental constraints. We validate EmbedOpt on two benchmarks simulating cryo-electron microscopy map fitting and experimental distance constraints. We show that EmbedOpt outperforms the coordinate-based posterior sampling method in map fitting tasks, matches performance on distance constraint tasks, and exhibits superior engineering robustness across hyperparameters spanning two orders of magnitude. Moreover, its smooth optimization behavior enables a significant reduction in the number of diffusion steps required for inference, leading to better efficiency.
[440] Steering Large Reasoning Models towards Concise Reasoning via Flow Matching
Yawei Li, Benjamin Bergner, Yinghan Zhao, Vihang Prakash Patil, Bei Chen, Cheng Wang
Main category: cs.LG
TL;DR: FlowSteer introduces a nonlinear steering method using Flow Matching to transform verbose reasoning distributions into concise ones, improving token efficiency in Large Reasoning Models while maintaining performance.
Details
Motivation: Large Reasoning Models produce overly verbose outputs that reduce efficiency. Existing linear steering methods are limited by the restrictive linear representation hypothesis, requiring a more principled approach to control reasoning verbosity.
Method: FlowSteer uses Flow Matching to learn a complete transformation between verbose and concise reasoning distributions as a velocity field, enabling precise, input-dependent control over model representations rather than applying uniform linear shifts.
Result: FlowSteer achieves strong task performance and token efficiency across diverse reasoning benchmarks compared to leading inference-time baselines, producing more compact reasoning than linear steering methods.
Conclusion: Modeling full distributional transport with generative techniques provides a more effective and principled foundation for controlling Large Reasoning Models’ verbosity and efficiency.
Abstract: Large Reasoning Models (LRMs) excel at complex reasoning tasks, but their efficiency is often hampered by overly verbose outputs. Prior steering methods attempt to address this issue by applying a single, global vector to hidden representations – an approach grounded in the restrictive linear representation hypothesis. In this work, we introduce FlowSteer, a nonlinear steering method that goes beyond uniform linear shifts by learning a complete transformation between the distributions associated with verbose and concise reasoning. This transformation is learned via Flow Matching as a velocity field, enabling precise, input-dependent control over the model’s reasoning process. By aligning steered representations with the distribution of concise-reasoning activations, FlowSteer yields more compact reasoning than the linear shifts. Across diverse reasoning benchmarks, FlowSteer demonstrates strong task performance and token efficiency compared to leading inference-time baselines. Our work demonstrates that modeling the full distributional transport with generative techniques offers a more effective and principled foundation for controlling LRMs.
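A minimal sketch of the training and steering loop, assuming paired verbose/concise hidden activations, a straight-line (rectified-flow) interpolation, and a simple Euler integrator at inference; `vel_net`, the activation pairing, and the step count are illustrative assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def flowsteer_training_step(vel_net, verbose_acts, concise_acts, optimizer):
    """Learn a velocity field transporting verbose-reasoning activations toward
    concise-reasoning activations via flow matching."""
    x0, x1 = verbose_acts, concise_acts            # (batch, hidden_dim) each
    t = torch.rand(x0.size(0), 1, device=x0.device)
    x_t = (1 - t) * x0 + t * x1
    loss = F.mse_loss(vel_net(x_t, t), x1 - x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def steer(vel_net, hidden, steps: int = 4):
    """At inference, push a hidden state a few Euler steps along the field."""
    t = torch.zeros(hidden.size(0), 1, device=hidden.device)
    for _ in range(steps):
        hidden = hidden + vel_net(hidden, t) / steps
        t = t + 1.0 / steps
    return hidden
```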
[441] HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction
Dahai Yu, Lin Jiang, Rongchao Xu, Guang Wang
Main category: cs.LG
TL;DR: HealthMamba: Uncertainty-aware spatiotemporal framework for healthcare facility visit prediction using Graph State Space Model with uncertainty quantification.
Details
Motivation: Existing healthcare facility visit prediction methods treat it as time-series forecasting without considering spatial dependencies between different facility types and lack reliability during abnormal situations like public emergencies.
Method: Three-component framework: 1) Unified Spatiotemporal Context Encoder for heterogeneous static/dynamic information fusion, 2) GraphMamba (Graph State Space Model) for hierarchical spatiotemporal modeling, 3) comprehensive uncertainty quantification module with three mechanisms.
Result: Evaluated on four large-scale real-world datasets from California, New York, Texas, and Florida, achieving ~6.0% improvement in prediction accuracy and ~3.5% improvement in uncertainty quantification over state-of-the-art baselines.
Conclusion: HealthMamba provides accurate and reliable healthcare facility visit prediction by effectively modeling spatiotemporal dependencies and quantifying uncertainty, advancing healthcare resource optimization and public health policy.
Abstract: Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.
[442] A Short and Unified Convergence Analysis of the SAG, SAGA, and IAG Algorithms
Feng Zhu, Robert W. Heath, Aritra Mitra
Main category: cs.LG
TL;DR: Unified convergence analysis for stochastic variance-reduced algorithms (SAG, SAGA, IAG) using novel Lyapunov function and delay bounds
Details
Motivation: Existing analyses for stochastic variance-reduced algorithms are disparate and rely on different proof techniques; SAG's original proof is notoriously involved and requires computer-aided analysis. There's a need for a unified, simpler analysis framework.
Method: Develops a single unified convergence analysis for SAG, SAGA, and IAG algorithms using: (1) bounds on delays due to stochastic sub-sampling using concentration tools, and (2) a novel Lyapunov function that accounts for such delays.
Result: Provides first high-probability bounds for SAG and SAGA that can be extended to non-convex objectives and Markov sampling. Also obtains best known rates for IAG algorithm, significantly improving prior bounds.
Conclusion: The unified analysis framework offers short, modular proofs that apply to multiple stochastic variance-reduced algorithms, enabling extensions to broader settings and improving theoretical understanding.
Abstract: Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
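For readers who have not met these methods, a textbook SAGA loop (one of the three algorithms covered by the analysis) looks like the following; the paper's contribution is the proof technique, not the algorithm itself.

```python
import numpy as np

def saga(grad_i, x0, n, steps, lr):
    """Textbook SAGA for minimising (1/n) * sum_i f_i(x).
    grad_i(i, x) returns the gradient of component f_i at x."""
    x = x0.copy()
    table = np.array([grad_i(i, x) for i in range(n)])   # stored past gradients
    avg = table.mean(axis=0)
    for _ in range(steps):
        i = np.random.randint(n)
        g_new = grad_i(i, x)
        # Variance-reduced estimate: new gradient minus stored one plus table mean.
        x -= lr * (g_new - table[i] + avg)
        avg += (g_new - table[i]) / n
        table[i] = g_new
    return x

# Example: least squares with f_i(x) = 0.5 * (a_i @ x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.normal(size=(100, 5)), rng.normal(size=100)
x_hat = saga(lambda i, x: (A[i] @ x - b[i]) * A[i], np.zeros(5), 100, 5000, lr=0.01)
```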
[443] Formal Synthesis of Certifiably Robust Neural Lyapunov-Barrier Certificates
Chengxiao Wang, Haoze Wu, Gagandeep Singh
Main category: cs.LG
TL;DR: Robust neural Lyapunov barrier certificates for verifying safety/stability of RL controllers under perturbed dynamics via Lipschitz-based conditions and adversarial training.
Details
Motivation: Existing neural Lyapunov/barrier certificates only guarantee safety/stability under ideal unperturbed dynamics, limiting reliability in real-world applications where system dynamics may deviate due to uncertainties.
Method: Define robust Lyapunov barrier functions with sufficient conditions based on Lipschitz continuity for robustness against bounded perturbations. Propose practical training objectives enforcing these conditions via adversarial training, Lipschitz neighborhood bound, and global Lipschitz regularization.
Result: Methods significantly improve both certified robustness bounds (up to 4.6×) and empirical success rates under strong perturbations (up to 2.4×) compared to baseline in Inverted Pendulum and 2D Docking environments.
Conclusion: Demonstrates effectiveness of training robust neural certificates for safe reinforcement learning under perturbations in dynamics, addressing a critical limitation of existing verification methods.
Abstract: Neural Lyapunov and barrier certificates have recently been used as powerful tools for verifying the safety and stability properties of deep reinforcement learning (RL) controllers. However, existing methods offer guarantees only under fixed ideal unperturbed dynamics, limiting their reliability in real-world applications where dynamics may deviate due to uncertainties. In this work, we study the problem of synthesizing robust neural Lyapunov barrier certificates that maintain their guarantees under perturbations in system dynamics. We formally define a robust Lyapunov barrier function and specify sufficient conditions based on Lipschitz continuity that ensure robustness against bounded perturbations. We propose practical training objectives that enforce these conditions via adversarial training, Lipschitz neighborhood bound, and global Lipschitz regularization. We validate our approach in two practically relevant environments, Inverted Pendulum and 2D Docking. The former is a widely studied benchmark, while the latter is a safety-critical task in autonomous systems. We show that our methods significantly improve both certified robustness bounds (up to $4.6$ times) and empirical success rates under strong perturbations (up to $2.4$ times) compared to the baseline. Our results demonstrate effectiveness of training robust neural certificates for safe RL under perturbations in dynamics.
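Of the three mechanisms, the global Lipschitz regularization is the simplest to sketch: bound the certificate network's Lipschitz constant by the product of per-layer spectral norms and add it to the training loss. The penalty weight, the product bound, and the `certificate_loss` placeholder below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def lipschitz_bound(net: nn.Sequential) -> torch.Tensor:
    """Differentiable upper bound on the global Lipschitz constant: product of
    spectral norms of the linear layers (valid when the activations, e.g.
    ReLU/Tanh, are themselves 1-Lipschitz)."""
    bound = torch.ones(())
    for layer in net:
        if isinstance(layer, nn.Linear):
            bound = bound * torch.linalg.matrix_norm(layer.weight, ord=2)
    return bound

def robust_objective(net, states, certificate_loss, lam: float = 1e-2):
    """Certificate-condition losses plus a Lipschitz penalty, so that bounded
    perturbations of the dynamics cannot easily invalidate the conditions."""
    return certificate_loss(net, states) + lam * lipschitz_bound(net)
```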
[444] Rewards as Labels: Revisiting RLVR from a Classification Perspective
Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen, Yuan Lu
Main category: cs.LG
TL;DR: REAL reformulates RL with verifiable rewards as a classification problem using rewards as categorical labels, addressing gradient issues in GRPO methods for improved mathematical reasoning performance.
Details
Motivation: Current RLVR methods like GRPO suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, leading to inefficient policy updates and suboptimal performance in complex reasoning tasks.
Method: Proposes Rewards as Labels (REAL) framework that treats verifiable rewards as categorical labels rather than scalar weights, reformulating policy optimization as classification. Introduces anchor logits to enhance policy learning, creating monotonic and bounded gradient weighting.
Result: REAL improves training stability and consistently outperforms GRPO and variants like DAPO on mathematical reasoning benchmarks. On 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%, and gains scale to 7B model with 6.2% and 1.7% improvements over DAPO and GSPO respectively.
Conclusion: REAL effectively addresses gradient issues in RLVR methods by reformulating reward-based optimization as classification, leading to more stable training and superior performance in complex reasoning tasks with large language models.
Abstract: Reinforcement Learning with Verifiable Rewards has recently advanced the capabilities of Large Language Models in complex reasoning tasks by providing explicit rule-based supervision. Among RLVR methods, GRPO and its variants have achieved strong empirical performance. Despite their success, we identify that they suffer from Gradient Misassignment in Positives and Gradient Domination in Negatives, which lead to inefficient and suboptimal policy updates. To address these issues, we propose Rewards as Labels (REAL), a novel framework that revisits verifiable rewards as categorical labels rather than scalar weights, thereby reformulating policy optimization as a classification problem. Building on this, we further introduce anchor logits to enhance policy learning. Our analysis reveals that REAL induces a monotonic and bounded gradient weighting, enabling balanced gradient allocation across rollouts and effectively mitigating the identified mismatches. Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO. On the 1.5B model, REAL improves average Pass@1 over DAPO by 6.7%. These gains further scale to 7B model, REAL continues to outperform DAPO and GSPO by 6.2% and 1.7%, respectively. Notably, even with a vanilla binary cross-entropy, REAL remains stable and exceeds DAPO by 4.5% on average.
[445] Accelerated Sequential Flow Matching: A Bayesian Filtering Perspective
Yinan Huang, Hans Hao-Hsun Hsu, Junran Wang, Bo Dai, Pan Li
Main category: cs.LG
TL;DR: Sequential Flow Matching: A Bayesian filtering framework for efficient real-time streaming inference that accelerates flow-based models by initializing generation from previous posterior distributions.
Details
Motivation: Current diffusion and flow-matching models for sequential prediction require repeated sampling from non-informative initial distributions, causing substantial inference latency and system backlogs in real-time streaming environments.
Method: Treats streaming inference as learning a probability flow that transports predictive distributions from one time step to the next, aligning with recursive Bayesian belief updates. Initializes generation from previous posterior distributions as a principled warm start.
Result: Achieves performance competitive with full-step diffusion models while requiring only one or very few sampling steps, resulting in significantly faster sampling across forecasting, decision-making, and state estimation tasks.
Conclusion: Framing sequential inference through Bayesian filtering provides a principled perspective for efficient real-time deployment of flow-based models, enabling faster sampling without sacrificing performance.
Abstract: Sequential prediction from streaming observations is a fundamental problem in stochastic dynamical systems, where inherent uncertainty often leads to multiple plausible futures. While diffusion and flow-matching models are capable of modeling complex, multi-modal trajectories, their deployment in real-time streaming environments typically relies on repeated sampling from a non-informative initial distribution, incurring substantial inference latency and potential system backlogs. In this work, we introduce Sequential Flow Matching, a principled framework grounded in Bayesian filtering. By treating streaming inference as learning a probability flow that transports the predictive distribution from one time step to the next, our approach naturally aligns with the recursive structure of Bayesian belief updates. We provide theoretical justification that initializing generation from the previous posterior offers a principled warm start that can accelerate sampling compared to naïve re-sampling. Across a wide range of forecasting, decision-making and state estimation tasks, our method achieves performance competitive with full-step diffusion while requiring only one or very few sampling steps, therefore with faster sampling. It suggests that framing sequential inference via Bayesian filtering provides a new and principled perspective towards efficient real-time deployment of flow-based models.
[446] GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
Zifan Liu, Xinran Li, Shibo Chen, Jun Zhang
Main category: cs.LG
TL;DR: GAS is a novel offline safe RL algorithm that enhances stitching of suboptimal trajectories and balances reward-cost tradeoffs using goal functions and dataset augmentation.
Details
Motivation: Current GM-assisted offline safe RL methods struggle with stitching optimal transitions from suboptimal trajectories and balancing conflicting reward and cost targets, limiting their effectiveness in constrained decision-making.
Method: Proposes Goal-Assisted Stitching (GAS) with: 1) Transition-level dataset augmentation and relabeling to enhance stitching capability, 2) Goal functions trained via expectile regression to estimate optimal achievable reward/cost goals, 3) Dataset reshaping for uniform reward-cost distribution to improve training stability.
Result: Empirical results show GAS achieves superior performance in balancing reward maximization and constraint satisfaction compared to existing offline safe RL methods.
Conclusion: GAS effectively addresses key challenges in GM-assisted offline safe RL by improving stitching capabilities and achieving better reward-cost tradeoffs through goal estimation and dataset augmentation techniques.
Abstract: Offline Safe Reinforcement Learning (OSRL) aims to learn a policy to achieve high performance in sequential decision-making while satisfying constraints, using only pre-collected datasets. Recent works, inspired by the strong capabilities of Generative Models (GMs), reformulate decision-making in OSRL as a conditional generative process, where GMs generate desirable actions conditioned on predefined reward and cost values. However, GM-assisted methods face two major challenges in OSRL: (1) lacking the ability to "stitch" optimal transitions from suboptimal trajectories within the dataset, and (2) struggling to balance reward targets with cost targets, particularly when they conflict. To address these issues, we propose Goal-Assisted Stitching (GAS), a novel algorithm designed to enhance stitching capabilities while effectively balancing reward maximization and constraint satisfaction. To enhance the stitching ability, GAS first augments and relabels the dataset at the transition level, enabling the construction of high-quality trajectories from suboptimal ones. GAS also introduces novel goal functions, which estimate the optimal achievable reward and cost goals from the dataset. These goal functions, trained using expectile regression on the relabeled and augmented dataset, allow GAS to accommodate a broader range of reward-cost return pairs and achieve a better tradeoff between reward maximization and constraint satisfaction compared to human-specified values. The estimated goals then guide policy training, ensuring robust performance under constrained settings. Furthermore, to improve training stability and efficiency, we reshape the dataset to achieve a more uniform reward-cost return distribution. Empirical results validate the effectiveness of GAS, demonstrating superior performance in balancing reward maximization and constraint satisfaction compared to existing methods.
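The expectile-regression ingredient is standard and easy to state; below is a minimal sketch in which the reward goal uses a high expectile and the cost goal a low one. The specific tau values and the use of a low expectile for costs are assumptions for illustration.

```python
import torch

def expectile_loss(pred: torch.Tensor, target: torch.Tensor, tau: float) -> torch.Tensor:
    """Asymmetric squared loss: tau near 1 tracks the upper tail of the targets
    (an optimistic, near-best achievable value), tau near 0 the lower tail."""
    diff = target - pred
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

def goal_losses(reward_goal, cost_goal, states, reward_returns, cost_returns):
    """Optimistic reward goal and conservative cost goal, fit on returns from
    the relabeled/augmented offline dataset."""
    return (expectile_loss(reward_goal(states), reward_returns, tau=0.9) +
            expectile_loss(cost_goal(states), cost_returns, tau=0.1))
```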
[447] Pool-based Active Learning as Noisy Lossy Compression: Characterizing Label Complexity via Finite Blocklength Analysis
Kosuke Sugiyama, Masato Uchida
Main category: cs.LG
TL;DR: Information-theoretic framework for pool-based active learning, treating it as noisy lossy compression to derive theoretical limits on label complexity and generalization error.
Details
Motivation: To establish theoretical limits for pool-based active learning by developing an information-theoretic framework that connects data selection and learning processes.
Method: Reformulates pool-based AL as a noisy lossy compression problem, mapping pool observations to noisy symbols, data selection to compression, and learning to decoding. Applies finite blocklength analysis to derive information-theoretic lower bounds.
Result: Derived information-theoretic lower bounds on label complexity and generalization error that include terms reflecting overfitting and discrepancy between inductive bias and target task.
Conclusion: Provides a new theoretical perspective on pool-based AL by connecting it to information theory and stability theory, offering fundamental limits for learning algorithms under optimal data selection.
Abstract: This paper proposes an information-theoretic framework for analyzing the theoretical limits of pool-based active learning (AL), in which a subset of instances is selectively labeled. The proposed framework reformulates pool-based AL as a noisy lossy compression problem by mapping pool observations to noisy symbol observations, data selection to compression, and learning to decoding. This correspondence enables a unified information-theoretic analysis of data selection and learning in pool-based AL. Applying finite blocklength analysis of noisy lossy compression, we derive information-theoretic lower bounds on label complexity and generalization error that serve as theoretical limits for a given learning algorithm under its associated optimal data selection strategy. Specifically, our bounds include terms that reflect overfitting induced by the learning algorithm and the discrepancy between its inductive bias and the target task, and are closely related to established information-theoretic bounds and stability theory, which have not been previously applied to the analysis of pool-based AL. These properties yield a new theoretical perspective on pool-based AL.
[448] Smoothness Errors in Dynamics Models and How to Avoid Them
Edward Berman, Luisa Li, Jung Yeon Park, Robin Walters
Main category: cs.LG
TL;DR: Relaxed unitary graph convolutions for PDEs on surfaces balance smoothness preservation with natural physical smoothing, outperforming unitary convolutions and other baselines on mesh-based tasks.
Details
Motivation: Graph neural networks for PDEs on surfaces suffer from oversmoothing, but unitary convolutions that preserve smoothness may be overconstraining for physical systems where smoothness naturally increases (e.g., diffusion). Need to balance smoothness preservation with natural physical smoothing.
Method: Propose relaxed unitary convolutions that balance smoothness preservation with natural smoothing required for physical systems. Generalize unitary and relaxed unitary convolutions from graphs to meshes for surface PDE modeling.
Result: Outperforms several strong baselines including mesh-aware transformers and equivariant neural networks on PDE tasks (heat and wave equations) over complex meshes and weather forecasting.
Conclusion: Relaxed unitary convolutions provide better balance for physical system modeling than strict unitary convolutions, with improved performance on mesh-based PDE tasks and weather forecasting.
Abstract: Modern neural networks have shown promise for solving partial differential equations over surfaces, often by discretizing the surface as a mesh and learning with a mesh-aware graph neural network. However, graph neural networks suffer from oversmoothing, where a node’s features become increasingly similar to those of its neighbors. Unitary graph convolutions, which are mathematically constrained to preserve smoothness, have been proposed to address this issue. Despite this, in many physical systems, such as diffusion processes, smoothness naturally increases and unitarity may be overconstraining. In this paper, we systematically study the smoothing effects of different GNNs for dynamics modeling and prove that unitary convolutions hurt performance for such tasks. We propose relaxed unitary convolutions that balance smoothness preservation with the natural smoothing required for physical systems. We also generalize unitary and relaxed unitary convolutions from graphs to meshes. In experiments on PDEs such as the heat and wave equations over complex meshes and on weather forecasting, we find that our method outperforms several strong baselines, including mesh-aware transformers and equivariant neural networks.
[449] Bayesian Neighborhood Adaptation for Graph Neural Networks
Paribesh Regmi, Rui Li, Kishan K C
Main category: cs.LG
TL;DR: Bayesian framework for adaptive neighborhood scope selection in GNNs using beta process modeling
Details
Motivation: Current GNNs require manual specification of neighborhood scope (number of hops), which is time-consuming and biased. Need adaptive method for both homophilic and heterophilic graphs.Method: Model GNN message-passing as stochastic process using beta process to treat number of hops as random variable. Bayesian framework infers optimal neighborhood scope simultaneously with GNN parameter optimization.
Result: Method improves GNN expressivity theoretically. Achieves competitive/superior performance on node classification across homophilic and heterophilic benchmark datasets. Provides well-calibrated predictions.
Conclusion: Proposed Bayesian framework enables adaptive neighborhood scope selection, enhancing GNN performance and calibration across diverse graph types.
Abstract: The neighborhood scope (i.e., number of hops) where graph neural networks (GNNs) aggregate information to characterize a node’s statistical property is critical to GNNs’ performance. Two-stage approaches, which train and validate GNNs for every pre-specified neighborhood scope to search for the best setting, are time-consuming and tend to be biased due to the search space design. How to adaptively determine proper neighborhood scopes for the aggregation process on both homophilic and heterophilic graphs remains largely unexplored. We thus propose to model the GNNs’ message-passing behavior on a graph as a stochastic process by treating the number of hops as a beta process. This Bayesian framework allows us to infer the most plausible neighborhood scope for message aggregation simultaneously with the optimization of GNN parameters. Our theoretical analysis shows that the scope inference improves the expressivity of a GNN. Experiments on benchmark homophilic and heterophilic datasets show that the proposed method is compatible with state-of-the-art GNN variants, achieving competitive or superior performance on the node classification task, and providing well-calibrated predictions.
[450] DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
Xu Wang, Bingqing Jiang, Yu Wan, Baosong Yang, Lingpeng Kong, Difan Zou
Main category: cs.LG
TL;DR: First SAE-based interpretability framework for diffusion language models (DLMs) showing SAEs can extract interpretable features and enable effective interventions, with different behaviors than in autoregressive LLMs.
Details
Motivation: As diffusion language models become a promising alternative to autoregressive LLMs, there's a need for tailored mechanistic interpretability tools for this emerging class of models.Method: Developed DLM-Scope, the first SAE-based interpretability framework for DLMs, using trained Top-K sparse autoencoders to extract features and test interventions.
Result: SAEs can faithfully extract interpretable features from DLMs; SAE insertion affects DLMs differently than LLMs (can reduce loss in early layers); SAE features enable effective diffusion-time interventions; SAEs provide useful signals for DLM decoding order; features are stable during post-training.
Conclusion: Establishes foundation for mechanistic interpretability in DLMs and shows great potential for applying SAEs to DLM-related tasks and algorithms.
Abstract: Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. Recently, as diffusion language models (DLMs) have become an increasingly promising alternative to autoregressive LLMs, it is essential to develop tailored mechanistic interpretability tools for this emerging class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, we find that inserting SAEs affects DLMs differently than autoregressive LLMs: while SAE insertion in LLMs typically incurs a loss penalty, in DLMs it can reduce cross-entropy loss when applied to early layers, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer several new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order, and that SAE features are stable during the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and shows great potential for applying SAEs to DLM-related tasks and algorithms.
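The Top-K SAE at the heart of DLM-Scope admits a compact sketch. The version below is a minimal, generic Top-K sparse autoencoder trained by reconstruction on cached hidden states; the dictionary size, k value, and bias handling are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder for probing hidden activations.

    Dictionary size, k, and the pre-encoder bias are illustrative choices;
    the paper's exact configuration is not specified in this digest.
    """
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        self.pre_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, h: torch.Tensor):
        # Encode, then keep only the k largest feature activations per token.
        z = torch.relu(self.encoder(h - self.pre_bias))
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        h_hat = self.decoder(z_sparse) + self.pre_bias
        return h_hat, z_sparse

# Training objective: plain reconstruction error on cached DLM activations.
sae = TopKSAE(d_model=768, d_dict=768 * 16, k=32)
h = torch.randn(4, 128, 768)            # stand-in for a batch of hidden states
h_hat, z = sae(h)
loss = torch.mean((h_hat - h) ** 2)
loss.backward()
```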
[451] Hinge Regression Tree: A Newton Method for Oblique Regression Tree Splitting
Hongyi Li, Han Lin, Jun Xu
Main category: cs.LG
TL;DR: HRT (Hinge Regression Tree) is a novel oblique decision tree method that reframes splits as non-linear least-squares problems using two linear predictors with max/min envelope for ReLU-like expressive power, achieving fast convergence and compact structures.
Details
Motivation: Oblique decision trees offer better decision boundaries than axis-aligned trees but learning high-quality oblique splits is NP-hard. Current methods rely on slow search or theory-free heuristics, creating a need for principled, efficient optimization methods for oblique splits.Method: HRT frames each split as a non-linear least-squares problem over two linear predictors whose max/min envelope provides ReLU-like expressive power. Uses alternating fitting procedure equivalent to damped Newton (Gauss-Newton) method within fixed partitions, with backtracking line-search for monotonic convergence.
Result: HRT achieves fast, stable convergence with both fixed and adaptive damping. Proves HRT’s model class is a universal approximator with explicit O(δ²) approximation rate. Outperforms single-tree baselines with more compact structures on synthetic and real-world benchmarks.
Conclusion: HRT provides a principled optimization framework for learning oblique decision trees with theoretical guarantees, efficient convergence, and compact structures, addressing limitations of existing oblique tree methods.
Abstract: Oblique decision trees combine the transparency of trees with the power of multivariate decision boundaries, but learning high-quality oblique splits is NP-hard, and practical methods still rely on slow search or theory-free heuristics. We present the Hinge Regression Tree (HRT), which reframes each split as a non-linear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like expressive power. The resulting alternating fitting procedure is exactly equivalent to a damped Newton (Gauss-Newton) method within fixed partitions. We analyze this node-level optimization and, for a backtracking line-search variant, prove that the local objective decreases monotonically and converges; in practice, both fixed and adaptive damping yield fast, stable convergence and can be combined with optional ridge regularization. We further prove that HRT’s model class is a universal approximator with an explicit $O(δ^2)$ approximation rate, and show on synthetic and real-world benchmarks that it matches or outperforms single-tree baselines with more compact structures.
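To make the split model concrete, here is a simplified alternating least-squares sketch of fitting the max-of-two-affine-pieces envelope at a single node. The paper's actual solver is a damped (Gauss-)Newton update with backtracking line search and optional ridge regularization; the plain alternating refit below only illustrates the structure of the problem, and all names are illustrative.

```python
import numpy as np

def fit_hinge_node(X, y, iters=20, ridge=1e-3, seed=0):
    """Fit f(x) = max(x·w1 + b1, x·w2 + b2) at one tree node.

    Simplified alternating least-squares sketch of the split model; the paper
    uses a damped (Gauss-)Newton update with backtracking line search instead.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])          # append a bias column
    W = rng.normal(scale=0.1, size=(2, d + 1))    # two linear predictors

    for _ in range(iters):
        preds = Xb @ W.T                          # (n, 2)
        active = preds.argmax(axis=1)             # which piece sits on the envelope
        for j in (0, 1):
            idx = active == j
            if idx.sum() < d + 1:
                continue                          # keep previous fit if a piece is starved
            A = Xb[idx]
            # Ridge-regularised least squares on the samples owned by piece j.
            W[j] = np.linalg.solve(A.T @ A + ridge * np.eye(d + 1), A.T @ y[idx])
    return W

# Toy usage: a ReLU-like target is recovered by the max of two affine pieces.
X = np.random.default_rng(1).uniform(-2, 2, size=(500, 1))
y = np.maximum(0.0, 1.5 * X[:, 0]) + 0.05 * np.random.default_rng(2).normal(size=500)
W = fit_hinge_node(X, y)
```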
[452] Constrained Group Relative Policy Optimization
Roger Girgis, Rodrigue de Schaetzen, Luke Rowe, Azalée Robitaille, Christopher Pal, Liam Paull
Main category: cs.LG
TL;DR: Constrained GRPO extends Group Relative Policy Optimization with Lagrangian constraints for embodied AI, fixing advantage estimation issues to enable stable constraint control in robotics tasks.
Details
Motivation: Extend GRPO to constrained policy optimization settings, addressing the challenge of maintaining proper trade-offs between reward and constraint terms when using Lagrangian methods with multiple objective components.Method: Introduces Constrained GRPO with Lagrangian relaxation using indicator cost functions. Identifies that naive multi-component advantage estimation breaks constraint learning due to mismatched standard deviations, and proposes scalarized advantage construction to preserve intended trade-offs.
Result: Experiments in toy gridworld confirm the optimization pathology and show scalarizing advantages restores stable constraint control. In robotics tasks, Constrained GRPO improves constraint satisfaction while increasing task success.
Conclusion: Provides a simple and effective recipe for constrained policy optimization in embodied AI domains that rely on large multimodal foundation models, addressing critical issues in advantage estimation for Lagrangian methods.
Abstract: While Group Relative Policy Optimization (GRPO) has emerged as a scalable framework for critic-free policy learning, extending it to settings with explicit behavioral constraints remains underexplored. We introduce Constrained GRPO, a Lagrangian-based extension of GRPO for constrained policy optimization. Constraints are specified via indicator cost functions, enabling direct optimization of violation rates through a Lagrangian relaxation. We show that a naive multi-component treatment in advantage estimation can break constrained learning: mismatched component-wise standard deviations distort the relative importance of the different objective terms, which in turn corrupts the Lagrangian signal and prevents meaningful constraint enforcement. We formally derive this effect to motivate our scalarized advantage construction that preserves the intended trade-off between reward and constraint terms. Experiments in a toy gridworld confirm the predicted optimization pathology and demonstrate that scalarizing advantages restores stable constraint control. In addition, we evaluate Constrained GRPO on robotics tasks, where it improves constraint satisfaction while increasing task success, establishing a simple and effective recipe for constrained policy optimization in embodied AI domains that increasingly rely on large multimodal foundation models.
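The advantage pathology and its fix can be illustrated in a few lines. In the sketch below, normalizing the reward and cost components separately rescales each by its own group standard deviation, distorting the trade-off set by the Lagrange multiplier, whereas normalizing the scalarized return preserves it; the toy numbers and the multiplier value are illustrative assumptions.

```python
import numpy as np

def group_norm(x, eps=1e-8):
    return (x - x.mean()) / (x.std() + eps)

# One GRPO group: rewards and indicator costs (1 = constraint violated) for G rollouts.
rewards = np.array([1.0, 0.8, 0.2, 0.9])
costs   = np.array([0.0, 1.0, 0.0, 1.0])
lam = 0.5                                   # current Lagrange multiplier (illustrative)

# Naive multi-component advantage: each term is normalized by its own group std,
# so the effective reward/cost trade-off no longer matches lam.
adv_naive = group_norm(rewards) - lam * group_norm(costs)

# Scalarized advantage: combine reward and constraint first, then normalize once,
# which preserves the trade-off encoded by the Lagrangian.
adv_scalarized = group_norm(rewards - lam * costs)

print(adv_naive, adv_scalarized)
```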
[453] Erase at the Core: Representation Unlearning for Machine Unlearning
Jaewon Lee, Yongwoo Kim, Donghyun Kim
Main category: cs.LG
TL;DR: EC framework addresses superficial forgetting in machine unlearning by enforcing forgetting throughout network hierarchy using multi-layer contrastive unlearning and deeply supervised learning.
Details
Motivation: Current machine unlearning methods show strong logit-level forgetting but preserve substantial information in internal feature representations (superficial forgetting), primarily altering only the final classifier while leaving intermediate representations unchanged.Method: Erase at the Core (EC) integrates multi-layer contrastive unlearning on forget set with retain set preservation through deeply supervised learning. Attaches auxiliary modules to intermediate layers and applies both contrastive unlearning and cross-entropy losses at each supervision point with layer-wise weighted losses.
Result: EC achieves effective logit-level forgetting while substantially reducing representational similarity to original model across intermediate layers. It’s model-agnostic and can be incorporated as plug-in module into existing unlearning methods, improving representation-level forgetting while maintaining retain set performance.
Conclusion: EC framework effectively addresses superficial forgetting by enforcing forgetting throughout entire network hierarchy, providing more comprehensive unlearning while preserving model utility on retained data.
Abstract: Many approximate machine unlearning methods demonstrate strong logit-level forgetting – such as near-zero accuracy on the forget set – yet continue to preserve substantial information within their internal feature representations. We refer to this discrepancy as superficial forgetting. Recent studies indicate that most existing unlearning approaches primarily alter the final classifier, leaving intermediate representations largely unchanged and highly similar to those of the original model. To address this limitation, we introduce the Erase at the Core (EC), a framework designed to enforce forgetting throughout the entire network hierarchy. EC integrates multi-layer contrastive unlearning on the forget set with retain set preservation through deeply supervised learning. Concretely, EC attaches auxiliary modules to intermediate layers and applies both contrastive unlearning and cross-entropy losses at each supervision point, with layer-wise weighted losses. Experimental results show that EC not only achieves effective logit-level forgetting, but also substantially reduces representational similarity to the original model across intermediate layers. Furthermore, EC is model-agnostic and can be incorporated as a plug-in module into existing unlearning methods, improving representation-level forgetting while maintaining performance on the retain set.
[454] Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
Main category: cs.LG
TL;DR: A reinforcement learning framework called KernelGYM for training LLMs to generate high-performance GPU kernels, with methods to address reward hacking and lazy optimization, achieving competitive performance with state-of-the-art models.
Details
Motivation: High-quality kernel code is critical for scalable AI systems, but training LLMs for kernel generation faces challenges including insufficient data, lack of robust environments, vulnerability to reward hacking, and lazy optimization where models prioritize trivial correctness over meaningful speedup.Method: Developed KernelGYM, a distributed GPU environment supporting reward hacking checks and multi-turn RL training. Proposed TRLOO (Turn-level Reinforce-Leave-One-Out) to address biased policy gradients in multi-turn RL, incorporated mismatch correction for stability, and introduced Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome lazy optimization.
Result: Dr.Kernel-14B achieves competitive performance with Claude-4.5-Sonnet on Kernelbench. On KernelBench Level-2 subset, 31.6% of generated kernels achieve ≥1.2x speedup over Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting best candidate across all turns, the 1.2x speedup rate increases to 47.8%.
Conclusion: The proposed RL framework effectively trains LLMs for high-performance kernel generation, addressing key challenges like reward hacking and lazy optimization, and demonstrates superior performance compared to leading commercial models.
Abstract: High-quality kernel code is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue. The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including environment, training code, models, and dataset, are included in https://www.github.com/hkust-nlp/KernelGYM.
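The leave-one-out advantage that TRLOO is named after can be sketched as follows; applying it per turn and the variable names are assumptions based on the RLOO idea the abstract references, not the paper's exact estimator.

```python
import numpy as np

def rloo_advantages(returns):
    """Leave-one-out advantages: each sample is baselined by the mean of the others.

    This avoids the self-inclusion bias of baselining a rollout with a group mean
    that contains its own return. Applying it per turn (as TRLOO's name suggests)
    is an assumption of this sketch.
    """
    returns = np.asarray(returns, dtype=float)
    g = len(returns)
    loo_mean = (returns.sum() - returns) / (g - 1)
    return returns - loo_mean

# Returns collected for the same prompt/turn across a group of rollouts.
turn_returns = [1.2, 0.4, 0.9, 0.1]
print(rloo_advantages(turn_returns))   # sums to ~0 up to floating point
```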
[455] A Decomposition-based State Space Model for Multivariate Time-Series Forecasting
Shunya Nagashima, Shuntaro Suzuki, Shuitsu Koyama, Shinnosuke Hirano
Main category: cs.LG
TL;DR: DecompSSM: A multivariate time series forecasting framework using three parallel deep state space models to separately capture trend, seasonal, and residual components with adaptive temporal scales and cross-variable context refinement.
Details
Motivation: Real-world multivariate time series contain intertwined slow trends, multi-rate seasonalities, and irregular residuals. Existing methods use rigid decompositions or generic end-to-end architectures that entangle components and underuse structure shared across variables.Method: Proposes DecompSSM with three parallel deep state space model branches for trend, seasonal, and residual components. Features adaptive temporal scales via input-dependent predictor, refinement module for shared cross-variable context, and auxiliary loss enforcing reconstruction and orthogonality.
Result: Outperformed strong baselines across standard benchmarks (ECL, Weather, ETTm2, and PEMS04), demonstrating effectiveness of combining component-wise deep state space models with global context refinement.
Conclusion: DecompSSM effectively addresses limitations of existing methods by combining component-wise decomposition with deep state space models and cross-variable context refinement for improved multivariate time series forecasting.
Abstract: Multivariate time series (MTS) forecasting is crucial for decision-making in domains such as weather, energy, and finance. It remains challenging because real-world sequences intertwine slow trends, multi-rate seasonalities, and irregular residuals. Existing methods often rely on rigid, hand-crafted decompositions or generic end-to-end architectures that entangle components and underuse structure shared across variables. To address these limitations, we propose DecompSSM, an end-to-end decomposition framework using three parallel deep state space model branches to capture trend, seasonal, and residual components. The model features adaptive temporal scales via an input-dependent predictor, a refinement module for shared cross-variable context, and an auxiliary loss that enforces reconstruction and orthogonality. Across standard benchmarks (ECL, Weather, ETTm2, and PEMS04), DecompSSM outperformed strong baselines, indicating the effectiveness of combining component-wise deep state space models and global context refinement.
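The auxiliary loss enforcing reconstruction and orthogonality could look roughly like the sketch below; the squared normalized inner products between component pairs and the weighting are assumptions, since the digest does not spell out the exact penalty.

```python
import torch

def decomposition_aux_loss(x, trend, seasonal, residual, w_ortho=0.1, eps=1e-8):
    """Auxiliary loss combining reconstruction with pairwise orthogonality.

    The exact penalty used by DecompSSM is not given in this digest; squared
    normalized inner products between component pairs are an assumption.
    Tensors are (batch, length, variables).
    """
    recon = torch.mean((x - (trend + seasonal + residual)) ** 2)

    def ortho(a, b):
        a = a.flatten(1)
        b = b.flatten(1)
        num = (a * b).sum(dim=1)
        den = a.norm(dim=1) * b.norm(dim=1) + eps
        return torch.mean((num / den) ** 2)

    ortho_pen = ortho(trend, seasonal) + ortho(trend, residual) + ortho(seasonal, residual)
    return recon + w_ortho * ortho_pen

# Toy usage with random stand-ins for the three branch outputs.
x = torch.randn(8, 96, 7)
t, s, r = torch.randn_like(x), torch.randn_like(x), torch.randn_like(x)
loss = decomposition_aux_loss(x, t, s, r)
```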
[456] DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training
Dingwei Zhu, Zhiheng Xi, Shihan Dou, Jiahan Li, Chenhao Huang, Junjie Ye, Sixian Li, Mingxu Chai, Yuhui Wang, Yajie Yang, Ming Zhang, Jiazheng Zhang, Shichun Liu, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Main category: cs.LG
TL;DR: DFPO is a robust distributional RL framework that models values as continuous flows across time steps instead of isolated quantile predictions, improving training stability and generalization under noisy supervision.
Details
Motivation: Training RL systems in real-world environments faces challenges with noisy supervision and poor out-of-domain generalization, especially in LLM post-training. Existing distributional RL methods model values with multiple quantile points but learn each independently as scalars, resulting in rough-grained value representations that lack fine-grained conditioning on state information.Method: DFPO models values as continuous flows across time steps by learning a value flow field instead of isolated quantile predictions. It integrates conditional risk control and consistency constraints along value flow trajectories to stabilize training under noisy feedback.
Result: Experiments on dialogue, math reasoning, and scientific tasks show DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
Conclusion: DFPO provides a robust distributional RL framework that captures richer state information through continuous value flow modeling, enabling better advantage estimation and improved performance in noisy real-world environments.
Abstract: Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in rough-grained value representations that lack fine-grained conditioning on state information, struggling under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage estimation. To stabilize training under noisy feedback, DFPO further integrates conditional risk control and consistency constraints along value flow trajectories. Experiments on dialogue, math reasoning, and scientific tasks show that DFPO outperforms PPO, FlowRL, and other robust baselines under noisy supervision, achieving improved training stability and generalization.
[457] Assessing Electricity Demand Forecasting with Exogenous Data in Time Series Foundation Models
Wei Soon Cheong, Lian Lian Jiang, Jamie Ng Suat Ling
Main category: cs.LG
TL;DR: Empirical evaluation of time-series foundation models for electricity demand forecasting shows variable effectiveness, with simple baselines sometimes outperforming foundation models, especially in stable climates.
Details
Motivation: To evaluate whether time-series foundation models can effectively leverage exogenous features for electricity demand forecasting, which is critical for accurate predictions in energy markets.Method: Empirical evaluation of foundation models (MOIRAI, MOMENT, TinyTimeMixers, ChronosX, Chronos-2) against baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities, using three feature configurations.
Result: Chronos-2 performed best among foundation models in zero-shot settings, but simple baseline LSTM frequently outperformed all foundation models in Singapore’s stable climate, especially for short-term horizons. Model architecture and geographic context were critical factors.
Conclusion: Foundation models don’t universally outperform simpler models for electricity forecasting; domain-specific models are needed, especially in energy domain. Geographic context and model architecture significantly impact performance.
Abstract: Time-series foundation models have emerged as a new paradigm for forecasting, yet their ability to effectively leverage exogenous features – critical for electricity demand forecasting – remains unclear. This paper empirically evaluates foundation models capable of modeling cross-channel correlations against a baseline LSTM with reversible instance normalization across Singaporean and Australian electricity markets at hourly and daily granularities. We systematically assess MOIRAI, MOMENT, TinyTimeMixers, ChronosX, and Chronos-2 under three feature configurations: all features, selected features, and target-only. Our findings reveal highly variable effectiveness: while Chronos-2 achieves the best performance among foundation models (in zero-shot settings), the simple baseline frequently outperforms all foundation models in Singapore’s stable climate, particularly for short-term horizons. Model architecture proves critical, with synergistic architectural implementations (TTM’s channel-mixing, Chronos-2’s grouped attention) consistently leveraging exogenous features, while other approaches show inconsistent benefits. Geographic context emerges as equally important, with foundation models demonstrating advantages primarily in variable climates. These results challenge assumptions about universal foundation model superiority and highlight the need for domain-specific models, specifically in the energy domain.
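The baseline referenced here, an LSTM wrapped in reversible instance normalization, relies on a small normalization module that is easy to sketch; the learnable affine parameters and the forecasting placeholder in the usage lines are illustrative choices, not details taken from this evaluation.

```python
import torch
import torch.nn as nn

class RevIN(nn.Module):
    """Reversible instance normalization used around a forecasting backbone.

    Each series is normalized by its own per-channel statistics before the model
    and de-normalized after, which helps under distribution shift. Including
    learnable affine parameters is a common choice, assumed here.
    """
    def __init__(self, n_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(n_channels))
        self.beta = nn.Parameter(torch.zeros(n_channels))

    def normalize(self, x):                      # x: (batch, time, channels)
        self.mean = x.mean(dim=1, keepdim=True)
        self.std = x.std(dim=1, keepdim=True) + self.eps
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y):                    # y: model output in normalized space
        return (y - self.beta) / self.gamma * self.std + self.mean

# Hypothetical usage around an LSTM forecaster (names and data are illustrative).
revin = RevIN(n_channels=1)
x = torch.randn(16, 168, 1)                      # a week of hourly demand, toy data
x_norm = revin.normalize(x)
# ... forecast in normalized space with an LSTM ...
y_norm = x_norm[:, -24:, :]                      # placeholder "forecast"
y = revin.denormalize(y_norm)
```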
[458] Robust Federated Learning via Byzantine Filtering over Encrypted Updates
Adda Akram Bendoukha, Aymen Boudguiga, Nesrine Kaaniche, Renaud Sirdey, Didem Demirag, Sébastien Gambs
Main category: cs.LG
TL;DR: A federated learning approach combining homomorphic encryption for privacy-preserving aggregation with property-inference-inspired meta-classifiers for Byzantine filtering.
Details
Motivation: Federated Learning faces privacy and security challenges including inference attacks and Byzantine behaviors. Existing solutions often address secure aggregation and Byzantine resilience independently, leaving a gap for integrated solutions.Method: 1) Train filtering meta-classifiers on labeled shadow updates to detect Byzantine attacks (backdoor, gradient-inversion, label-flipping, shuffling). 2) Use meta-classifier outputs to reweight and cancel Byzantine encrypted updates. 3) Automated method for selecting optimal kernel and dimensionality hyperparameters for homomorphic inference over CKKS cryptosystem.
Result: Achieves 90-94% accuracy for identifying Byzantine updates with marginal losses in model utility. Encrypted inference runtimes range from 6-24 seconds for inference and 9-26 seconds for overall aggregation across FEMNIST, CIFAR10, GTSRB, and acsincome benchmarks.
Conclusion: Proposed approach effectively combines privacy-preserving homomorphic encryption with Byzantine filtering using property-inference-inspired meta-classifiers, addressing both security and privacy challenges in federated learning.
Abstract: Federated Learning (FL) aims to train a collaborative model while preserving data privacy. However, the distributed nature of this approach still raises privacy and security issues, such as the exposure of sensitive data due to inference attacks and the influence of Byzantine behaviors on the trained model. In particular, achieving both secure aggregation and Byzantine resilience remains challenging, as existing solutions often address these aspects independently. In this work, we propose to address these challenges through a novel approach that combines homomorphic encryption for privacy-preserving aggregation with property-inference-inspired meta-classifiers for Byzantine filtering. First, following the property-inference attacks blueprint, we train a set of filtering meta-classifiers on labeled shadow updates, reproducing a diverse ensemble of Byzantine misbehaviors in FL, including backdoor, gradient-inversion, label-flipping and shuffling attacks. The outputs of these meta-classifiers are then used to cancel the Byzantine encrypted updates by reweighting. Second, we propose an automated method for selecting the optimal kernel and the dimensionality hyperparameters with respect to homomorphic inference, aggregation constraints and efficiency over the CKKS cryptosystem. Finally, we demonstrate through extensive experiments the effectiveness of our approach against Byzantine participants on the FEMNIST, CIFAR10, GTSRB, and acsincome benchmarks. More precisely, our SVM filtering achieves accuracies between $90$% and $94$% for identifying Byzantine updates at the cost of only marginal losses in model utility, with encrypted inference runtimes ranging from $6$ to $24$ seconds and overall aggregation times ranging from $9$ to $26$ seconds.
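The reweighting rule driven by the meta-classifier scores can be sketched in plaintext; in the actual system this aggregation is carried out homomorphically over CKKS ciphertexts, and the threshold and score semantics below are assumptions.

```python
import numpy as np

def filtered_aggregate(updates, byzantine_scores, threshold=0.5):
    """Weighted aggregation that cancels suspected Byzantine updates.

    `byzantine_scores` stand in for the outputs of the trained meta-classifiers
    (probability that an update is Byzantine). In the paper this reweighting is
    carried out over encrypted updates; this plaintext version only illustrates
    the aggregation rule.
    """
    updates = np.asarray(updates, dtype=float)           # (clients, params)
    scores = np.asarray(byzantine_scores, dtype=float)
    weights = np.where(scores > threshold, 0.0, 1.0 - scores)
    if weights.sum() == 0:
        raise ValueError("all updates were filtered out")
    weights = weights / weights.sum()
    return weights @ updates

# Three honest clients and one whose update was flagged by the meta-classifier.
updates = [[0.1, 0.2], [0.12, 0.18], [0.09, 0.22], [5.0, -4.0]]
scores = [0.05, 0.10, 0.08, 0.97]
print(filtered_aggregate(updates, scores))
```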
[459] BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs
Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela
Main category: cs.LG
TL;DR: A tournament graph framework for efficient k-wise LLM reranking that aggregates pairwise preferences from document comparisons to reduce token usage while maintaining accuracy.
Details
Motivation: Existing LLM reranking methods are either heuristic-based and don't fully exploit ranking information, or inefficient when they do try to use more information. There's a need for a principled approach that can efficiently leverage the information revealed during document comparisons.Method: Proposes a tournament graph framework where each k-document comparison reveals a complete tournament of pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields additional orderings without further model invocations. The method includes formal criteria for when a candidate’s rank is certifiably determined, a query schedule that greedily maximizes information gain for identifying top-m items, and handles non-transitive preferences by collapsing cycles into equivalence classes for tiered rankings.
Result: Across 14 benchmarks and 5 LLMs, the method achieves Pareto dominance: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable approaches, and 7× fewer tokens than pairwise methods at near-identical quality.
Conclusion: The tournament graph framework provides a principled foundation for efficient k-wise reranking that significantly reduces computational costs while maintaining or improving ranking quality, offering a better trade-off between accuracy and efficiency for LLM-based retrieval-augmented generation.
Abstract: Large language models have emerged as powerful zero-shot rerankers for retrieval-augmented generation, offering strong generalization without task-specific training. However, existing LLM reranking methods either rely on heuristics that fail to fully exploit the information revealed by each ranking decision or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise reranking. Our key observation is that each $k$-document comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences. These tournaments are aggregated into a global preference graph, whose transitive closure yields many additional orderings without further model invocations. We formalize when a candidate’s rank is certifiably determined and design a query schedule that greedily maximizes information gain towards identifying the top-$m$ items. Our framework also gracefully handles non-transitive preferences - cycles induced by LLM judgments - by collapsing them into equivalence classes that yield principled tiered rankings. Empirically, across 14 benchmarks and 5 LLMs, our method achieves Pareto dominance over existing methods: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable approaches, and 7$\times$ fewer than pairwise methods at near-identical quality.
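The core bookkeeping, turning each k-wise ranking into a complete tournament and then taking the transitive closure, is simple to sketch; cycle collapsing into tiered rankings and the certification and scheduling logic are omitted here.

```python
from itertools import combinations

def add_tournament(edges, ranked_docs):
    """Record the complete tournament implied by one k-wise LLM ranking."""
    for i, j in combinations(range(len(ranked_docs)), 2):
        edges.add((ranked_docs[i], ranked_docs[j]))   # earlier in the list beats later

def transitive_closure(edges):
    """Derive extra orderings for free: if a > b and b > c then a > c."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and a != d and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Two overlapping 3-wise comparisons; cycle handling (tiered rankings) is omitted.
edges = set()
add_tournament(edges, ["d2", "d5", "d1"])
add_tournament(edges, ["d1", "d4", "d3"])
closure = transitive_closure(edges)
print(("d2", "d3") in closure)   # True: inferred without another model call
```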
[460] When Are RL Hyperparameters Benign? A Study in Offline Goal-Conditioned RL
Jan Malte Töpperwien, Aditya Mohan, Marius Lindauer
Main category: cs.LG
TL;DR: Deep RL hyperparameter sensitivity is not inevitable but amplified by bootstrapping dynamics, with quasimetric representation learning showing greater robustness than bootstrapped TD-learning in offline goal-conditioned RL.
Details
Motivation: The paper investigates whether hyperparameter sensitivity in Deep RL is intrinsic to RL problems or exacerbated by specific training mechanisms, particularly in offline goal-conditioned RL where data distributions are fixed and non-stationarity can be controlled.Method: The study examines offline goal-conditioned RL under both stationary and non-stationary regimes with controlled data quality shifts. It compares two representative algorithms: HIQL (bootstrapped TD-learning) and QRL (quasimetric representation learning). An inter-goal gradient alignment diagnostic is introduced to analyze gradient interference.
Result: Results show substantially greater robustness to hyperparameter changes than commonly reported for online RL. With modest expert data (~20%), QRL maintains broad stable near-optimal regions while HIQL exhibits sharp optima that drift across training phases. Bootstrapped objectives show stronger destructive gradient interference correlating with hyperparameter sensitivity.
Conclusion: High hyperparameter sensitivity in RL is not inevitable but amplified by bootstrapping dynamics. This insight offers a pathway toward more robust algorithmic objective design by addressing gradient interference issues in bootstrapped methods.
Abstract: Hyperparameter sensitivity in Deep Reinforcement Learning (RL) is often accepted as unavoidable. However, it remains unclear whether it is intrinsic to the RL problem or exacerbated by specific training mechanisms. We investigate this question in offline goal-conditioned RL, where data distributions are fixed, and non-stationarity can be explicitly controlled via scheduled shifts in data quality. Additionally, we study varying data qualities under both stationary and non-stationary regimes, and cover two representative algorithms: HIQL (bootstrapped TD-learning) and QRL (quasimetric representation learning). Overall, we observe substantially greater robustness to changes in hyperparameter configurations than commonly reported for online RL, even under controlled non-stationarity. Once modest expert data is present ($\approx$ 20%), QRL maintains broad, stable near-optimal regions, while HIQL exhibits sharp optima that drift significantly across training phases. To explain this divergence, we introduce an inter-goal gradient alignment diagnostic. We find that bootstrapped objectives exhibit stronger destructive gradient interference, which coincides directly with hyperparameter sensitivity. These results suggest that high sensitivity to changes in hyperparameter configurations during training is not inevitable in RL, but is amplified by the dynamics of bootstrapping, offering a pathway toward more robust algorithmic objective design.
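A plausible instantiation of the inter-goal gradient alignment diagnostic is the mean pairwise cosine similarity between per-goal gradients; the exact form used in the paper is not given in this digest, so the sketch below is an assumption, and the toy model and losses are illustrative.

```python
import torch

def inter_goal_alignment(model, per_goal_losses):
    """Mean pairwise cosine similarity between per-goal gradients.

    Low or negative values indicate destructive interference between goals.
    Averaging pairwise cosines of flattened gradients is an assumed, simple
    instantiation of the diagnostic described in the abstract.
    """
    params = list(model.parameters())
    grads = []
    for loss in per_goal_losses:
        g = torch.autograd.grad(loss, params, retain_graph=True)
        grads.append(torch.cat([t.reshape(-1) for t in g]))
    sims = []
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            sims.append(torch.nn.functional.cosine_similarity(grads[i], grads[j], dim=0))
    return torch.stack(sims).mean()

# Toy usage with a linear "policy" and two goals (names and losses are illustrative).
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
loss_goal_a = model(x).pow(2).mean()
loss_goal_b = (model(x) - 1.0).abs().mean()
print(inter_goal_alignment(model, [loss_goal_a, loss_goal_b]))
```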
[461] Thermodynamic Limits of Physical Intelligence
Koichi Takahashi, Yusuke Hayashi
Main category: cs.LG
TL;DR: Proposes two bits-per-joule metrics for AI energy efficiency: Thermodynamic Epiplexity per Joule (recognition/model-building) and Empowerment per Joule (control/action influence), with explicit accounting conventions.
Details
Motivation: Modern AI systems achieve remarkable capabilities but consume substantial energy. The paper aims to connect intelligence to physical efficiency by establishing rigorous metrics that quantify information processing per unit energy.Method: Proposes two complementary metrics: (1) Thermodynamic Epiplexity per Joule - bits of structural information encoded per unit energy, and (2) Empowerment per Joule - sensorimotor channel capacity per expected energetic cost. Uses stochastic thermodynamics to establish Landauer-scale benchmarks and addresses boundary/accounting conventions.
Result: Develops a unified efficiency framework with explicit accounting conventions, showing how Landauer-scaled costs act as closed-cycle benchmarks under specific assumptions, and demonstrates that without proper boundary assumptions, information gain and dissipation need not be tightly linked.
Conclusion: Provides rigorous bits-per-joule metrics for AI energy efficiency with explicit accounting conventions, enabling consistent comparisons and connecting thermodynamic principles to AI system evaluation. Recommends reporting both metrics with clear boundary/energy accounting conventions.
Abstract: Modern AI systems achieve remarkable capabilities at the cost of substantial energy consumption. To connect intelligence to physical efficiency, we propose two complementary bits-per-joule metrics under explicit accounting conventions: (1) Thermodynamic Epiplexity per Joule – bits of structural information about a theoretical environment-instance variable newly encoded in an agent’s internal state per unit measured energy within a stated boundary – and (2) Empowerment per Joule – the embodied sensorimotor channel capacity (control information) per expected energetic cost over a fixed horizon. These provide two axes of physical intelligence: recognition (model-building) vs. control (action influence). Drawing on stochastic thermodynamics, we show how a Landauer-scale closed-cycle benchmark for epiplexity acquisition follows as a corollary of a standard thermodynamic-learning inequality under explicit subsystem assumptions, and we clarify how Landauer-scaled costs act as closed-cycle benchmarks under explicit reset/reuse and boundary-closure assumptions; conversely, we give a simple decoupling construction showing that without such assumptions – and without charging for externally prepared low-entropy resources (e.g. fresh memory) crossing the boundary – information gain and in-boundary dissipation need not be tightly linked. For empirical settings where the latent structure variable is unavailable, we align the operational notion of epiplexity with compute-bounded MDL epiplexity and recommend reporting MDL-epiplexity / compression-gain surrogates as companions. Finally, we propose a unified efficiency framework that reports both metrics together with a minimal checklist of boundary/energy accounting, coarse-graining/noise, horizon/reset, and cost conventions to reduce ambiguity and support consistent bits-per-joule comparisons, and we sketch connections to energy-adjusted scaling analyses.
[462] A Unified Framework for Rethinking Policy Divergence Measures in GRPO
Qingyuan Wu, Yuhui Wang, Simon Sinong Zhan, Yanning Dai, Shilong Deng, Sarra Habchi, Qi Zhu, Matthias Gallé, Chao Huang
Main category: cs.LG
TL;DR: RLVR framework unifies policy divergence constraints for LLM reasoning, introduces KL3 estimator that enables asymmetric clipping for better exploration while maintaining GRPO stability
Details
Motivation: Existing RLVR methods like GRPO use likelihood ratio clipping for stable updates, but there's a need for a unified framework to systematically analyze how different policy divergence measures affect exploration and performance in LLM reasoning tasks.Method: Proposes unified clipping framework characterizing existing methods via general policy divergence measures (likelihood ratios, KL divergences, alternatives). Identifies KL3 estimator as key constraint, theoretically showing it’s equivalent to asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions.
Result: Empirical results on mathematical reasoning benchmarks show KL3 estimator incorporated into GRPO improves both training stability and final performance compared to baseline methods.
Conclusion: The KL3-based constraint promotes stronger exploration while retaining GRPO simplicity, highlighting importance of principled policy divergence constraints in policy optimization for LLM reasoning.
Abstract: Reinforcement Learning with Verified Reward (RLVR) has emerged as a critical paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). Most existing RLVR methods, such as GRPO and its variants, ensure stable updates by constraining policy divergence through clipping likelihood ratios. This paper introduces a unified clipping framework that characterizes existing methods via a general notion of policy divergence, encompassing both likelihood ratios and Kullback-Leibler (KL) divergences and extending to alternative measures. The framework provides a principled foundation for systematically analyzing how different policy divergence measures affect exploration and performance. We further identify the KL3 estimator, a variance-reduced Monte Carlo estimator of the KL divergence, as a key policy divergence constraint. We theoretically demonstrate that the KL3-based constraint is mathematically equivalent to an asymmetric ratio-based clipping that reallocates probability mass toward high-confidence actions, promoting stronger exploration while retaining the simplicity of GRPO-style methods. Empirical results on mathematical reasoning benchmarks demonstrate that incorporating the KL3 estimator into GRPO improves both training stability and final performance, highlighting the importance of principled policy divergence constraints in policy optimization.
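The KL3 estimator itself is a one-liner; the sketch below computes it per token from log-probabilities under the sampling policy and a reference model, but does not reproduce how the paper folds it into the clipping rule, and the toy values are illustrative.

```python
import torch

def kl3_estimate(logp_policy, logp_ref):
    """Variance-reduced per-token KL estimator ("k3"): (r - 1) - log r, with r = p_ref / p_policy.

    Computed on tokens sampled from the current policy, so its expectation is
    KL(policy || ref). The estimator is always non-negative.
    """
    log_ratio = logp_ref - logp_policy
    return torch.exp(log_ratio) - 1.0 - log_ratio

# Toy per-token log-probabilities under the sampling policy and a reference model.
logp_policy = torch.tensor([-1.2, -0.3, -2.1])
logp_ref = torch.tensor([-1.0, -0.5, -2.4])
print(kl3_estimate(logp_policy, logp_ref))   # elementwise, always >= 0
```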
[463] Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification
Tao Huang, Rui Wang, Xiaofei Liu, Yi Qin, Li Duan, Liping Jing
Main category: cs.LG
TL;DR: EUQ is a fine-grained uncertainty quantification method for LVLMs that detects misbehaviors by measuring internal conflict and ignorance through evidence theory.
Details
Motivation: LVLMs often produce unreliable/harmful content when faced with incompetent/adversarial inputs, stemming from epistemic uncertainty (conflicting knowledge or information absence). Existing uncertainty methods only capture overall uncertainty and are ineffective at identifying specific misbehaviors.Method: EUQ interprets model output features as supporting/opposing evidence, uses Evidence Theory to model and aggregate this evidence in a single forward pass to quantify internal conflict and knowledge gaps.
Result: EUQ outperforms baselines across four misbehavior categories (hallucinations, jailbreaks, adversarial vulnerabilities, OOD failures) with SOTA LVLMs. Hallucinations correlate with high internal conflict, OOD failures with high ignorance. Layer-wise analysis reveals uncertainty dynamics.
Conclusion: EUQ provides effective fine-grained uncertainty quantification for detecting LVLM misbehaviors, offering insights into internal representation evolution and improving model reliability.
Abstract: Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, when presented with incompetent or adversarial inputs, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. This misalignment with human expectations, referred to as \emph{misbehaviors} of LVLMs, raises serious concerns for deployment in critical applications. These misbehaviors are found to stem from epistemic uncertainty, specifically either conflicting internal knowledge or the absence of supporting information. However, existing uncertainty quantification methods, which typically capture only overall epistemic uncertainty, have shown limited effectiveness in identifying such issues. To address this gap, we propose Evidential Uncertainty Quantification (EUQ), a fine-grained method that captures both information conflict and ignorance for effective detection of LVLM misbehaviors. In particular, we interpret features from the model output head as either supporting (positive) or opposing (negative) evidence. Leveraging Evidence Theory, we model and aggregate this evidence to quantify internal conflict and knowledge gaps within a single forward pass. We extensively evaluate our method across four categories of misbehavior, including hallucinations, jailbreaks, adversarial vulnerabilities, and out-of-distribution (OOD) failures, using state-of-the-art LVLMs, and find that EUQ consistently outperforms strong baselines, showing that hallucinations correspond to high internal conflict and OOD failures to high ignorance. Furthermore, layer-wise evidential uncertainty dynamics analysis helps interpret the evolution of internal representations from a new perspective. The source code is available at https://github.com/HT86159/EUQ.
[464] Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation
Zhiqi Yu, Zhangquan Chen, Mengting Liu, Heye Zhang, Liangqiong Qu
Main category: cs.LG
TL;DR: A-GRAE improves GRPO by addressing advantage symmetry issues in reinforcement learning for LLMs, enhancing exploration and difficulty adaptation through asymmetric advantage estimation.
Details
Motivation: Current RLVR methods like GRPO have exploration and difficulty adaptation limitations due to implicit advantage symmetry in Group Relative Advantage Estimation, which hinders novel solution discovery and optimal sample difficulty prioritization.Method: Proposes Asymmetric GRAE (A-GRAE) that dynamically modulates exploration incentives and sample-difficulty focus by asymmetrically suppressing advantages of correct trajectories and implementing curriculum-like transitions from simple to complex samples.
Result: Experiments across seven benchmarks show A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs, demonstrating better exploration and difficulty adaptation.
Conclusion: Addressing advantage symmetry in RLVR methods through asymmetric advantage estimation significantly enhances exploration and learning efficiency for LLM reasoning tasks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly GRPO, has become the standard for eliciting LLM reasoning. However, its efficiency in exploration and difficulty adaptation remains an open challenge. In this work, we argue that these bottlenecks stem from an implicit advantage symmetry inherent in Group Relative Advantage Estimation (GRAE). This symmetry induces two critical limitations: (i) at the group level, strict symmetry in weights between correct and incorrect trajectories leaves unsampled action logits unchanged, thereby hindering exploration of novel correct solutions. (ii) at the sample level, the algorithm implicitly prioritizes medium-difficulty samples, remaining agnostic to the non-stationary demands of difficulty focus. Through controlled experiments, we reveal that this symmetric property is sub-optimal, yielding two pivotal insights: (i) asymmetrically suppressing the advantages of correct trajectories encourages essential exploration. (ii) learning efficiency is maximized by a curriculum-like transition: prioritizing simpler samples initially before gradually shifting to complex ones. Motivated by these findings, we propose Asymmetric GRAE (A-GRAE), which dynamically modulates exploration incentives and sample-difficulty focus. Experiments across seven benchmarks demonstrate that A-GRAE consistently improves GRPO and its variants across both LLMs and MLLMs.
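Insight (i) amounts to scaling down the positive side of the group-relative advantages; a minimal sketch follows, where the suppression factor and its schedule are illustrative assumptions rather than A-GRAE's actual modulation rule.

```python
import numpy as np

def asymmetric_group_advantages(rewards, pos_scale=0.7, eps=1e-8):
    """Group-relative advantages with the positive side suppressed.

    Standard GRAE normalizes rewards within the group; here advantages of correct
    (above-mean) trajectories are additionally scaled down, an asymmetry the paper
    argues frees probability mass for unsampled actions. The value of `pos_scale`
    and its (curriculum-style) schedule are illustrative assumptions.
    """
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)
    return np.where(adv > 0, pos_scale * adv, adv)

# One group with two correct and two incorrect rollouts.
print(asymmetric_group_advantages([1.0, 1.0, 0.0, 0.0]))
```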
[465] Logical Guidance for the Exact Composition of Diffusion Models
Francesco Alesiani, Jonathan Warrell, Tanja Bien, Henrik Christiansen, Matheus Ferraz, Mathias Niepert
Main category: cs.LG
TL;DR: LOGDIFF is a guidance framework for diffusion models that enables constrained generation using complex logical expressions at inference time through exact Boolean calculus and hybrid guidance.
Details
Motivation: Current diffusion models lack principled methods for constrained generation with complex logical expressions, limiting their ability to generate samples that satisfy multiple constraints simultaneously.Method: Develops exact Boolean calculus for logical guidance, provides sufficient conditions for exact logical guidance via circuit representations, and introduces hybrid guidance combining atomic scores with posterior probabilities.
Result: Demonstrates effectiveness on multiple image and protein structure generation tasks, showing that complex logical constraints can be exactly enforced during generation.
Conclusion: LOGDIFF provides a principled framework for exact constrained generation with logical expressions in diffusion models, bridging classifier guidance and classifier-free guidance approaches.
Abstract: We propose LOGDIFF (Logical Guidance for the Exact Composition of Diffusion Models), a guidance framework for diffusion models that enables principled constrained generation with complex logical expressions at inference time. We study when exact score-based guidance for complex logical formulas can be obtained from guidance signals associated with atomic properties. First, we derive an exact Boolean calculus that provides a sufficient condition for exact logical guidance. Specifically, if a formula admits a circuit representation in which conjunctions combine conditionally independent subformulas and disjunctions combine subformulas that are either conditionally independent or mutually exclusive, exact logical guidance is achievable. In this case, the guidance signal can be computed exactly from atomic scores and posterior probabilities using an efficient recursive algorithm. Moreover, we show that, for commonly encountered classes of distributions, any desired Boolean formula is compilable into such a circuit representation. Second, by combining atomic guidance scores with posterior probability estimates, we introduce a hybrid guidance approach that bridges classifier guidance and classifier-free guidance, applicable to both compositional logical guidance and standard conditional generation. We demonstrate the effectiveness of our framework on multiple image and protein structure generation tasks.
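Two of the composition rules follow directly from the stated conditions and can be written down: conjunction scores add under conditional independence, and a disjunction of mutually exclusive atoms mixes atomic scores by their posteriors. The sketch below shows only these two base cases, not the paper's recursive circuit algorithm or the general disjunction case; the toy scores and posteriors are illustrative.

```python
import numpy as np

def and_score(score_a, score_b):
    """Conjunction guidance under conditional independence:
    grad log p(A, B | x) = grad log p(A | x) + grad log p(B | x)."""
    return score_a + score_b

def or_score(score_a, score_b, post_a, post_b):
    """Disjunction guidance for mutually exclusive atoms:
    grad log(p_A + p_B) = (p_A * grad log p_A + p_B * grad log p_B) / (p_A + p_B)."""
    return (post_a * score_a + post_b * score_b) / (post_a + post_b)

# Toy atomic guidance scores (same shape as the diffusion state) and posteriors.
score_a, score_b = np.array([0.3, -0.1]), np.array([-0.2, 0.4])
print(and_score(score_a, score_b))
print(or_score(score_a, score_b, post_a=0.8, post_b=0.1))
```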
[466] MAGPrompt: Message-Adaptive Graph Prompt Tuning for Graph Neural Networks
Long D. Nguyen, Binh P. Nguyen
Main category: cs.LG
TL;DR: Message-adaptive graph prompt tuning injects learnable prompts into GNN message passing to adapt neighborhood interactions for downstream tasks while keeping the backbone frozen.
Details
Motivation: Existing graph prompt tuning methods only modify inputs or representations, leaving message passing unchanged, which limits their ability to adapt neighborhood interactions crucial for downstream task performance.Method: Proposes injecting learnable prompts into the message passing step to reweight incoming neighbor messages and add task-specific prompt vectors during message aggregation, while keeping the pre-trained GNN backbone frozen.
Result: Experiments on diverse node- and graph-level datasets show consistent gains over prior graph prompting methods in few-shot settings, while achieving performance competitive with fine-tuning in full-shot regimes.
Conclusion: Message-adaptive graph prompt tuning effectively adapts pre-trained GNNs to downstream tasks by modifying the message passing process, offering a parameter-efficient alternative to fine-tuning with strong performance across settings.
Abstract: Pre-trained graph neural networks (GNNs) transfer well, but adapting them to downstream tasks remains challenging due to mismatches between pre-training objectives and task requirements. Graph prompt tuning offers a parameter-efficient alternative to fine-tuning, yet most methods only modify inputs or representations and leave message passing unchanged, limiting their ability to adapt neighborhood interactions. We propose message-adaptive graph prompt tuning, which injects learnable prompts into the message passing step to reweight incoming neighbor messages and add task-specific prompt vectors during message aggregation, while keeping the backbone GNN frozen. The approach is compatible with common GNN backbones and pre-training strategies, and applicable across downstream settings. Experiments on diverse node- and graph-level datasets show consistent gains over prior graph prompting methods in few-shot settings, while achieving performance competitive with fine-tuning in full-shot regimes.
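A rough sketch of a message-adaptive prompt module around a frozen message transform is given below; the gating form (a sigmoid of a dot product with a learned prompt key) and the dense-adjacency mean aggregation are assumptions, since the digest only states that incoming messages are reweighted and a prompt vector is added at aggregation.

```python
import torch
import torch.nn as nn

class MessageAdaptivePrompt(nn.Module):
    """Prompt module that adapts message passing around a frozen GNN transform.

    Incoming neighbor messages are reweighted by a learned gate and a task-specific
    prompt vector is added at aggregation time. The gating form is an illustrative
    assumption; only the prompt parameters are trained, the backbone stays frozen.
    """
    def __init__(self, d_hidden: int):
        super().__init__()
        self.gate_key = nn.Parameter(torch.randn(d_hidden) * 0.01)
        self.agg_prompt = nn.Parameter(torch.zeros(d_hidden))

    def forward(self, frozen_lin: nn.Linear, h: torch.Tensor, adj: torch.Tensor):
        # h: (nodes, d), adj: dense (nodes, nodes) adjacency, adj[i, j] = edge j -> i.
        msgs = frozen_lin(h)                                  # frozen message transform
        gate = torch.sigmoid(msgs @ self.gate_key)            # one weight per source node
        weighted_adj = adj * gate.unsqueeze(0)                # reweight incoming messages
        deg = weighted_adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        agg = weighted_adj @ msgs / deg                       # mean-style aggregation
        return agg + self.agg_prompt                          # task-specific prompt vector

# Toy usage: the linear layer stands in for a frozen pre-trained GNN transform.
frozen_lin = nn.Linear(16, 16)
for p in frozen_lin.parameters():
    p.requires_grad_(False)
prompt = MessageAdaptivePrompt(d_hidden=16)
h = torch.randn(5, 16)
adj = (torch.rand(5, 5) > 0.5).float()
out = prompt(frozen_lin, h, adj)
```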
[467] EdgeMask-DG*: Learning Domain-Invariant Graph Structures via Adversarial Edge Masking
Rishabh Bhattacharya, Naresh Manwani
Main category: cs.LG
TL;DR: EdgeMask-DG*: A graph domain generalization method that uses adversarial edge masking on feature-enriched graphs to find domain-invariant structural information.
Details
Motivation: Graph neural networks struggle with structural shifts across domains. Existing methods use fixed augmentations or global perturbations that don't identify which edges encode domain-invariant information. The authors argue that domain-invariant structural information resides in consensus across multiple graph structures derived from both topology and feature similarity.Method: 1) EdgeMask-DG: A min-max algorithm where an edge masker learns to find worst-case continuous masks subject to sparsity constraints, forcing a task GNN to perform well under adversarial structural perturbations. 2) EdgeMask-DG*: Extends this to enriched graphs that combine original topology with feature-derived edges, enabling invariance discovery even with noisy or domain-specific original topology.
Result: Achieves state-of-the-art performance on diverse graph domain generalization benchmarks including citation networks, social networks, and temporal graphs. On Cora OOD benchmark, lifts worst-case domain accuracy to 78.0% (+3.8 pp improvement over prior SOTA of 74.2%).
Conclusion: EdgeMask-DG* is the first method to systematically combine adaptive adversarial topology search with feature-enriched graphs for graph domain generalization, providing formal justification from robust optimization perspective and demonstrating superior performance across multiple benchmarks.
Abstract: Structural shifts pose a significant challenge for graph neural networks, as graph topology acts as a covariate that can vary across domains. Existing domain generalization methods rely on fixed structural augmentations or training on globally perturbed graphs, mechanisms that do not pinpoint which specific edges encode domain-invariant information. We argue that domain-invariant structural information is not rigidly tied to a single topology but resides in the consensus across multiple graph structures derived from topology and feature similarity. To capture this, we first propose EdgeMask-DG, a novel min-max algorithm where an edge masker learns to find worst-case continuous masks subject to a sparsity constraint, compelling a task GNN to perform effectively under these adversarial structural perturbations. Building upon this, we introduce EdgeMask-DG*, an extension that applies this adversarial masking principle to an enriched graph. This enriched graph combines the original topology with feature-derived edges, allowing the model to discover invariances even when the original topology is noisy or domain-specific. EdgeMask-DG* is the first to systematically combine adaptive adversarial topology search with feature-enriched graphs. We provide a formal justification for our approach from a robust optimization perspective. We demonstrate that EdgeMask-DG* achieves new state-of-the-art performance on diverse graph domain generalization benchmarks, including citation networks, social networks, and temporal graphs. Notably, on the Cora OOD benchmark, EdgeMask-DG* lifts the worst-case domain accuracy to 78.0%, a +3.8 pp improvement over the prior state of the art (74.2%). The source code for our experiments can be found here: https://anonymous.4open.science/r/TMLR-EAEF/
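One min-max round of adversarial edge masking can be sketched on a dense adjacency as follows; the L1-style budget on removed edges stands in for the paper's sparsity constraint, the direction of the masking is an interpretation of the abstract, and the tiny GNN is only a placeholder.

```python
import torch
import torch.nn as nn

class TinyGNN(nn.Module):
    """Stand-in one-layer GNN operating on a dense adjacency (illustrative only)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, h, adj):
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return (adj @ self.lin(h)) / deg

def adversarial_mask_step(gnn, mask_logits, h, adj, labels, opt_gnn, opt_mask,
                          sparsity_weight=1.0, inner_steps=3):
    """One heavily simplified round of the min-max game: the masker drops edges to
    maximize task loss while a budget penalty limits how much topology it removes;
    the GNN then trains under that adversarial mask."""
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(inner_steps):
        mask = torch.sigmoid(mask_logits)
        loss = loss_fn(gnn(h, adj * mask), labels)
        budget = sparsity_weight * (1.0 - mask).mean()   # cost of removed edges
        opt_mask.zero_grad()
        (budget - loss).backward()                        # ascend on the task loss
        opt_mask.step()
    mask = torch.sigmoid(mask_logits).detach()
    loss = loss_fn(gnn(h, adj * mask), labels)
    opt_gnn.zero_grad()
    loss.backward()
    opt_gnn.step()
    return loss.item()

# Toy run on random data.
n, d, c = 12, 8, 3
h, adj = torch.randn(n, d), (torch.rand(n, n) > 0.6).float()
labels = torch.randint(0, c, (n,))
gnn = TinyGNN(d, c)
mask_logits = nn.Parameter(3.0 * torch.ones(n, n))       # start with all edges kept
opt_gnn = torch.optim.Adam(gnn.parameters(), lr=1e-2)
opt_mask = torch.optim.Adam([mask_logits], lr=1e-1)
print(adversarial_mask_step(gnn, mask_logits, h, adj, labels, opt_gnn, opt_mask))
```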
[468] OpenMAG: A Comprehensive Benchmark for Multimodal-Attributed Graph
Chenxi Wan, Xunkai Li, Yilong Zuo, Haokun Deng, Sihan Li, Bowen Fan, Hongchao Qin, Ronghua Li, Guoren Wang
Main category: cs.LG
TL;DR: OpenMAG is a comprehensive benchmark for Multimodal-Attributed Graph learning that integrates 19 datasets across 6 domains, 16 encoders, 24 models, and 8 downstream tasks to establish rigorous evaluation standards.
Details
Motivation: Existing benchmarks for Multimodal-Attributed Graph (MAG) learning have critical limitations in domain coverage, encoder flexibility, model diversity, and task scope, creating challenges for fair evaluation despite rapid proliferation of novel MAG models.Method: OpenMAG integrates 19 datasets across 6 domains, incorporates 16 encoders for static and trainable feature encoding, implements 24 state-of-the-art models, and supports 8 downstream tasks within a unified framework for systematic assessment.
Result: Through systematic assessment across necessity, data quality, effectiveness, robustness, and efficiency dimensions, the authors derive 14 fundamental insights into MAG learning to guide future advancements.
Conclusion: OpenMAG provides a comprehensive benchmark that addresses limitations of existing MAG evaluation frameworks and establishes rigorous standards for fair comparison and future research in multimodal-attributed graph learning.
Abstract: Multimodal-Attributed Graph (MAG) learning has achieved remarkable success in modeling complex real-world systems by integrating graph topology with rich attributes from multiple modalities. With the rapid proliferation of novel MAG models capable of handling intricate cross-modal semantics and structural dependencies, establishing a rigorous and unified evaluation standard has become imperative. Although existing benchmarks have facilitated initial progress, they exhibit critical limitations in domain coverage, encoder flexibility, model diversity, and task scope, presenting significant challenges to fair evaluation. To bridge this gap, we present OpenMAG, a comprehensive benchmark that integrates 19 datasets across 6 domains and incorporates 16 encoders to support both static and trainable feature encoding. OpenMAG further implements a standardized library of 24 state-of-the-art models and supports 8 downstream tasks, enabling fair comparisons within a unified framework. Through systematic assessment of necessity, data quality, effectiveness, robustness, and efficiency, we derive 14 fundamental insights into MAG learning to guide future advancements. Our code is available at https://github.com/YUKI-N810/OpenMAG.
[469] On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature
Yikuan Zhang, Ning Yang, Yuhai Tu
Main category: cs.LG
TL;DR: SGD noise covariance C is not proportional to Hessian H in deep networks; using Activity-Weight Duality, C ∝ 𝔼[h_p²] where h_p is per-sample Hessian, leading to approximate commutation and power-law relation C_ii ∝ H_ii^γ with 1≤γ≤2.
Details
Motivation: Prior work incorrectly assumes equivalence between Fisher Information Matrix and Hessian for negative log-likelihood losses, claiming SGD noise covariance C is proportional to Hessian H. This assumption holds only under restrictive conditions typically violated in deep neural networks.
Method: Uses Activity-Weight Duality to derive a more general relationship agnostic to specific loss formulation. Shows C ∝ 𝔼[h_p²] where h_p is per-sample Hessian with H = 𝔼[h_p]. Demonstrates C and H commute approximately rather than coincide exactly.
Result: Diagonal elements follow approximate power-law relation C_ii ∝ H_ii^γ with theoretically bounded exponent 1 ≤ γ ≤ 2, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds.
Conclusion: Provides unified characterization of noise-curvature relationship in deep learning, correcting prior misconceptions about SGD noise covariance and Hessian relationship in deep neural networks.
Abstract: Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity–Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power-law relation $C_{ii} \propto H_{ii}^γ$ with a theoretically bounded exponent $1 \leq γ \leq 2$, determined by per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
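To make the diagonal power-law above concrete, the following toy sketch (not from the paper) uses batch-size-one linear regression, where the per-sample Hessian is x_p x_p^T, estimates the diagonals of the SGD noise covariance and of H, and fits an exponent on a log-log scale. All names and data are illustrative assumptions, not the paper's experiments or its Activity-Weight Duality derivation.

```python
import numpy as np

# Toy check of a power-law C_ii ~ H_ii^gamma for squared-error linear regression,
# where the per-sample Hessian is h_p = x_p x_p^T. Illustration only.
rng = np.random.default_rng(0)
n, d = 20000, 20
scales = rng.uniform(0.5, 3.0, size=d)          # anisotropic feature scales
X = rng.standard_normal((n, d)) * scales
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(d)                                  # evaluate away from the minimum
resid = X @ w - y
per_sample_grads = resid[:, None] * X            # gradient of 0.5 * (x^T w - y)^2 per sample
C_diag = per_sample_grads.var(axis=0)            # diag of the batch-size-1 SGD noise covariance
H_diag = (X ** 2).mean(axis=0)                   # diag of H = E[h_p]

gamma, _ = np.polyfit(np.log(H_diag), np.log(C_diag), 1)
print(f"fitted exponent gamma = {gamma:.2f} (the paper bounds it by 1 <= gamma <= 2)")
```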
[470] Shiva-DiT: Residual-Based Differentiable Top-$k$ Selection for Efficient Diffusion Transformers
Jiaji Zhang, Hailiang Zhao, Guoxuan Zhu, Ruichao Sun, Jiaju Wu, Xinkui Zhao, Hanlin Tang, Weiyi Lu, Kan Liu, Tao Lan, Lin Qu, Shuiguang Deng
Main category: cs.LG
TL;DR: Shiva-DiT: A differentiable pruning method for Diffusion Transformers that uses residual-based selection to achieve deterministic token counts for hardware efficiency while maintaining learnability.
Details
Motivation: Diffusion Transformers suffer from high computational costs due to quadratic self-attention scaling. Existing pruning methods can't simultaneously satisfy differentiability, efficiency, and strict static hardware budgets needed for practical deployment.
Method: Proposes Residual-Based Differentiable Top-k Selection using a residual-aware straight-through estimator to enforce deterministic token counts for static compilation while preserving end-to-end learnability. Also introduces Context-Aware Router and Adaptive Ratio Policy to learn adaptive pruning schedules autonomously.
Result: Achieves 1.54× wall-clock speedup with superior fidelity compared to existing baselines, establishes new Pareto frontier, and effectively eliminates ragged tensor overheads in mainstream models including SD3.5.
Conclusion: Shiva-DiT successfully reconciles conflicting requirements of differentiability, efficiency, and strict static hardware budgets for Diffusion Transformer pruning, enabling practical deployment with significant speed improvements.
Abstract: Diffusion Transformers (DiTs) incur prohibitive computational costs due to the quadratic scaling of self-attention. Existing pruning methods fail to simultaneously satisfy differentiability, efficiency, and the strict static budgets required to avoid hardware overhead. To address this, we propose Shiva-DiT, which effectively reconciles these conflicting requirements via Residual-Based Differentiable Top-$k$ Selection. By leveraging a residual-aware straight-through estimator, our method enforces deterministic token counts for static compilation while preserving end-to-end learnability through residual gradient estimation. Furthermore, we introduce a Context-Aware Router and Adaptive Ratio Policy to autonomously learn an adaptive pruning schedule. Experiments on mainstream models, including SD3.5, demonstrate that Shiva-DiT establishes a new Pareto frontier, achieving a 1.54$\times$ wall-clock speedup with superior fidelity compared to existing baselines, effectively eliminating ragged tensor overheads.
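As a rough illustration of the deterministic-budget idea, here is a minimal straight-through top-k token mask in PyTorch: the forward pass keeps exactly k tokens (a static shape for compilation), while gradients flow through a soft surrogate. The residual-aware estimator, Context-Aware Router, and Adaptive Ratio Policy from the paper are not reproduced; the router scores below are a placeholder.

```python
import torch

def straight_through_topk_mask(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard top-k mask in the forward pass, gradients through a soft surrogate."""
    soft = torch.sigmoid(scores)                      # differentiable surrogate
    hard = torch.zeros_like(scores)
    hard.scatter_(-1, scores.topk(k, dim=-1).indices, 1.0)
    return hard + soft - soft.detach()                # forward: hard mask; backward: d(soft)

# Toy usage: keep a fixed budget of k tokens so compiled kernels see static shapes.
tokens = torch.randn(2, 16, 64, requires_grad=True)   # (batch, tokens, dim)
router_scores = tokens.mean(-1)                        # placeholder importance scores
mask = straight_through_topk_mask(router_scores, k=4)
pruned = tokens * mask.unsqueeze(-1)                   # zero out dropped tokens
pruned.sum().backward()                                # gradients reach the router via the surrogate
print(mask.sum(-1))                                    # exactly 4 kept tokens per sample
```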
[471] Path-Guided Flow Matching for Dataset Distillation
Xuhui Li, Zhengquan Luo, Xiwei Liu, Yongqiang Yu, Zhiqiang Xu
Main category: cs.LG
TL;DR: PGFM is a flow matching-based dataset distillation method that enables fast deterministic synthesis via ODE solving, using continuous path-to-prototype guidance for stable trajectory control in latent space.
Details
Motivation: Current diffusion-based dataset distillation methods suffer from time-consuming sampling, trajectory instability, and poor downstream generalization under strong control or low IPC settings, necessitating a more efficient and stable approach.
Method: Proposes Path-Guided Flow Matching (PGFM) framework that performs flow matching in the latent space of a frozen VAE to learn class-conditional transport from Gaussian noise to data distribution, with continuous path-to-prototype guidance for ODE-consistent path control.
Result: PGFM matches or surpasses prior diffusion-based distillation approaches with fewer sampling steps, achieving 7.6× more efficiency than diffusion-based counterparts with 78% mode coverage across high-resolution benchmarks.
Conclusion: PGFM provides an efficient and stable alternative to diffusion-based dataset distillation, enabling fast deterministic synthesis with reliable trajectory control while maintaining competitive performance.
Abstract: Dataset distillation compresses large datasets into compact synthetic sets with comparable performance in training models. Despite recent progress on diffusion-based distillation, this type of method typically depends on heuristic guidance or prototype assignment, which comes with time-consuming sampling and trajectory instability and thus hurts downstream generalization especially under strong control or low IPC. We propose \emph{Path-Guided Flow Matching (PGFM)}, the first flow matching-based framework for generative distillation, which enables fast deterministic synthesis by solving an ODE in a few steps. PGFM conducts flow matching in the latent space of a frozen VAE to learn class-conditional transport from Gaussian noise to data distribution. Particularly, we develop a continuous path-to-prototype guidance algorithm for ODE-consistent path control, which allows trajectories to reliably land on assigned prototypes while preserving diversity and efficiency. Extensive experiments across high-resolution benchmarks demonstrate that PGFM matches or surpasses prior diffusion-based distillation approaches with fewer steps of sampling while delivering competitive performance with remarkably improved efficiency, e.g., 7.6$\times$ more efficient than the diffusion-based counterparts with 78% mode coverage.
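For readers unfamiliar with flow matching, the sketch below shows the basic recipe PGFM builds on: regress a velocity field onto straight-line paths between noise and latents, then synthesize by integrating the learned ODE in a few Euler steps. It is a generic, unconditional toy in which random "latents" stand in for VAE codes; the paper's class conditioning and path-to-prototype guidance are omitted.

```python
import torch
import torch.nn as nn

d = 8                                            # latent dimension (stand-in for VAE codes)
velocity_net = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def sample_latents(n):                           # toy "data" latents; a frozen VAE would supply these
    return torch.randn(n, d) * 0.5 + 2.0

for step in range(1000):                         # flow-matching regression
    z1 = sample_latents(256)
    z0 = torch.randn_like(z1)                    # Gaussian noise source
    t = torch.rand(z1.size(0), 1)
    zt = (1 - t) * z0 + t * z1                   # straight-line path between noise and data
    target_v = z1 - z0                           # constant velocity along that path
    pred_v = velocity_net(torch.cat([zt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

z = torch.randn(16, d)                           # deterministic synthesis: few-step Euler ODE solve
steps = 8
with torch.no_grad():
    for i in range(steps):
        t = torch.full((z.size(0), 1), i / steps)
        z = z + velocity_net(torch.cat([z, t], dim=-1)) / steps
print(z.mean().item())                           # drifts toward the toy data mean (~2.0)
```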
[472] Mode-Dependent Rectification for Stable PPO Training
Mohamad Mohamad, Francesco Ponzio, Xavier Descombes
Main category: cs.LG
TL;DR: MDR stabilizes PPO with mode-dependent layers like BatchNorm by using dual-phase training to address training-evaluation discrepancies.
Details
Motivation: Mode-dependent architectural components (e.g., BatchNorm, dropout) commonly used in visual RL can destabilize on-policy optimization like PPO by causing discrepancies between training and evaluation behavior.
Method: Proposes Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without requiring architectural changes.
Result: Experiments across procedurally generated games and real-world patch-localization tasks show MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
Conclusion: MDR effectively addresses the instability caused by mode-dependent layers in PPO, providing a practical solution for visual reinforcement learning applications.
Abstract: Mode-dependent architectural components (layers that behave differently during training and evaluation, such as Batch Normalization or dropout) are commonly used in visual reinforcement learning but can destabilize on-policy optimization. We show that in Proximal Policy Optimization (PPO), discrepancies between training and evaluation behavior induced by Batch Normalization lead to policy mismatch, distributional drift, and reward collapse. We propose Mode-Dependent Rectification (MDR), a lightweight dual-phase training procedure that stabilizes PPO under mode-dependent layers without architectural changes. Experiments across procedurally generated games and real-world patch-localization tasks demonstrate that MDR consistently improves stability and performance, and extends naturally to other mode-dependent layers.
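The failure mode MDR targets can be seen in a few lines: with BatchNorm in the policy network, the same observations yield different action probabilities in .train() and .eval() modes, so the acting policy and the updated policy disagree. The snippet below only illustrates that discrepancy; MDR's dual-phase procedure itself is not shown.

```python
import torch
import torch.nn as nn

# The policy below uses BatchNorm, so it behaves differently in train and eval modes.
policy = nn.Sequential(nn.Linear(8, 32), nn.BatchNorm1d(32), nn.ReLU(), nn.Linear(32, 4))
obs = torch.randn(64, 8)

policy.train()
probs_train = policy(obs).softmax(-1)   # normalization uses batch statistics
policy.eval()
probs_eval = policy(obs).softmax(-1)    # normalization uses running statistics

# If rollouts are collected in one mode and PPO ratios computed in the other,
# pi_new / pi_old is evaluated under a silently different policy.
gap = (probs_train - probs_eval).abs().max()
print(f"max action-probability gap between modes: {gap.item():.3f}")
```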
[473] Structural Disentanglement in Bilinear MLPs via Architectural Inductive Bias
Ojasva Nema, Kaustubh Sharma, Aditya Chauhan, Parikshit Pareek
Main category: cs.LG
TL;DR: Bilinear MLPs with multiplicative interactions enable better structural disentanglement and surgical model editing compared to standard nonlinear networks, particularly for tasks with algebraic structure.
Details
Motivation: Selective unlearning and long-horizon extrapolation remain fragile in neural networks, even for tasks with algebraic structure. The authors argue these failures stem from how models structure internal representations during training, not just optimization algorithms.
Method: Propose Bilinear MLPs with explicit multiplicative interactions as architectural inductive bias. Show analytically that bilinear parameterizations have a 'non-mixing' property under gradient flow where functional components separate into orthogonal subspace representations.
Result: Bilinear architectures recover true operators aligned with underlying algebraic structure in experiments on modular arithmetic, cyclic reasoning, Lie group dynamics, and targeted unlearning benchmarks, unlike pointwise nonlinear networks.
Conclusion: Model editability and generalization are constrained by representational structure, and architectural inductive bias (multiplicative interactions) plays a central role in enabling reliable unlearning and structural disentanglement.
Abstract: Selective unlearning and long-horizon extrapolation remain fragile in modern neural networks, even when tasks have underlying algebraic structure. In this work, we argue that these failures arise not solely from optimization or unlearning algorithms, but from how models structure their internal representations during training. We explore if having explicit multiplicative interactions as an architectural inductive bias helps in structural disentanglement, through Bilinear MLPs. We show analytically that bilinear parameterizations possess a 'non-mixing' property under gradient flow conditions, where functional components separate into orthogonal subspace representations. This provides a mathematical foundation for surgical model modification. We validate this hypothesis through a series of controlled experiments spanning modular arithmetic, cyclic reasoning, Lie group dynamics, and targeted unlearning benchmarks. Unlike pointwise nonlinear networks, multiplicative architectures are able to recover true operators aligned with the underlying algebraic structure. Our results suggest that model editability and generalization are constrained by representational structure, and that architectural inductive bias plays a central role in enabling reliable unlearning.
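One common way to write a bilinear MLP layer (assumed here; the paper may use a different parameterization) is as an elementwise product of two linear maps, so every output is an explicit quadratic form in the inputs rather than a pointwise nonlinearity of a single projection:

```python
import torch
import torch.nn as nn

class BilinearLayer(nn.Module):
    """y = (W_left x) * (W_right x): every output is a quadratic form in x,
    i.e. an explicit multiplicative interaction rather than a pointwise nonlinearity."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.left = nn.Linear(d_in, d_out, bias=False)
        self.right = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        return self.left(x) * self.right(x)   # elementwise product of two linear maps

x = torch.randn(5, 16)
layer = BilinearLayer(16, 32)
print(layer(x).shape)                         # torch.Size([5, 32])
```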
[474] Joint Embedding Variational Bayes
Amin Oji, Paul Fieguth
Main category: cs.LG
TL;DR: VJE is a self-supervised learning framework that combines joint embedding with variational inference to learn probabilistic representations without reconstruction or contrastive objectives, using a Student-t likelihood model with polar decomposition.
Details
Motivation: The paper aims to develop a self-supervised learning approach that learns probabilistic representations without requiring reconstruction or contrastive objectives, addressing limitations of existing methods that optimize pointwise discrepancies and suffer from norm-induced instabilities.
Method: VJE synthesizes joint embedding and variational inference by maximizing a symmetric conditional ELBO for a latent-variable model on encoder embeddings. It uses a heavy-tailed Student-t likelihood with polar decomposition to decouple directional and radial factors, preventing training instabilities. An amortized inference network parameterizes a diagonal Gaussian variational posterior with feature-wise variances shared with the likelihood scale.
Result: VJE achieves performance comparable to standard non-contrastive baselines on ImageNet-1K, CIFAR-10/100, and STL-10 under linear and k-NN evaluation. It outperforms comparable self-supervised baselines in one-class CIFAR-10 anomaly detection using likelihood-based scoring.
Conclusion: VJE provides an effective framework for self-supervised learning of probabilistic representations without reconstruction or contrastive objectives, demonstrating competitive performance on standard benchmarks and improved anomaly detection capabilities.
Abstract: We introduce Variational Joint Embedding (VJE), a framework that synthesizes joint embedding and variational inference to enable self-supervised learning of probabilistic representations in a reconstruction-free, non-contrastive setting. Compared to energy-based predictive objectives that optimize pointwise discrepancies, VJE maximizes a symmetric conditional evidence lower bound (ELBO) for a latent-variable model defined directly on encoder embeddings. We instantiate the conditional likelihood with a heavy-tailed Student-$t$ model using a polar decomposition that explicitly decouples directional and radial factors to prevent norm-induced instabilities during training. VJE employs an amortized inference network to parameterize a diagonal Gaussian variational posterior whose feature-wise variances are shared with the likelihood scale to capture anisotropic uncertainty without auxiliary projection heads. Across ImageNet-1K, CIFAR-10/100, and STL-10, VJE achieves performance comparable to standard non-contrastive baselines under linear and k-NN evaluation. We further validate these probabilistic semantics through one-class CIFAR-10 anomaly detection, where likelihood-based scoring under the proposed model outperforms comparable self-supervised baselines.
[475] Empowering Time Series Analysis with Large-Scale Multimodal Pretraining
Peng Chen, Siyuan Wang, Shiyan Hu, Xingjian Wu, Yang Shu, Zhongwen Rao, Meng Wang, Yijie Li, Bin Yang, Chenjuan Guo
Main category: cs.LG
TL;DR: HORAI is a frequency-enhanced multimodal foundation model for time series analysis that integrates endogenous modalities (derived images/text) and exogenous knowledge (news) to enhance time series understanding through a unified pretraining paradigm.
Details
Motivation: Existing time series foundation models rely on unimodal pretraining and lack complementary modalities to enhance understanding. There's a need for multimodal foundation models but challenges include: 1) lack of unified multimodal pretraining paradigms and large-scale multimodal corpora for time series, and 2) how to effectively integrate heterogeneous modalities and enhance generalization.
Method: Proposes a multimodal pretraining paradigm using time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news). Creates MM-TS, the first large-scale multimodal time series dataset spanning six domains with up to one billion points. Introduces HORAI with two core components: Frequency-enhanced Cross-Modality Encoder and Time-Frequency Decoder to fuse multimodal features and enhance generalization across modalities and domains.
Result: After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization capabilities.
Conclusion: The work represents an early step toward multimodal foundation models for time series analysis, successfully addressing key challenges through a novel multimodal pretraining paradigm, large-scale dataset creation, and a frequency-enhanced architecture that effectively integrates heterogeneous modalities.
Abstract: While existing time series foundation models primarily rely on large-scale unimodal pretraining, they lack complementary modalities to enhance time series understanding. Building multimodal foundation models is a natural next step, but it faces key challenges: 1) lack of a unified multimodal pretraining paradigm and large-scale multimodal corpora for time series analysis; 2) how to effectively integrate heterogeneous modalities and enhance model generalization. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first propose a multimodal pretraining paradigm that leverages time series with endogenous modalities (derived images and text) and exogenous knowledge (real-world news), providing a comprehensive multi-view perspective for time series analysis. To support this, we develop an automated data construction pipeline to curate MM-TS, the first large-scale multimodal time series dataset spanning six domains, with up to one billion points. Then we propose HORAI, a frequency-enhanced multimodal foundation model. It integrates two core components: the Frequency-enhanced Cross-Modality Encoder and the Time-Frequency Decoder, designed to effectively fuse multimodal features and enhance model generalization across modalities and domains. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong generalization.
[476] End-to-End Compression for Tabular Foundation Models
Guri Zabërgja, Rafiq Kamel, Arlind Kadra, Christian M. M. Frey, Josif Grabocka
Main category: cs.LG
TL;DR: TACO is a tabular compression model that compresses training data into latent space, achieving 94x faster inference and 97% less memory usage than state-of-the-art tabular transformers while maintaining performance.
Details
Motivation: Tabular foundation models using transformers have quadratic complexity with dataset size, causing high training/inference overhead and limiting scalability for large datasets. There's a need for more efficient tabular models.
Method: Proposes TACO, an end-to-end tabular compression model that compresses the training dataset into a latent space representation, reducing computational complexity while preserving information.
Result: On TabArena benchmark: 94x faster inference, 97% less memory usage compared to state-of-the-art tabular transformers, with no significant performance degradation. Better scalability and performance with increased dataset sizes.
Conclusion: TACO provides an efficient alternative to transformer-based tabular models, enabling better scalability and reduced computational overhead while maintaining competitive performance.
Abstract: The long-standing dominance of gradient-boosted decision trees for tabular data has recently been challenged by in-context learning tabular foundation models. In-context learning methods fit and predict in one forward pass without parameter updates by leveraging the training data as context for predicting on query test points. While recent tabular foundation models achieve state-of-the-art performance, their transformer architecture based on the attention mechanism has quadratic complexity regarding dataset size, which in turn increases the overhead on training and inference time, and limits the capacity of the models to handle large-scale datasets. In this work, we propose TACO, an end-to-end tabular compression model that compresses the training dataset in a latent space. We test our method on the TabArena benchmark, where our proposed method is up to 94x faster in inference time, while consuming up to 97% less memory compared to the state-of-the-art tabular transformer architecture, all while retaining performance without significant degradation. Lastly, our method not only scales better with increased dataset sizes, but it also achieves better performance compared to other baselines.
[477] Alignment Verifiability in Large Language Models: Normative Indistinguishability under Behavioral Evaluation
Igor Santos-Grueiro
Main category: cs.LG
TL;DR: Behavioral evaluation of LLM alignment is fundamentally limited - observed compliance doesn’t uniquely identify latent alignment properties due to normative indistinguishability under partial observability.
Details
Motivation: Current alignment evaluation relies on behavioral evidence (benchmarks, red-teaming) but treats observed compliance as evidence of underlying alignment without analyzing the inference problem itself.
Method: Formal framing of alignment evaluation as identifiability question under partial observability, introducing Alignment Verifiability Problem and Normative Indistinguishability concepts.
Result: Negative identifiability theorem: under finite behavioral evaluation and evaluation-aware agents, behavioral compliance doesn’t uniquely identify latent alignment. Alignment tests estimate indistinguishability classes rather than verify alignment.
Conclusion: Behavioral evaluation provides upper bounds on observable compliance within a regime, not guarantees of underlying alignment. This reframes how alignment benchmarks should be interpreted.
Abstract: Behavioral evaluation is the dominant paradigm for assessing alignment in large language models (LLMs). In practice, alignment is inferred from performance under finite evaluation protocols - benchmarks, red-teaming suites, or automated pipelines - and observed compliance is often treated as evidence of underlying alignment. This inference step, from behavioral evidence to claims about latent alignment properties, is typically implicit and rarely analyzed as an inference problem in its own right. We study this problem formally. We frame alignment evaluation as an identifiability question under partial observability and allow agent behavior to depend on information correlated with the evaluation regime. Within this setting, we introduce the Alignment Verifiability Problem and the notion of Normative Indistinguishability, capturing when distinct latent alignment hypotheses induce identical distributions over all evaluator-accessible signals. Our main result is a negative but sharply delimited identifiability theorem. Under finite behavioral evaluation and evaluation-aware agents, observed behavioral compliance does not uniquely identify latent alignment. That is, even idealized behavioral evaluation cannot, in general, certify alignment as a latent property. We further show that behavioral alignment tests should be interpreted as estimators of indistinguishability classes rather than verifiers of alignment. Passing increasingly stringent tests may reduce the space of compatible hypotheses, but cannot collapse it to a singleton under the stated conditions. This reframes alignment benchmarks as providing upper bounds on observable compliance within a regime, rather than guarantees of underlying alignment.
[478] Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
Main category: cs.LG
TL;DR: This paper studies the long-term tail decay of SGD-based methods using large deviations theory, providing upper and lower bounds on failure probabilities for both vanilla SGD and clipped SGD under different noise assumptions.
Details
Motivation: Existing work on SGD tail behavior focuses on finite-time guarantees and high-probability bounds for fixed probability thresholds, but lacks analysis of long-term tail decay rates for fixed error thresholds, which is crucial for modern models trained for millions of iterations.
Method: The authors use large deviations theory to analyze the long-term tail decay of SGD-based methods. They study vanilla SGD with non-convex costs and bounded noise, and clipped SGD under heavy-tailed noise with bounded moments of order p ∈ (1,2]. They provide both upper bounds and matching lower bounds on tail decay rates.
Result: For vanilla SGD with bounded noise, they show long-term tail decay at rate e^{-t/log(t)}. For clipped SGD with heavy-tailed noise (p ∈ (1,2]), they show decay at rate e^{-t^{β_p}/log(t)} where β_p = 4(p-1)/(3p-2) for p ∈ (1,2) and e^{-t/log²(t)} for p=2. They also provide matching lower bounds at rate e^{-t}, showing their rates are tight up to poly-logarithmic factors.
Conclusion: The paper demonstrates significantly faster long-term tail decay rates than previously known (e^{-t} vs e^{-√t}), providing stronger guarantees for individual runs of SGD-based algorithms and uncovering regimes where tails decay much faster than existing finite-time bounds suggest.
Abstract: The study of tail behaviour of SGD-induced processes has been attracting a lot of interest, due to offering strong guarantees with respect to individual runs of an algorithm. While many works provide high-probability guarantees, quantifying the error rate for a fixed probability threshold, there is a lack of work directly studying the probability of failure, i.e., quantifying the tail decay rate for a fixed error threshold. Moreover, existing results are of finite-time nature, limiting their ability to capture the true long-term tail decay which is more informative for modern learning models, typically trained for millions of iterations. Our work closes these gaps, by studying the long-term tail decay of SGD-based methods through the lens of large deviations theory, establishing several strong results in the process. First, we provide an upper bound on the tails of the gradient norm-squared of the best iterate produced by (vanilla) SGD, for non-convex costs and bounded noise, with long-term decay at rate $e^{-t/\log(t)}$. Next, we relax the noise assumption by considering clipped SGD (c-SGD) under heavy-tailed noise with bounded moment of order $p \in (1,2]$, showing an upper bound with long-term decay at rate $e^{-t^{β_p}/\log(t)}$, where $β_p = \frac{4(p-1)}{3p-2}$ for $p \in (1,2)$ and $e^{-t/\log^2(t)}$ for $p = 2$. Finally, we provide lower bounds on the tail decay, at rate $e^{-t}$, showing that our rates for both SGD and c-SGD are tight, up to poly-logarithmic factors. Notably, our results demonstrate an order of magnitude faster long-term tail decay compared to existing work based on finite-time bounds, which show rates $e^{-\sqrt{t}}$ and $e^{-t^{β_p/2}}$, $p \in (1,2]$, for SGD and c-SGD, respectively. As such, we uncover regimes where the tails decay much faster than previously known, providing stronger long-term guarantees for individual runs.
[479] Probabilistic Multi-Regional Solar Power Forecasting with Any-Quantile Recurrent Neural Networks
Slawek Smyl, Paweł Pełka, Grzegorz Dudek
Main category: cs.LG
TL;DR: AQ-RNN framework for probabilistic PV power forecasting across multiple regions using any-quantile estimation and spatial-temporal modeling
Details
Motivation: Increasing PV generation introduces uncertainty in power systems, requiring probabilistic forecasting beyond deterministic predictions to support uncertainty-aware energy management.
Method: Any-Quantile Recurrent Neural Network (AQ-RNN) with dual-track architecture for series-specific and cross-regional information, dilated recurrent cells, patch-based temporal modeling, and dynamic ensemble mechanism.
Result: Demonstrated consistent improvements in forecast accuracy, calibration, and prediction interval quality using 30 years of hourly PV data from 259 European regions compared to statistical and neural baselines
Conclusion: The proposed framework enables calibrated conditional quantile estimation at arbitrary probability levels and effectively exploits spatial dependencies for robust system-level forecasting suitable for renewable-dominated power systems
Abstract: The increasing penetration of photovoltaic (PV) generation introduces significant uncertainty into power system operation, necessitating forecasting approaches that extend beyond deterministic point predictions. This paper proposes an any-quantile probabilistic forecasting framework for multi-regional PV power generation based on the Any-Quantile Recurrent Neural Network (AQ-RNN). The model integrates an any-quantile forecasting paradigm with a dual-track recurrent architecture that jointly processes series-specific and cross-regional contextual information, supported by dilated recurrent cells, patch-based temporal modeling, and a dynamic ensemble mechanism. The proposed framework enables the estimation of calibrated conditional quantiles at arbitrary probability levels within a single trained model and effectively exploits spatial dependencies to enhance robustness at the system level. The approach is evaluated using 30 years of hourly PV generation data from 259 European regions and compared against established statistical and neural probabilistic baselines. The results demonstrate consistent improvements in forecast accuracy, calibration, and prediction interval quality, underscoring the suitability of the proposed method for uncertainty-aware energy management and operational decision-making in renewable-dominated power systems.
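The any-quantile idea can be sketched independently of the AQ-RNN architecture: sample a quantile level per training example, feed it to the network as an input, and train with the pinball loss at that level, so a single model can later be queried at arbitrary probabilities. The model and data below are toy stand-ins, not the paper's dual-track recurrent network.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(24 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def pinball_loss(pred, target, tau):
    err = target - pred
    return torch.maximum(tau * err, (tau - 1) * err).mean()

for step in range(500):
    history = torch.randn(128, 24)                         # toy stand-in for 24 h of PV output
    target = history.mean(dim=1, keepdim=True) + 0.3 * torch.randn(128, 1)
    tau = torch.rand(128, 1)                               # a fresh quantile level per example
    pred = model(torch.cat([history, tau], dim=-1))
    loss = pinball_loss(pred, target, tau)
    opt.zero_grad(); loss.backward(); opt.step()

# At inference time the same model can be queried at any probability level.
h = torch.randn(1, 24)
q05 = model(torch.cat([h, torch.tensor([[0.05]])], dim=-1))
q95 = model(torch.cat([h, torch.tensor([[0.95]])], dim=-1))
print(q05.item(), q95.item())
```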
[480] Accelerating Benchmarking of Functional Connectivity Modeling via Structure-aware Core-set Selection
Ling Zhan, Zhen Li, Junjie Huang, Tao Jia
Main category: cs.LG
TL;DR: SCLCS is a self-supervised framework for selecting representative core-sets that preserve relative performance rankings of functional connectivity operators in fMRI data, enabling efficient benchmarking.
Details
Motivation: Exhaustive evaluation of hundreds of functional connectivity modeling methods on large-scale fMRI datasets is computationally prohibitive, preventing routine benchmarking in neuroscience.
Method: Uses adaptive Transformer to learn each sample’s unique FC structure, introduces Structural Perturbation Score to quantify stability, and applies density-balanced sampling for diversity.
Result: On REST-meta-MDD dataset, SCLCS preserves ground-truth model ranking with just 10% of data, outperforming SOTA core-set selection methods by up to 23.2% in ranking consistency.
Conclusion: First work to formalize core-set selection for FC operator benchmarking, making large-scale comparisons feasible in computational neuroscience.
Abstract: Benchmarking the hundreds of functional connectivity (FC) modeling methods on large-scale fMRI datasets is critical for reproducible neuroscience. However, the combinatorial explosion of model-data pairings makes exhaustive evaluation computationally prohibitive, preventing such assessments from becoming a routine pre-analysis step. To break this bottleneck, we reframe the challenge of FC benchmarking by selecting a small, representative core-set whose sole purpose is to preserve the relative performance ranking of FC operators. We formalize this as a ranking-preserving subset selection problem and propose Structure-aware Contrastive Learning for Core-set Selection (SCLCS), a self-supervised framework to select these core-sets. SCLCS first uses an adaptive Transformer to learn each sample’s unique FC structure. It then introduces a novel Structural Perturbation Score (SPS) to quantify the stability of these learned structures during training, identifying samples that represent foundational connectivity archetypes. Finally, while SCLCS identifies stable samples via a top-k ranking, we further introduce a density-balanced sampling strategy as a necessary correction to promote diversity, ensuring the final core-set is both structurally robust and distributionally representative. On the large-scale REST-meta-MDD dataset, SCLCS preserves the ground-truth model ranking with just 10% of the data, outperforming state-of-the-art (SOTA) core-set selection methods by up to 23.2% in ranking consistency (nDCG@k). To our knowledge, this is the first work to formalize core-set selection for FC operator benchmarking, thereby making large-scale operator comparisons a feasible and integral part of computational neuroscience. Code is publicly available at https://github.com/lzhan94swu/SCLCS
[481] Stable but Wrong: When More Data Degrades Scientific Conclusions
Zhipeng Zhang, Kai Li
Main category: cs.LG
TL;DR: Standard inference procedures can systematically converge to incorrect conclusions despite appearing well-calibrated, due to unobservable degradation in observational reliability.
Details
Motivation: To challenge the implicit belief that accumulating more data always makes scientific conclusions more reliable, and to identify fundamental limits of data-driven science.
Method: Identifies a structural regime where standard inference procedures converge smoothly and appear well-calibrated but systematically produce incorrect conclusions. Uses minimal synthetic experiments to demonstrate the phenomenon.
Result: Shows that in this regime, additional data amplify errors rather than correct them, while diagnostic checks remain misleadingly normal. Reveals that stability, convergence, and confidence are insufficient indicators of epistemic validity.
Conclusion: Inference cannot be treated as an unconditional consequence of data availability; it must be governed by explicit constraints on the integrity of the observational process.
Abstract: Modern science increasingly relies on ever-growing observational datasets and automated inference pipelines, under the implicit belief that accumulating more data makes scientific conclusions more reliable. Here we show that this belief can fail in a fundamental and irreversible way. We identify a structural regime in which standard inference procedures converge smoothly, remain well calibrated, and pass conventional diagnostic checks, yet systematically converge to incorrect conclusions. This failure arises when the reliability of observations degrades in a manner that is intrinsically unobservable to the inference process itself. Using minimal synthetic experiments, we demonstrate that in this regime additional data do not correct error but instead amplify it, while residual-based and goodness-of-fit diagnostics remain misleadingly normal. These results reveal an intrinsic limit of data-driven science: stability, convergence, and confidence are not sufficient indicators of epistemic validity. We argue that inference cannot be treated as an unconditional consequence of data availability, but must instead be governed by explicit constraints on the integrity of the observational process.
[482] Perception-Based Beliefs for POMDPs with Visual Observations
Miriam Schäfers, Merlijn Krale, Thiago D. Simão, Nils Jansen, Maximilian Weininger
Main category: cs.LG
TL;DR: PBP framework integrates perception models (image classifiers) with traditional POMDP solvers to handle high-dimensional visual observations by mapping images to state distributions, enabling efficient planning without explicit reasoning over observation spaces.
Details
Motivation: Traditional POMDP solvers struggle with high-dimensional observations like camera images. There's a need to bridge perception (visual understanding) with planning under uncertainty for real-world robotics and AI applications.
Method: Introduces Perception-based Beliefs for POMDPs (PBP) framework that uses an image classifier to map visual observations to probability distributions over states. These distributions are incorporated into belief updates, allowing traditional solvers to work without reasoning over high-dimensional observations. Includes uncertainty quantification methods to handle classifier imprecision.
Result: PBP outperforms existing end-to-end deep RL methods and uncertainty quantification improves robustness against visual corruption. The belief update coincides with standard belief update when classifier is exact.
Conclusion: PBP successfully bridges perception and planning for POMDPs with high-dimensional observations, demonstrating that traditional solvers can be effectively combined with perception models for robust decision-making under uncertainty.
Abstract: Partially observable Markov decision processes (POMDPs) are a principled planning model for sequential decision-making under uncertainty. Yet, real-world problems with high-dimensional observations, such as camera images, remain intractable for traditional belief- and filtering-based solvers. To tackle this problem, we introduce the Perception-based Beliefs for POMDPs framework (PBP), which complements such solvers with a perception model. This model takes the form of an image classifier which maps visual observations to probability distributions over states. PBP incorporates these distributions directly into belief updates, so the underlying solver does not need to reason explicitly over high-dimensional observation spaces. We show that the belief update of PBP coincides with the standard belief update if the image classifier is exact. Moreover, to handle classifier imprecision, we incorporate uncertainty quantification and introduce two methods to adjust the belief update accordingly. We implement PBP using two traditional POMDP solvers and empirically show that (1) it outperforms existing end-to-end deep RL methods and (2) uncertainty quantification improves robustness of PBP against visual corruption.
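A minimal sketch of the perception-based correction step, under strong simplifying assumptions: a classifier's p(state | image) output is multiplied into the predicted belief in place of an explicit observation likelihood. The transition matrix and classifier output below are hypothetical, and the paper's exact update and uncertainty-quantification adjustments are not reproduced.

```python
import numpy as np

n_states = 4
T = np.array([[0.7, 0.3, 0.0, 0.0],     # hypothetical transition model P(s' | s, a) for one action
              [0.0, 0.7, 0.3, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.3, 0.0, 0.0, 0.7]])

def perception_belief_update(belief, classifier_probs):
    predicted = belief @ T                      # prediction step under the chosen action
    corrected = predicted * classifier_probs    # reweight by the classifier's p(state | image)
    return corrected / corrected.sum()

belief = np.full(n_states, 1.0 / n_states)
classifier_probs = np.array([0.05, 0.80, 0.10, 0.05])   # hypothetical classifier output for one image
print(perception_belief_update(belief, classifier_probs))
```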
[483] Mining Generalizable Activation Functions
Alex Vitvitskyi, Michael Boratko, Matej Grcic, Razvan Pascanu, Deep Shah, Petar Veličković
Main category: cs.LG
TL;DR: AlphaEvolve uses frontier LLMs as mutator operators to evolve novel activation functions in a flexible Python function search space, targeting both performance improvements and specific inductive biases using OOD data as fitness functions.
Details
Motivation: Activation functions significantly impact neural network optimization and inductive bias, but current search methods are limited by manually constructed search spaces. The paper aims to leverage modern evolutionary pipelines with LLMs for more flexible and efficient activation function discovery.
Method: Uses AlphaEvolve pipeline with frontier LLMs as mutator operators to search over all possible Python functions within computational constraints. Employs out-of-distribution data performance as fitness function to discover activation functions with specific inductive biases.
Result: Shows that relatively small-scale synthetic datasets are sufficient for AlphaEvolve to discover meaningful activation functions. The LLM-based approach enables searching over a much wider and more flexible space than manually constructed search spaces.
Conclusion: Evolutionary search with LLMs provides a powerful framework for discovering novel activation functions that can encode specific inductive biases, going beyond just performance optimization to create architectures with desired non-linear behaviors.
Abstract: The choice of activation function is an active area of research, with different proposals aimed at improving optimization, while maintaining expressivity. Additionally, the activation function can significantly alter the implicit inductive bias of the architecture, controlling its non-linear behavior. In this paper, in line with previous work, we argue that evolutionary search provides a useful framework for finding new activation functions, while we also make two novel observations. The first is that modern pipelines, such as AlphaEvolve, which relies on frontier LLMs as a mutator operator, allows for a much wider and flexible search space; e.g., over all possible python functions within a certain FLOP budget, eliminating the need for manually constructed search spaces. In addition, these pipelines will be biased towards meaningful activation functions, given their ability to represent common knowledge, leading to a potentially more efficient search of the space. The second observation is that, through this framework, one can target not only performance improvements but also activation functions that encode particular inductive biases. This can be done by using performance on out-of-distribution data as a fitness function, reflecting the degree to which the architecture respects the inherent structure in the data in a manner independent of distribution shifts. We carry an empirical exploration of this proposal and show that relatively small scale synthetic datasets can be sufficient for AlphaEvolve to discover meaningful activations.
[484] Almost Asymptotically Optimal Active Clustering Through Pairwise Observations
Rachel S. Y. Teo, P. N. Karthik, Ramya Korlakai Vinayak, Vincent Y. F. Tan
Main category: cs.LG
TL;DR: A theoretical framework for clustering items via active pairwise queries with noisy bandit feedback, establishing fundamental query complexity lower bounds and designing asymptotically optimal algorithms.
Details
Motivation: The paper addresses the problem of clustering items when only noisy pairwise similarity feedback is available through active queries, which is common in real-world applications where ground truth labels are expensive or unavailable.
Method: Uses change-of-measure technique to establish fundamental lower bounds on query complexity, then designs algorithms using Generalized Likelihood Ratio (GLR) statistics with empirical stopping criteria, including a computationally feasible variant.
Result: Establishes theoretical lower bounds on expected queries needed for desired clustering accuracy, develops asymptotically optimal algorithms whose performance remains within constant multiples of the lower bound.
Conclusion: Provides a comprehensive theoretical framework for active clustering with noisy feedback, with both fundamental limits and practical algorithms that approach these limits.
Abstract: We propose a new analysis framework for clustering $M$ items into an unknown number of $K$ distinct groups using noisy and actively collected responses. At each time step, an agent is allowed to query pairs of items and observe bandit binary feedback. If the pair of items belongs to the same (resp.\ different) cluster, the observed feedback is $1$ with probability $p>1/2$ (resp.\ $q<1/2$). Leveraging the ubiquitous change-of-measure technique, we establish a fundamental lower bound on the expected number of queries needed to achieve a desired confidence in the clustering accuracy, formulated as a sup-inf optimization problem. Building on this theoretical foundation, we design an asymptotically optimal algorithm in which the stopping criterion involves an empirical version of the inner infimum – the Generalized Likelihood Ratio (GLR) statistic – being compared to a threshold. We develop a computationally feasible variant of the GLR statistic and show that its performance gap to the lower bound can be accurately empirically estimated and remains within a constant multiple of the lower bound.
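The observation model is easy to simulate: a pair query returns 1 with probability p for same-cluster pairs and q otherwise, and repeated queries of a pair accumulate a log-likelihood ratio for "same cluster". The sketch below is only this toy simulator and score, not the paper's GLR stopping rule or query-allocation strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 0.8, 0.2                                  # same-cluster / different-cluster response rates
true_labels = rng.integers(0, 3, size=30)        # hidden clustering of 30 items

def query(i, j):
    same = true_labels[i] == true_labels[j]
    return rng.random() < (p if same else q)     # noisy binary feedback for the pair (i, j)

def same_cluster_llr(ones, total):
    zeros = total - ones
    return ones * np.log(p / q) + zeros * np.log((1 - p) / (1 - q))

obs = [query(3, 7) for _ in range(20)]           # repeatedly query one pair
print(same_cluster_llr(sum(obs), len(obs)))      # positive values favor "same cluster"
```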
[485] FedRandom: Sampling Consistent and Accurate Contribution Values in Federated Learning
Arno Geimer, Beltran Fiz Pontiveros, Radu State
Main category: cs.LG
TL;DR: FedRandom addresses instability in contribution assessment for federated learning by treating it as a statistical estimation problem, generating more samples to provide more consistent and reliable evaluation of participant contributions.
Details
Motivation: In federated learning deployments where participants incur costs and expect compensation, fairly assessing individual contributions is crucial for identifying malicious actors and free-riders. However, recent works show significant inherent instability in contribution estimations across aggregation strategies, which can harm participant willingness to engage in federations.
Method: FedRandom is a novel mitigation technique that tackles contribution instability as a statistical estimation problem. It allows generating more samples than regular FL strategies, providing more consistent and reliable evaluation of participant contributions through increased sampling.
Result: FedRandom reduces the overall distance to ground truth by more than a third in half of all evaluated scenarios across CIFAR-10, MNIST, CIFAR-100 and FMNIST datasets with different data distributions. It improves stability in more than 90% of cases.
Conclusion: FedRandom effectively addresses the contribution instability problem in federated learning by providing more reliable and consistent assessment of participant contributions, which is crucial for fair compensation and maintaining participant engagement in federations.
Abstract: Federated Learning is a privacy-preserving decentralized approach for Machine Learning tasks. In industry deployments characterized by a limited number of entities possessing abundant data, the significance of a participant’s role in shaping the global model becomes pivotal given that participation in a federation incurs costs, and participants may expect compensation for their involvement. Additionally, the contributions of participants serve as a crucial means to identify and address potential malicious actors and free-riders. However, fairly assessing individual contributions remains a significant hurdle. Recent works have demonstrated a considerable inherent instability in contribution estimations across aggregation strategies. While employing a different strategy may offer convergence benefits, this instability can have potentially harmful effects on the willingness of participants to engage in the federation. In this work, we introduce FedRandom, a novel mitigation technique to the contribution instability problem. Tackling the instability as a statistical estimation problem, FedRandom allows us to generate more samples than when using regular FL strategies. We show that these additional samples provide a more consistent and reliable evaluation of participant contributions. We demonstrate our approach using different data distributions across CIFAR-10, MNIST, CIFAR-100 and FMNIST and show that FedRandom reduces the overall distance to the ground truth by more than a third in half of all evaluated scenarios, and improves stability in more than 90% of cases.
[486] CSRv2: Unlocking Ultra-Sparse Embeddings
Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You
Main category: cs.LG
TL;DR: CSRv2 enables ultra-sparse embeddings (only 2 active features) to match performance of much denser representations while achieving 7x speedup over compact dense embeddings and 300x efficiency gains over dense embeddings.
Details
Motivation: Current dense embeddings are high-dimensional and computationally expensive, while existing sparse methods like CSR suffer severe performance degradation in ultra-sparse regimes where most neurons remain inactive, limiting efficiency gains.
Method: CSRv2 introduces progressive k-annealing to stabilize sparsity learning, supervised contrastive objectives to enhance representation quality, and full backbone finetuning for end-to-end adaptability.
Result: Reduces dead neurons from 80% to 20%, achieves 14% accuracy gain at k=2, matches CSR at k=8 and MRL at 32 dimensions with only 2 active features, delivers 7x speedup over MRL and up to 300x compute/memory efficiency improvements over dense embeddings.
Conclusion: CSRv2 makes ultra-sparse embeddings practical without performance compromise, enabling real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.
Abstract: In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these, Contrastive Sparse Representation (CSR) is recently proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL, and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance, where CSRv2 achieves 7%/4% improvement over CSR when k=4 and further increases this gap to 14%/6% when k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.
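A rough sketch of the two ingredients that are easiest to show in code, top-k sparsification of an embedding head and a progressive k-annealing schedule, is given below; the supervised contrastive objective and backbone finetuning that CSRv2 also relies on are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TopKSparseHead(nn.Module):
    """Projects a dense embedding to a wide space and keeps only the k largest activations."""
    def __init__(self, d_in: int, d_sparse: int):
        super().__init__()
        self.proj = nn.Linear(d_in, d_sparse)

    def forward(self, x, k: int):
        z = torch.relu(self.proj(x))
        topk = z.topk(k, dim=-1)
        out = torch.zeros_like(z)
        out.scatter_(-1, topk.indices, topk.values)
        return out

def annealed_k(step, total_steps, k_start=64, k_end=2):
    frac = min(step / total_steps, 1.0)            # linear schedule from k_start down to k_end
    return max(k_end, round(k_start * (1 - frac) + k_end * frac))

head = TopKSparseHead(d_in=384, d_sparse=4096)
x = torch.randn(8, 384)
for step in (0, 500, 1000):
    k = annealed_k(step, total_steps=1000)
    emb = head(x, k)
    print(step, k, int((emb != 0).sum(-1).max()))  # active features per embedding shrink over training
```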
[487] Limitations of SGD for Multi-Index Models Beyond Statistical Queries
Daniel Barzilai, Ohad Shamir
Main category: cs.LG
TL;DR: The paper develops a new framework to study limitations of standard SGD for single-index and multi-index models, addressing shortcomings of existing SQ-based analyses that don’t reflect real SGD noise.
Details
Motivation: Existing Statistical Queries (SQ) framework analyses of gradient methods have limitations: they use adversarial or specially-structured gradient noise that doesn't reflect standard SGD noise, sometimes leading to incorrect predictions. Many analyses also rely on non-trivial algorithmic modifications rather than studying standard vanilla SGD.
Method: Develops a new, non-SQ framework to study limitations of standard vanilla SGD for single-index and multi-index models (where target function depends on low-dimensional projection of inputs). The framework applies to broad settings including potentially deep neural networks.
Result: The paper presents a new analytical framework that better captures the limitations of standard SGD compared to existing SQ-based approaches, providing more accurate predictions for single-index and multi-index models.
Conclusion: The new framework addresses shortcomings of existing SQ analyses by better modeling standard SGD noise and avoiding algorithmic modifications, providing more realistic limitations analysis for gradient methods on structured learning problems.
Abstract: Understanding the limitations of gradient methods, and stochastic gradient descent (SGD) in particular, is a central challenge in learning theory. To that end, a commonly used tool is the Statistical Queries (SQ) framework, which studies performance limits of algorithms based on noisy interaction with the data. However, it is known that the formal connection between the SQ framework and SGD is tenuous: Existing results typically rely on adversarial or specially-structured gradient noise that does not reflect the noise in standard SGD, and (as we point out here) can sometimes lead to incorrect predictions. Moreover, many analyses of SGD for challenging problems rely on non-trivial algorithmic modifications, such as restricting the SGD trajectory to the sphere or using very small learning rates. To address these shortcomings, we develop a new, non-SQ framework to study the limitations of standard vanilla SGD, for single-index and multi-index models (namely, when the target function depends on a low-dimensional projection of the inputs). Our results apply to a broad class of settings and architectures, including (potentially deep) neural networks.
[488] Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Xin Chen, Jie Zhang, Florian Tramer
Main category: cs.LG
TL;DR: AutoInject: RL-based framework for automated prompt injection attacks on LLM agents, generating universal adversarial suffixes that compromise frontier models while preserving utility on benign tasks.
Details
Motivation: Prompt injection is a critical vulnerability in LLM agents, but current methods rely heavily on human red-teamers and hand-crafted prompts, limiting scalability and adaptability. There's a need for automated, optimization-based approaches to generate effective adversarial attacks.
Method: AutoInject uses reinforcement learning to generate universal, transferable adversarial suffixes. It jointly optimizes for attack success and utility preservation on benign tasks. The black-box method supports both query-based optimization and transfer attacks to unseen models/tasks, using only a 1.5B parameter adversarial suffix generator.
Result: Successfully compromised frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
Conclusion: AutoInject demonstrates that automated, optimization-based approaches can effectively generate universal adversarial suffixes for prompt injection attacks, overcoming limitations of human-dependent methods and providing a scalable framework for vulnerability assessment.
Abstract: Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
[489] Fix Representation (Optimally) Before Fairness: Finite-Sample Shrinkage Population Correction and the True Price of Fairness Under Subpopulation Shift
Amir Asiaee, Kaveh Aryan
Main category: cs.LG
TL;DR: The paper analyzes fairness-accuracy tradeoffs under subpopulation shift, showing that apparent “fairness helps accuracy” can be artifacts of misrepresented subgroup proportions in training data, and proposes an evaluation protocol with shrinkage reweighting to isolate the true price of fairness.
Details
Motivation: The paper addresses the observed tension between predictive accuracy and group fairness constraints in machine learning, where sometimes fairness interventions appear to improve accuracy. The authors aim to understand whether these phenomena are genuine or artifacts of training data that misrepresents subgroup proportions.
Method: The authors analyze fairness-accuracy tradeoffs under subpopulation shift (stable within-group distributions but shifted group proportions). They establish theoretical results about importance-weighted correction, propose an optimal finite-sample correction using shrinkage reweighting that interpolates between target and training mixtures, and develop an actionable evaluation protocol that fixes representation optimally before fairness interventions.
Result: Theoretical analysis shows that full importance-weighted correction is asymptotically unbiased but finite-sample suboptimal, and apparent “fairness helps accuracy” can arise from comparing fairness methods to an improperly-weighted baseline. Experiments on synthetic and real-world benchmarks (Adult, COMPAS) validate theoretical predictions and demonstrate that the proposed protocol eliminates spurious tradeoffs, revealing the genuine fairness-utility frontier.
Conclusion: The paper concludes that proper evaluation requires fixing representation (optimally) before fairness interventions, and comparing fairness methods against a shrinkage-corrected baseline to isolate the true, irreducible price of fairness. This protocol helps distinguish genuine fairness-accuracy tradeoffs from artifacts of data misrepresentation.
Abstract: Machine learning practitioners frequently observe tension between predictive accuracy and group fairness constraints – yet sometimes fairness interventions appear to improve accuracy. We show that both phenomena can be artifacts of training data that misrepresents subgroup proportions. Under subpopulation shift (stable within-group distributions, shifted group proportions), we establish: (i) full importance-weighted correction is asymptotically unbiased but finite-sample suboptimal; (ii) the optimal finite-sample correction is a shrinkage reweighting that interpolates between target and training mixtures; (iii) apparent “fairness helps accuracy” can arise from comparing fairness methods to an improperly-weighted baseline. We provide an actionable evaluation protocol: fix representation (optimally) before fairness – compare fairness interventions against a shrinkage-corrected baseline to isolate the true, irreducible price of fairness. Experiments on synthetic and real-world benchmarks (Adult, COMPAS) validate our theoretical predictions and demonstrate that this protocol eliminates spurious tradeoffs, revealing the genuine fairness-utility frontier.
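Editor's note: the shrinkage correction can be pictured as a per-group reweighting that sits between the raw training mixture and full importance weighting toward the target mixture. The sketch below is a hypothetical illustration of that interpolation; the variable names, the fixed coefficient `lam`, and the simple ratio rule are assumptions, not the paper's derived optimal estimator.

```python
# Hypothetical sketch of shrinkage-reweighted sample weights under subpopulation
# shift; names and the interpolation rule are illustrative assumptions.
import numpy as np

def shrinkage_weights(groups, train_props, target_props, lam):
    """Per-sample weights interpolating between training and target mixtures.

    lam = 0 keeps the (uncorrected) training mixture,
    lam = 1 applies full importance weighting toward the target mixture.
    """
    mixed = lam * target_props + (1.0 - lam) * train_props
    ratio = mixed / train_props          # per-group importance ratio
    return ratio[groups]                 # broadcast to each sample's group label

groups = np.array([0, 0, 1, 1, 1, 0])    # toy subgroup labels
train_props = np.array([0.8, 0.2])       # proportions observed in training data
target_props = np.array([0.5, 0.5])      # proportions in the deployment population
print(shrinkage_weights(groups, train_props, target_props, lam=0.5))
```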
[490] Projected Boosting with Fairness Constraints: Quantifying the Cost of Fair Training Distributions
Amir Asiaee, Kaveh Aryan
Main category: cs.LG
TL;DR: FairBoost incorporates group fairness constraints into AdaBoost by projecting the ensemble-induced distribution onto a fair convex set, quantifying the accuracy-fairness tradeoff through a KL-divergence term in convergence bounds.
Details
Motivation: To develop boosting algorithms that incorporate group fairness constraints while preserving analyzable training dynamics and theoretical guarantees, addressing the need for fair machine learning models with provable properties.
Method: FairBoost projects the ensemble-induced exponential-weights distribution onto a convex set of distributions satisfying fairness constraints (as a reweighting surrogate), then trains weak learners on this fair distribution. The projection reduces the effective edge of weak learners by a quantity controlled by the KL-divergence of the projection.
Result: Theoretical analysis proves an exponential-loss bound where convergence rate depends on weak learner edge minus a “fairness cost” term δ_t = √(KL(w^t ∥ q^t)/2), directly quantifying accuracy-fairness tradeoff. Experiments on standard benchmarks validate theoretical predictions and show competitive fairness-accuracy tradeoffs with stable training curves.
Conclusion: FairBoost successfully incorporates fairness constraints into boosting while maintaining theoretical analyzability, providing a principled framework that quantifies the inherent tradeoff between accuracy and fairness in ensemble learning.
Abstract: Boosting algorithms enjoy strong theoretical guarantees: when weak learners maintain positive edge, AdaBoost achieves geometric decrease of exponential loss. We study how to incorporate group fairness constraints into boosting while preserving analyzable training dynamics. Our approach, FairBoost, projects the ensemble-induced exponential-weights distribution onto a convex set of distributions satisfying fairness constraints (as a reweighting surrogate), then trains weak learners on this fair distribution. The key theoretical insight is that projecting the training distribution reduces the effective edge of weak learners by a quantity controlled by the KL-divergence of the projection. We prove an exponential-loss bound where the convergence rate depends on weak learner edge minus a “fairness cost” term $\delta_t = \sqrt{\mathrm{KL}(w^t \| q^t)/2}$. This directly quantifies the accuracy-fairness tradeoff in boosting dynamics. Experiments on standard benchmarks validate the theoretical predictions and demonstrate competitive fairness-accuracy tradeoffs with stable training curves.
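Editor's note: the fairness cost in the bound is easy to compute once the projected distribution is known. The snippet below evaluates δ_t = √(KL(w^t ∥ q^t)/2) for a toy boosting weight vector and a hypothetical fairness-projected distribution; the projection step itself is assumed given rather than computed.

```python
# Minimal sketch of the "fairness cost" term delta_t = sqrt(KL(w || q) / 2)
# for illustrative discrete boosting weights; q plays the role of the
# fairness-projected distribution and is simply assumed here.
import numpy as np

def kl(p, q, eps=1e-12):
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

w = np.array([0.4, 0.3, 0.2, 0.1])        # ensemble-induced exponential weights
q = np.array([0.25, 0.25, 0.25, 0.25])    # weights after a hypothetical fair projection
delta_t = np.sqrt(kl(w, q) / 2.0)         # reduction in the usable weak-learner edge
print(delta_t)
```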
[491] Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou
Main category: cs.LG
TL;DR: VSD (Variational Speculative Decoding) improves speculative decoding for MLLMs by training draft models using variational inference over multiple draft paths, optimizing for target-model acceptance probability rather than single greedy trajectories.
Details
Motivation: Existing speculative decoding methods have a training-decoding discrepancy: they optimize for single greedy trajectories during training, but actual decoding involves verifying and ranking multiple sampled draft paths. This mismatch reduces efficiency.
Method: VSD formulates draft training as variational inference over latent proposals (draft paths), maximizing marginal probability of target-model acceptance. Uses EM procedure with MCMC sampling from oracle-filtered posterior (E-step) and weighted likelihood maximization with Adaptive Rejection Weighting and Confidence-Aware Regularization (M-step).
Result: VSD achieves up to 9.6% speedup over EAGLE-3 and 7.9% over ViSpec across LLMs and MLLMs, significantly improving decoding efficiency with theoretical guarantees of increased expected acceptance length and speedup.
Conclusion: VSD addresses the training-decoding discrepancy in speculative decoding by optimizing for multiple draft paths through variational inference, leading to substantial efficiency improvements for MLLM inference.
Abstract: Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws MCMC samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
[492] Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang
Main category: cs.LG
TL;DR: Muon optimizer shows exponential speedup over GD in linear associative memory with hierarchical frequency spectrum by mitigating imbalanced learning rates of frequency components through implicit matrix preconditioning.
Details
Motivation: Muon optimizer has shown strong empirical gains but lacks theoretical understanding of its dynamics and scaling behavior, particularly in how it differs from standard gradient descent in learning hierarchical frequency components.
Method: Analyze Muon in a linear associative memory model with softmax retrieval and hierarchical frequency spectrum over query-answer pairs, comparing with Gradient Descent (GD) in both noiseless and noisy cases with power-decay frequency spectrum.
Result: GD learns frequency components at highly imbalanced rates (bottlenecked by low-frequency components), while Muon mitigates this imbalance for faster, more uniform progress. Muon achieves exponential speedup over GD in noiseless case and superior scaling efficiency in noisy case with power-decay spectrum.
Conclusion: Muon acts as an implicit matrix preconditioner from adaptive task alignment and block-symmetric gradient structure, providing theoretical justification for its empirical success and explaining why coordinate-wise sign operators cannot match its performance without oracle access to task representations.
Abstract: Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon’s optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be interpreted as an implicit matrix preconditioner arising from adaptive task alignment and block-symmetric gradient structure. In contrast, the preconditioner with coordinate-wise sign operator could match Muon under oracle access to unknown task representations, which is infeasible for SignGD in practice. Experiments on synthetic long-tail classification and LLaMA-style pre-training corroborate the theory.
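Editor's note: the core operation is the matrix sign of the gradient, which equalizes progress across all directions of a weight matrix. The toy sketch below computes it via the SVD polar factor and contrasts it with a plain GD step; it is a simplified illustration, not the paper's full optimizer (which also uses momentum and Newton-Schulz iterations).

```python
# Toy sketch of a Muon-style update: replace the gradient matrix by its matrix
# sign (the orthogonal polar factor U V^T), contrasted with plain gradient descent.
import numpy as np

def matrix_sign(g):
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt                        # all singular values rescaled to 1

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
grad = rng.normal(size=(4, 3))

lr = 0.1
W_gd = W - lr * grad                     # GD: step size follows the gradient's singular values
W_muon = W - lr * matrix_sign(grad)      # Muon-like: uniform progress in all directions
print(np.linalg.svd(matrix_sign(grad), compute_uv=False))  # all ~1.0
```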
[493] How to Achieve the Intended Aim of Deep Clustering Now, without Deep Learning
Kai Ming Ting, Wei-Jie Xu, Hang Zhang
Main category: cs.LG
TL;DR: DEC’s deep-learned representation doesn’t overcome k-means limitations; non-deep methods using distributional information work better for arbitrary cluster shapes, sizes, and densities.
Details
Motivation: To investigate whether Deep Embedded Clustering (DEC) and deep clustering methods in general overcome the fundamental limitations of k-means clustering: inability to discover clusters of arbitrary shapes, varied sizes, and densities.
Method: Analysis of DEC’s approach (autoencoder-based latent representation with k-means-like clustering) and comparison with non-deep learning methods that exploit underlying data distribution and cluster distributional information.
Result: DEC’s deep-learned representation fails to address k-means’ fundamental limitations. Non-deep learning approaches that use distributional information of clusters achieve better performance for discovering arbitrary cluster shapes, sizes, and densities.
Conclusion: Deep clustering methods like DEC don’t overcome k-means limitations; distributional information is key for handling arbitrary cluster characteristics, which non-deep methods can effectively utilize.
Abstract: Deep clustering (DC) is often quoted to have a key advantage over $k$-means clustering. Yet, this advantage is often demonstrated using image datasets only, and it is unclear whether it addresses the fundamental limitations of $k$-means clustering. Deep Embedded Clustering (DEC) learns a latent representation via an autoencoder and performs clustering based on a $k$-means-like procedure, while the optimization is conducted in an end-to-end manner. This paper investigates whether the deep-learned representation has enabled DEC to overcome the known fundamental limitations of $k$-means clustering, i.e., its inability to discover clusters of arbitrary shapes, varied sizes and densities. Our investigations on DEC have a wider implication on deep clustering methods in general. Notably, none of these methods exploit the underlying data distribution. We uncover that a non-deep learning approach achieves the intended aim of deep clustering by making use of distributional information of clusters in a dataset to effectively address these fundamental limitations.
[494] In-context Time Series Predictor
Jiecheng Lu, Yan Sun, Shihao Yang
Main category: cs.LG
TL;DR: Reformulating time series forecasting as (lookback, future) token pairs to leverage Transformer in-context learning without pre-trained LLM parameters.
Details
Motivation: To fully utilize Transformer-based LLMs' in-context learning capabilities for time series forecasting, addressing limitations of previous Transformer-based or LLM-based methods that don't align well with inherent in-context mechanisms and suffer from issues like overfitting.
Method: Reformulate time series forecasting tasks as input tokens by constructing series of (lookback, future) pairs within tokens, creating a parameter-efficient approach that doesn’t require pre-trained LLM parameters.
Result: Consistently achieves better performance across full-data, few-shot, and zero-shot settings compared to previous architectures, addressing overfitting issues in existing Transformer-based TSF models
Conclusion: The proposed token-based reformulation better aligns with Transformer in-context learning mechanisms, providing a more parameter-efficient and effective approach for time series forecasting
Abstract: Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate “time series forecasting tasks” as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms, and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.
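Editor's note: the reformulation amounts to slicing a series into paired windows that act as in-context examples. The sketch below builds such (lookback, future) pairs; the window lengths and the absence of a final query token are illustrative assumptions, not the paper's exact tokenization.

```python
# Hedged sketch of turning a univariate series into (lookback, future) pairs
# used as in-context "tokens".
import numpy as np

def make_context_tokens(series, lookback, horizon):
    tokens = []
    for t in range(lookback, len(series) - horizon + 1):
        x = series[t - lookback:t]       # lookback window
        y = series[t:t + horizon]        # future window to predict
        tokens.append((x, y))
    return tokens

series = np.sin(np.linspace(0, 6 * np.pi, 120))
tokens = make_context_tokens(series, lookback=16, horizon=4)
print(len(tokens), tokens[0][0].shape, tokens[0][1].shape)
```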
[495] Pseudo-Invertible Neural Networks
Yamit Ehrlich, Nimrod Berman, Assaf Shocher
Main category: cs.LG
TL;DR: SPNN introduces a non-linear generalization of Moore-Penrose pseudo-inverse for neural networks, enabling tractable inversion and back-projection for non-linear mappings, extending zero-shot inverse problem solving beyond linear degradations.
Details
Motivation: The Moore-Penrose pseudo-inverse is fundamental for linear systems but lacks non-linear generalization. The authors aim to extend pseudo-inversion to neural networks to enable tractable inversion of non-linear mappings, particularly for solving zero-shot inverse problems with complex non-linear degradations.
Method: Proposes Surjective Pseudo-invertible Neural Networks (SPNN) - architectures explicitly designed to admit tractable non-linear pseudo-inverse. Formalizes Non-Linear Back-Projection (NLBP) that guarantees consistency constraints for non-linear mappings via the defined pseudo-inverse. Extends diffusion-based null-space projection to non-linear degradations.
Result: Enables zero-shot inversion of complex non-linear degradations including optical distortions and semantic abstractions like classification. Allows precise semantic control over generative outputs without retraining diffusion priors.
Conclusion: SPNN provides a principled framework for non-linear pseudo-inversion in neural networks, significantly expanding the scope of zero-shot inverse problem solving beyond linear cases to handle complex non-linear information loss.
Abstract: The Moore-Penrose Pseudo-inverse (PInv) serves as the fundamental solution for linear systems. In this paper, we propose a natural generalization of PInv to the nonlinear regime in general and to neural networks in particular. We introduce Surjective Pseudo-invertible Neural Networks (SPNN), a class of architectures explicitly designed to admit a tractable non-linear PInv. The proposed non-linear PInv and its implementation in SPNN satisfy fundamental geometric properties. One such property is null-space projection or “Back-Projection”, $x’ = x + A^\dagger(y-Ax)$, which moves a sample $x$ to its closest consistent state $x’$ satisfying $Ax=y$. We formalize Non-Linear Back-Projection (NLBP), a method that guarantees the same consistency constraint for non-linear mappings $f(x)=y$ via our defined PInv. We leverage SPNNs to expand the scope of zero-shot inverse problems. Diffusion-based null-space projection has revolutionized zero-shot solving for linear inverse problems by exploiting closed-form back-projection. We extend this method to non-linear degradations. Here, “degradation” is broadly generalized to include any non-linear loss of information, spanning from optical distortions to semantic abstractions like classification. This approach enables zero-shot inversion of complex degradations and allows precise semantic control over generative outputs without retraining the diffusion prior.
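Editor's note: the classical linear back-projection that SPNN generalizes is worth seeing once in code, since it makes the consistency property concrete: $x' = x + A^\dagger(y - Ax)$ is the smallest move that makes $Ax' = y$. The snippet below checks this with NumPy's pseudo-inverse; it shows only the linear case, not the paper's non-linear NLBP.

```python
# Linear back-projection x' = x + A^+(y - A x) using the Moore-Penrose pseudo-inverse.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 8))        # wide (surjective) linear "degradation"
x = rng.normal(size=8)             # current sample
y = rng.normal(size=3)             # target measurement

x_bp = x + np.linalg.pinv(A) @ (y - A @ x)
print(np.allclose(A @ x_bp, y))    # True: x_bp is consistent with y
print(np.linalg.norm(x_bp - x))    # size of the correction applied to x
```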
[496] Cross-Domain Offline Policy Adaptation via Selective Transition Correction
Mengbei Yan, Jiafei Lyu, Shengjie Sun, Zhongjian Qiao, Jingwen Yang, Zichuan Lin, Deheng Ye, Xiu Li
Main category: cs.LG
TL;DR: STC algorithm adapts policies across domains with mismatched dynamics in offline RL by correcting source domain transitions to align with target domain dynamics using inverse policy and reward models, with forward dynamics filtering for reliability.
Details
Motivation: Cross-domain offline RL faces challenges when datasets from different domains have mismatched dynamics. Direct merging leads to suboptimal performance, while existing approaches like transition filtering or reward modification insufficiently exploit valuable source domain data.
Method: Proposes Selective Transition Correction (STC) algorithm that modifies source domain data to match target domain dynamics. Uses inverse policy model and reward model to correct actions and rewards of source transitions, then employs forward dynamics model to retain corrected samples that better match target dynamics than original transitions.
Result: Experiments on various environments with dynamics shifts demonstrate that STC achieves superior performance against existing baselines in cross-domain offline RL.
Conclusion: STC enables reliable usage of source domain data for policy adaptation across domains with dynamics mismatches by explicitly aligning source transitions with target domain dynamics through correction and selective filtering.
Abstract: It remains a critical challenge to adapt policies across domains with mismatched dynamics in reinforcement learning (RL). In this paper, we study cross-domain offline RL, where an offline dataset from another similar source domain can be accessed to enhance policy learning upon a target domain dataset. Directly merging the two datasets may lead to suboptimal performance due to potential dynamics mismatches. Existing approaches typically mitigate this issue through source domain transition filtering or reward modification, which, however, may lead to insufficient exploitation of the valuable source domain data. Instead, we propose to modify the source domain data into the target domain data. To that end, we leverage an inverse policy model and a reward model to correct the actions and rewards of source transitions, explicitly achieving alignment with the target dynamics. Since limited data may result in inaccurate model training, we further employ a forward dynamics model to retain corrected samples that better match the target dynamics than the original transitions. Consequently, we propose the Selective Transition Correction (STC) algorithm, which enables reliable usage of source domain data for policy adaptation. Experiments on various environments with dynamics shifts demonstrate that STC achieves superior performance against existing baselines.
[497] Shared LoRA Subspaces for almost Strict Continual Learning
Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Rama Chellappa, Alan Yuille
Main category: cs.LG
TL;DR: Share: A parameter-efficient continual finetuning method that learns a single shared low-rank subspace for lifelong learning across tasks and modalities, achieving massive parameter/memory savings while maintaining performance.
Details
Motivation: Current parameter-efficient tuning methods like LoRA lack mechanisms for strict continual learning and knowledge integration without data replay or multiple adapters, making them inefficient for lifelong learning scenarios.
Method: Share learns and dynamically updates a single shared low-rank subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions, enabling forward knowledge transfer while minimizing catastrophic interference.
Result: Achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods while maintaining performance comparable to jointly trained models; a single Share model can replace hundreds of task-specific LoRA adapters.
Conclusion: Share provides a practical and scalable solution for lifelong learning in large-scale AI systems, validated across image classification, natural language understanding, 3D pose estimation, and text-to-image generation tasks.
Abstract: Adapting large pretrained models to new tasks efficiently and continually is crucial for real-world deployment but remains challenging due to catastrophic forgetting and the high cost of retraining. While parameter-efficient tuning methods like low rank adaptation (LoRA) reduce computational demands, they lack mechanisms for strict continual learning and knowledge integration, without relying on data replay, or multiple adapters. We propose Share, a novel approach to parameter efficient continual finetuning that learns and dynamically updates a single, shared low-rank subspace, enabling seamless adaptation across multiple tasks and modalities. Share constructs a foundational subspace that extracts core knowledge from past tasks and incrementally integrates new information by identifying essential subspace directions. Knowledge from each new task is incorporated into this evolving subspace, facilitating forward knowledge transfer, while minimizing catastrophic interference. This approach achieves up to 100x parameter reduction and 281x memory savings over traditional LoRA methods, maintaining performance comparable to jointly trained models. A single Share model can replace hundreds of task-specific LoRA adapters, supporting scalable, asynchronous continual learning. Experiments across image classification, natural language understanding, 3D pose estimation, and text-to-image generation validate its effectiveness, making Share a practical and scalable solution for lifelong learning in large-scale AI systems.
[498] How Controlling the Variance can Improve Training Stability of Sparsely Activated DNNs and CNNs
Emily Dent, Jared Tanner
Main category: cs.LG
TL;DR: Larger Gaussian process variances in network initialization enable 90% activation sparsity while maintaining accuracy, potentially reducing energy consumption in ML models.
Details
Motivation: To understand how initialization strategies affect training stability and expressivity in deep networks with sparse activations, particularly for energy-efficient ML models.
Method: Analyzes Edge-of-Chaos initialization strategy and Gaussian process characterization of intermediate layers, focusing on variance effects with CReLU activation functions.
Result: Initializations with larger Gaussian process variances enable 90% activation sparsity in DNNs/CNNs while maintaining full accuracy and improving training stability.
Conclusion: Proper variance tuning in Gaussian process initialization enables high activation sparsity without accuracy loss, offering energy-saving potential for ML models.
Abstract: The intermediate layers of deep networks can be characterised as a Gaussian process, in particular the Edge-of-Chaos (EoC) initialisation strategy prescribes the limiting covariance matrix of the Gaussian process. Here we show that the under-utilised chosen variance of the Gaussian process is important in the training of deep networks with sparsity inducing activation, such as a shifted and clipped ReLU, $\text{CReLU}_{\tau,m}(x)=\min(\max(x-\tau,0),m)$. Specifically, initialisations leading to larger fixed Gaussian process variances allow for improved expressivity with activation sparsity as large as 90% in DNNs and CNNs, and generally improve the stability of the training process. Enabling full, or near full, accuracy at such high levels of sparsity in the hidden layers suggests a promising mechanism to reduce the energy consumption of machine learning models involving fully connected layers.
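Editor's note: the activation itself is a one-liner, and the role of the shift τ in inducing sparsity is easy to check numerically. The snippet below implements CReLU_{τ,m}(x) = min(max(x − τ, 0), m) exactly as written in the abstract and measures the resulting sparsity on Gaussian pre-activations; the specific τ and m values are only examples.

```python
# Shifted-and-clipped ReLU and a quick check of how the shift controls sparsity.
import numpy as np

def crelu(x, tau, m):
    return np.minimum(np.maximum(x - tau, 0.0), m)

z = np.random.default_rng(0).normal(size=100_000)   # Gaussian pre-activations
for tau in (0.0, 1.0, 1.3):
    a = crelu(z, tau=tau, m=1.0)
    print(f"tau={tau}: sparsity={np.mean(a == 0.0):.2f}")
```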
[499] Parity, Sensitivity, and Transformers
Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga
Main category: cs.LG
TL;DR: A theoretical analysis of transformer architecture capabilities for solving PARITY, presenting both a new constructive upper bound and the first lower bound for single-layer transformers.
Details
Motivation: Despite transformers being nearly a decade old, there's limited understanding of their computational capabilities. The paper aims to understand what transformers can or cannot compute, specifically focusing on the PARITY problem as a benchmark.
Method: Theoretical analysis and constructive proof: (1) Provides a new transformer construction for PARITY using softmax, length-independent positional encoding, no layernorm, working with/without causal masking; (2) Proves lower bound showing PARITY cannot be solved with only one layer and one head.
Result: Successfully constructs a transformer that solves PARITY with practical features (softmax, length-independent positional encoding, no layernorm) and proves the first lower bound showing single-layer, single-head transformers cannot solve PARITY.
Conclusion: The paper advances theoretical understanding of transformer capabilities, showing PARITY can be solved with practical transformer variants while establishing fundamental limitations of minimal transformer architectures.
Abstract: The transformer architecture is almost a decade old. Despite that, we still have a limited understanding of what this architecture can or cannot compute. For instance, can a 1-layer transformer solve PARITY – or more generally – which kinds of transformers can do it? Known constructions for PARITY have at least 2 layers and employ impractical features: either a length-dependent positional encoding, or hardmax, or layernorm without the regularization parameter, or they are not implementable with causal masking. We give a new construction of a transformer for PARITY with softmax, length-independent and polynomially bounded positional encoding, no layernorm, working both with and without causal masking. We also give the first lower bound for transformers solving PARITY – by showing that it cannot be done with only one layer and one head.
[500] Distributional Reinforcement Learning with Diffusion Bridge Critics
Shutong Ding, Yimiao Zhou, Ke Hu, Mokai Pan, Shan Zhong, Yanwei Fu, Jingya Wang, Ye Shi
Main category: cs.LG
TL;DR: DBC introduces diffusion bridge critics for distributional RL, modeling Q-value inverse CDFs to capture value distributions accurately, with analytic integration to address discretization errors.
Details
Motivation: Existing diffusion RL methods focus on policies but neglect critics, despite critic accuracy being more crucial than policy expressiveness. Since RL tasks are stochastic, critics should be distributional models, but current approaches don't fully leverage diffusion's distribution-matching capabilities.
Method: Proposes Diffusion Bridge Critics (DBC) that directly model the inverse CDF of Q values using diffusion bridge models. Uses diffusion’s strong distribution-matching to prevent collapse into trivial Gaussian distributions. Derives analytic integral formula to address discretization errors in value estimation.
Result: Experimental results on MuJoCo robot control benchmarks demonstrate DBC’s superiority over previous distributional critic models. Shows improved performance in continuous control tasks.
Conclusion: DBC successfully applies diffusion bridge models to critics, providing accurate value distribution estimation. The method is plug-and-play and can be integrated into existing RL frameworks, offering better distributional modeling than previous approaches.
Abstract: Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results in a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving the diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, it has been confirmed that the critic is more appropriately depicted with a distributional model. Motivated by these points, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value. This allows us to accurately capture the value distribution and prevents it from collapsing into a trivial Gaussian distribution owing to the strong distribution-matching capability of the diffusion bridge. Moreover, we further derive an analytic integral formula to address discretization errors in DBC, which is essential in value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC compared with previous distributional critic models.
[501] Regularized Calibration with Successive Rounding for Post-Training Quantization
Seohyeon Cha, Huancheng Chen, Dongjun Kim, Haoran Zhang, Kevin Chan, Gustavo de Veciana, Haris Vikalo
Main category: cs.LG
TL;DR: A new post-training quantization method for LLMs that uses regularized asymmetric calibration and bounded search to improve quantization quality with modest computational overhead.
Details
Motivation: LLMs face deployment challenges due to memory and latency costs from billions of parameters. While PTQ enables efficient inference by mapping weights to low-bit formats without retraining, its effectiveness depends on quantization objectives and rounding procedures.
Method: Proposes interpolating between symmetric and asymmetric calibration as regularization, preserving quadratic PTQ structure while providing robustness to activation mismatch. Derives a successive rounding procedure incorporating asymmetric calibration, plus a bounded-search extension for explicit trade-off between quantization quality and compute cost.
Result: Experiments across multiple LLM families, quantization bit-widths, and benchmarks show the proposed bounded search with regularized asymmetric calibration consistently improves perplexity and accuracy over PTQ baselines.
Conclusion: The method provides better quantization quality with only modest and controllable additional computational cost, addressing key deployment challenges for LLMs.
Abstract: Large language models (LLMs) deliver robust performance across diverse applications, yet their deployment often faces challenges due to the memory and latency costs of storing and accessing billions of parameters. Post-training quantization (PTQ) enables efficient inference by mapping pretrained weights to low-bit formats without retraining, but its effectiveness depends critically on both the quantization objective and the rounding procedure used to obtain low-bit weight representations. In this work, we show that interpolating between symmetric and asymmetric calibration acts as a form of regularization that preserves the standard quadratic structure used in PTQ while providing robustness to activation mismatch. Building on this perspective, we derive a simple successive rounding procedure that naturally incorporates asymmetric calibration, as well as a bounded-search extension that allows for an explicit trade-off between quantization quality and the compute cost. Experiments across multiple LLM families, quantization bit-widths, and benchmarks demonstrate that the proposed bounded search based on a regularized asymmetric calibration objective consistently improves perplexity and accuracy over PTQ baselines, while incurring only modest and controllable additional computational cost.
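Editor's note: to give a feel for "successive rounding", the sketch below rounds one weight at a time to the grid value that minimizes a standard quadratic calibration objective ||(w − ŵ)X||², given the choices already made. The grid, the objective, and the row-by-row treatment are simplifying assumptions for illustration, not the paper's regularized asymmetric calibration or bounded search.

```python
# Generic greedy successive-rounding loop under a quadratic calibration objective.
import numpy as np

def successive_round(w, X, scale):
    w_q = w.copy()
    for i in range(len(w)):
        candidates = [np.floor(w[i] / scale) * scale, np.ceil(w[i] / scale) * scale]
        losses = []
        for c in candidates:
            trial = w_q.copy()
            trial[i] = c
            losses.append(np.sum(((w - trial) @ X) ** 2))   # calibration error on activations
        w_q[i] = candidates[int(np.argmin(losses))]
    return w_q

rng = np.random.default_rng(0)
w = rng.normal(size=8)                 # one row of full-precision weights
X = rng.normal(size=(8, 64))           # calibration activations
print(successive_round(w, X, scale=0.25))
```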
[502] LittleBit: Ultra Low-Bit Quantization via Latent Factorization
Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim
Main category: cs.LG
TL;DR: LittleBit: A novel framework for extreme LLM compression to sub-1-bit quantization (as low as 0.1 bits per weight) using low-rank latent matrix factorization with multi-scale compensation mechanisms.
Details
Motivation: Large language models face prohibitive memory and computational requirements for deployment. While quantization helps, maintaining model fidelity in sub-1-bit regimes remains challenging, limiting practical deployment in resource-constrained environments.
Method: Uses low-rank latent matrix factorization to represent weights, then binarizes the factors. Incorporates multi-scale compensation mechanism learning importance parameters across row, column, and latent dimensions. Key contributions: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training initialization and Residual Compensation to minimize approximation errors.
Result: Achieves 0.1 bits per weight quantization, compressing Llama2-13B to under 0.9 GB (31× memory reduction). At 0.1 BPW, surpasses leading techniques operating at 0.7 BPW on Llama2-7B. Enables 11.6× inference speedup relative to FP16.
Conclusion: LittleBit establishes new size-performance trade-offs for extreme LLM compression, making powerful LLMs practical for resource-constrained environments while maintaining model fidelity in sub-1-bit regimes.
Abstract: The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a persistent challenge. In this paper, we introduce LittleBit, a novel framework for extreme LLM compression. We target quantization rates as low as $0.1$ bits per weight (BPW), achieving a memory reduction of approximately $31\times$, which effectively compresses Llama2-13B to under $0.9$ GB. We represent weights via low-rank latent matrix factorization and subsequently binarize the resulting factors. To counteract the information loss inherent to such drastic precision reduction, we integrate a multi-scale compensation mechanism that learns importance parameters across row, column, and latent dimensions. Two primary contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and Residual Compensation to minimize approximation errors. Extensive experiments confirm the superiority of LittleBit in the sub-1-bit domain; for instance, our method at $0.1$ BPW surpasses the performance of leading techniques operating at $0.7$ BPW on Llama2-7B. We establish a new size-performance trade-off – unlocking a potential $11.6\times$ inference speedup relative to FP16 – and render powerful LLMs practical for resource-constrained environments. Our code is available at https://github.com/SamsungLabs/LittleBit.
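Editor's note: the basic representational idea, a low-rank factorization whose factors are binarized and rescaled by per-row, per-column, and per-latent importance scales, can be sketched in a few lines. The scale initialization below (mean absolute values, identity column scale) is a naive assumption; the paper learns these scales with QAT and residual compensation, so the error printed here is much larger than what the method achieves.

```python
# Rough sketch of the LittleBit-style representation: low-rank factors, binarized,
# with simple importance scales. Not the paper's trained compensation mechanism.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 48))
rank = 4

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * np.sqrt(S[:rank])            # (64, rank) latent factor
B = np.sqrt(S[:rank])[:, None] * Vt[:rank]     # (rank, 48) latent factor

row_scale = np.mean(np.abs(A), axis=1, keepdims=True)   # per-row importance
lat_scale = np.mean(np.abs(B), axis=1, keepdims=True)   # per-latent importance
col_scale = np.ones((1, W.shape[1]))                     # per-column scale (learned in the paper)

W_hat = (row_scale * np.sign(A)) @ (lat_scale * np.sign(B)) * col_scale
print(np.linalg.norm(W - W_hat) / np.linalg.norm(W))     # relative error of this naive sketch
```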
[503] Selecting Hyperparameters for Tree-Boosting
Floris Jan Koster, Fabio Sigrist
Main category: cs.LG
TL;DR: Empirical comparison of hyperparameter optimization methods for tree-boosting on 59 datasets shows SMAC outperforms other methods, with key findings on tuning requirements and hyperparameter importance.
Details
Motivation: Tree-boosting is widely used for tabular data but its accuracy heavily depends on hyperparameters. There's a need to empirically compare popular hyperparameter optimization methods to determine which works best in practice.
Method: Compared several hyperparameter optimization methods including random grid search, TPE, GP-BO, Hyperband, SMAC, and deterministic full grid search across 59 regression and classification datasets to evaluate their performance for tree-boosting.
Result: SMAC method clearly outperforms all other considered methods. Key findings: (1) >100 trials needed for accurate tuning, (2) default hyperparameters yield inaccurate models, (3) all hyperparameters can materially affect accuracy (no small set is more important), (4) early stopping for boosting iterations works better than including it in search space for regression.
Conclusion: SMAC is the recommended hyperparameter optimization method for tree-boosting, requiring substantial tuning effort (>100 trials) as default values are inadequate and all hyperparameters matter for model accuracy.
Abstract: Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy is critically dependent on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search using $59$ regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) a relatively large number of trials larger than $100$ is required for accurate tuning, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than others, and (iv) choosing the number of boosting iterations using early stopping yields more accurate results compared to including it in the search space for regression tasks.
[504] Verification of the Implicit World Model in a Generative Model via Adversarial Sequences
András Balogh, Márk Jelasity
Main category: cs.LG
TL;DR: The paper proposes adversarial sequence generation methods to verify soundness of sequence models trained on chess games, finding no models are fully sound but some training techniques improve soundness.
Details
Motivation: To develop practical tools for verifying whether generative sequence models trained on sample sequences can capture the true structure of languages/world models, using chess as a test domain with simple rule-based world model.
Method: Proposes adversarial sequence generation where adversaries generate valid chess sequences to force sequence models to predict invalid next moves. Evaluates on chess models trained on random/high-quality games with different training recipes, and investigates board state probes in training/attack methods.
Result: None of the trained chess models are sound, but some training techniques and dataset choices improve soundness remarkably. Board state probes show extracted board states have no causal role in next token prediction in most models.
Conclusion: Adversarial sequence generation is effective for falsifying soundness and analyzing failure modes of sequence models. While models can generate valid sequences, they are not fully sound, and training improvements only partially address this limitation.
Abstract: Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether – or to what extent – sample-based training is able to capture the true structure of these languages, often referred to as the “world model”. Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule-based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine-grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high-quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.
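Editor's note: the soundness check at the heart of the attack is simple to state in code: play any valid game prefix, ask the model for a next move, and test legality. The sketch below uses the python-chess package for the rules; `predict_next_move` is a hypothetical stand-in for a trained sequence model (here a random policy that occasionally emits a malformed move), and the random prefix replaces the paper's learned adversaries.

```python
# Soundness check: is the model's predicted next move legal in the current position?
import random
import chess

def predict_next_move(board):
    # Placeholder "model": usually legal, sometimes deliberately malformed.
    moves = [m.uci() for m in board.legal_moves]
    return random.choice(moves + ["a1a1"])

def is_legal_uci(board, uci):
    try:
        return chess.Move.from_uci(uci) in board.legal_moves
    except ValueError:          # malformed UCI strings count as unsound predictions
        return False

board = chess.Board()
for _ in range(20):             # adversary: any valid prefix (random moves here)
    if board.is_game_over():
        break
    board.push(random.choice(list(board.legal_moves)))

pred = predict_next_move(board)
print(pred, "legal" if is_legal_uci(board, pred) else "ILLEGAL -> soundness falsified")
```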
[505] Classification Under Local Differential Privacy with Model Reversal and Model Averaging
Caihong Qin, Yang Bai
Main category: cs.LG
TL;DR: The paper proposes transfer learning techniques for improving classification performance under Local Differential Privacy (LDP) by treating noisy LDP data as source domain and clean data as target domain, with methods including utility estimation, model reversal, and model averaging.
Details
Motivation: Local Differential Privacy (LDP) provides strong privacy guarantees but introduces significant noise that reduces data utility. The authors aim to improve classification performance under LDP without compromising privacy by reframing it as a transfer learning problem.
Method: Three novel techniques: (1) noised binary feedback-based evaluation for estimating dataset utility, (2) model reversal to salvage underperforming classifiers by inverting decision boundaries, and (3) model averaging that weights multiple reversed classifiers based on estimated utility.
Result: Theoretical excess risk bounds under LDP are provided, showing how the methods reduce this risk. Empirical results on simulated and real-world datasets demonstrate substantial improvements in classification accuracy.
Conclusion: The proposed transfer learning approach effectively improves classification performance under LDP while maintaining privacy guarantees, offering practical solutions to the utility-privacy tradeoff in private machine learning.
Abstract: Local differential privacy (LDP) has become a central topic in data privacy research, offering strong privacy guarantees by perturbing user data at the source and removing the need for a trusted curator. However, the noise introduced by LDP often significantly reduces data utility. To address this issue, we reinterpret private learning under LDP as a transfer learning problem, where the noisy data serve as the source domain and the unobserved clean data as the target. We propose novel techniques specifically designed for LDP to improve classification performance without compromising privacy: (1) a noised binary feedback-based evaluation mechanism for estimating dataset utility; (2) model reversal, which salvages underperforming classifiers by inverting their decision boundaries; and (3) model averaging, which assigns weights to multiple reversed classifiers based on their estimated utility. We provide theoretical excess risk bounds under LDP and demonstrate how our methods reduce this risk. Empirical results on both simulated and real-world datasets show substantial improvements in classification accuracy.
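Editor's note: "model reversal" and utility-weighted "model averaging" have a very small conceptual core: flip a binary classifier whose estimated accuracy is below 1/2, then let several such (possibly reversed) classifiers vote with utility-based weights. The toy sketch below illustrates that core; the weighting rule and the noisy utility estimates are simple assumptions, not the paper's exact LDP feedback mechanism.

```python
# Toy model reversal + utility-weighted model averaging on synthetic binary labels.
import numpy as np

def reverse_if_bad(preds, est_acc):
    # Flip the decision boundary when the classifier looks worse than chance.
    return (preds if est_acc >= 0.5 else 1 - preds), max(est_acc, 1 - est_acc)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
classifiers = []
for acc in (0.35, 0.55, 0.70):                       # latent true accuracies
    flip = rng.random(200) > acc
    preds = np.where(flip, 1 - y, y)
    est = acc + rng.normal(scale=0.05)               # noisy utility estimate
    classifiers.append(reverse_if_bad(preds, est))

weights = np.array([max(u - 0.5, 0.0) for _, u in classifiers])
votes = np.stack([p for p, _ in classifiers])
ensemble = (weights @ votes / weights.sum()) > 0.5
print("ensemble accuracy:", np.mean(ensemble == y))
```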
[506] CellForge: Agentic Design of Virtual Cell Models
Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Main category: cs.LG
TL;DR: CellForge is a multi-agent framework that autonomously designs neural network architectures for virtual cell modeling using multi-omics data, generating executable implementations through collaborative agent reasoning.
Details
Motivation: Virtual cell modeling faces challenges from biological complexity, multimodal data heterogeneity, and the need for interdisciplinary expertise. Current approaches rely on manual human design or single-LLM prompting, limiting innovation.
Method: Multi-agent framework where specialized agents collaboratively reason to discover candidate neural network architectures from raw multi-omics data and task descriptions, then generate executable implementations. Enables emergence of novel architectural components like trajectory-aware encoders and perturbation diffusion modules.
Result: Evaluated on six datasets spanning gene knockouts, drug treatments, and cytokine stimulations across multiple modalities (scRNA-seq, scATAC-seq, CITE-seq). Generated models are highly competitive with established baselines and reveal systematic patterns of architectural innovation.
Conclusion: Multi-agent collaboration enables genuine methodological innovation and executable solutions that single agents or human experts cannot achieve, representing a paradigm shift toward autonomous scientific method development in computational biology.
Abstract: Virtual cell modeling aims to predict cellular responses to diverse perturbations but faces challenges from biological complexity, multimodal data heterogeneity, and the need for interdisciplinary expertise. We introduce CellForge, a multi-agent framework that autonomously designs and synthesizes neural network architectures tailored to specific single-cell datasets and perturbation tasks. Given raw multi-omics data and task descriptions, CellForge discovers candidate architectures through collaborative reasoning among specialized agents, then generates executable implementations. Our core contribution is the framework itself: showing that multi-agent collaboration mechanisms - rather than manual human design or single-LLM prompting - can autonomously produce executable, high-quality computational methods. This approach goes beyond conventional hyperparameter tuning by enabling entirely new architectural components such as trajectory-aware encoders and perturbation diffusion modules to emerge from agentic deliberation. We evaluate CellForge on six datasets spanning gene knockouts, drug treatments, and cytokine stimulations across multiple modalities (scRNA-seq, scATAC-seq, CITE-seq). The results demonstrate that the models generated by CellForge are highly competitive with established baselines, while revealing systematic patterns of architectural innovation. CellForge highlights the scientific value of multi-agent frameworks: collaboration among specialized agents enables genuine methodological innovation and executable solutions that single agents or human experts cannot achieve. This represents a paradigm shift toward autonomous scientific method development in computational biology. Code is available at https://github.com/gersteinlab/CellForge.
[507] Bifrost: Steering Strategic Trajectories to Bridge Contextual Gaps for Self-Improving Agents
Quan M. Tran, Zhuo Huang, Wenbin Zhang, Bo Han, Koji Yatani, Masashi Sugiyama, Tongliang Liu
Main category: cs.LG
TL;DR: Bifrost is a training-free method that bridges context gaps for trajectory reuse in autonomous agents by leveraging context-trajectory correlations to adapt past successful trajectories to new tasks.
Details
Motivation: Existing autonomous agent self-improvement methods struggle with context mismatch when reusing successful task trajectories across different tasks, leading to either discarded trajectories, heuristic manipulation, high fine-tuning costs, or unreliable performance.
Method: Bifrost leverages the discovered context-trajectory correlation where context shifts parallel trajectory shifts. It uses context differences to guide adaptation of previously solved trajectories to target tasks through representation-level transformation using agent hidden states, ensuring alignment in shared space.
Result: Across diverse benchmarks, Bifrost consistently outperforms existing trajectory reuse and fine-tuned self-improvement methods, demonstrating effective leverage of past experiences despite substantial context shifts.
Conclusion: Agents can effectively reuse past experiences across different contexts through context-aware trajectory adaptation without training, bridging the context gap that previously limited trajectory reuse in autonomous agents.
Abstract: Autonomous agents excel in self-improvement through reflection and iterative refinement, which reuse successful task trajectories as in-context examples to assist subsequent reasoning. However, shifting across tasks often introduces a context mismatch. Hence, existing approaches either discard the trajectories or manipulate them using heuristics, leading to a non-negligible fine-tuning cost or unguaranteed performance. To bridge this gap, we reveal a context-trajectory correlation, where shifts of context are highly parallel with shifts of trajectory. Based on this finding, we propose BrIdge contextual gap FoR imprOvised trajectory STeering (Bifrost), a training-free method that leverages context differences to precisely guide the adaptation of previously solved trajectories towards the target task, mitigating the misalignment caused by context shifts. Our trajectory adaptation is conducted at the representation level using agent hidden states, ensuring trajectory transformation accurately aligns with the target context in a shared space. Across diverse benchmarks, Bifrost consistently outperforms existing trajectory reuse and finetuned self-improvement methods, demonstrating that agents can effectively leverage past experiences despite substantial context shifts.
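Editor's note: the "context shifts parallel trajectory shifts" finding suggests a simple mental model: steer a stored trajectory's hidden states by the difference between target and source context representations. The additive form below is purely an assumption for illustration; the paper performs its own representation-level transformation.

```python
# Illustrative sketch of steering a past trajectory's hidden states by a context difference.
import numpy as np

rng = np.random.default_rng(0)
d = 16
traj_hidden = rng.normal(size=(8, d))                  # hidden states of a past successful trajectory
ctx_source = rng.normal(size=d)                        # representation of the solved task's context
ctx_target = ctx_source + 0.3 * rng.normal(size=d)     # shifted context of the new task

adapted = traj_hidden + (ctx_target - ctx_source)      # steer the trajectory toward the new context
print(np.linalg.norm(adapted - traj_hidden, axis=1))   # a uniform shift across trajectory steps
```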
[508] Inverse Depth Scaling From Most Layers Being Similar
Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
Main category: cs.LG
TL;DR: Depth scaling in LLMs shows loss inversely proportional to depth, likely due to ensemble averaging of functionally similar layers rather than compositional learning, suggesting architectural innovations needed for more efficient depth utilization.
Details
Motivation: While neural scaling laws relate loss to model size, the specific contributions of depth versus width remain unclear. The paper aims to quantify how depth affects loss in LLMs and understand the underlying mechanisms.
Method: Analyzed depth scaling in LLMs and toy residual networks to study how loss scales with depth. Investigated whether depth contributes through compositional learning, discretizing smooth dynamics, or ensemble averaging of functionally similar layers.
Result: Found that loss scales inversely proportional to depth in LLMs. This scaling likely arises from ensemble averaging of functionally similar layers rather than compositional learning or discretizing smooth dynamics. This regime is inefficient but robust.
Conclusion: Current depth utilization in LLMs is inefficient due to architectural bias of residual networks and target functions incompatible with smooth dynamics. Improving LLM efficiency requires architectural innovations to encourage compositional use of depth.
Abstract: Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
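Editor's note: the ensembling intuition behind the 1/depth scaling has a familiar statistical analogue: averaging L roughly interchangeable, noisy contributions reduces the excess error like 1/L. The snippet below is only that analogy made numeric, not a simulation of the paper's residual networks.

```python
# Variance of an average of L interchangeable noisy estimates falls like 1/L.
import numpy as np

rng = np.random.default_rng(0)
target = 1.0
for L in (2, 4, 8, 16, 32):
    estimates = target + rng.normal(scale=1.0, size=(10_000, L))
    mse = np.mean((estimates.mean(axis=1) - target) ** 2)
    print(f"L={L:3d}  excess error ~ {mse:.4f}  (1/L = {1/L:.4f})")
```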
[509] When Are Two RLHF Objectives the Same?
Madhava Gaikwad
Main category: cs.LG
TL;DR: Opal is a canonicalization algorithm that determines algebraic equivalence between preference optimization objectives, revealing many widely used methods optimize the same underlying objective while others are provably distinct.
Details
Motivation: The preference optimization literature contains many proposed objectives presented as distinct improvements, but there's a need to understand whether these objectives are genuinely different or just reparameterizations of the same underlying mathematical form.
Method: Opal is a canonicalization algorithm that determines whether two preference objectives are algebraically equivalent by producing either a canonical form or a concrete witness of non-equivalence. It identifies structural mechanisms that give rise to genuinely different objectives.
Result: Opal reveals that many widely used methods optimize the same underlying objective, while others are provably distinct. For example, batch normalization can cause the same response pair to receive different gradients depending on batch composition.
Conclusion: Most differences among preference optimization objectives are reparameterizations rather than fundamentally new objectives; only a small set of structural mechanisms gives rise to genuinely distinct ones.
Abstract: The preference optimization literature contains many proposed objectives, often presented as distinct improvements. We introduce Opal, a canonicalization algorithm that determines whether two preference objectives are algebraically equivalent by producing either a canonical form or a concrete witness of non-equivalence. Applying Opal reveals that many widely used methods optimize the same underlying objective, while others are provably distinct. For example, batch normalization can cause the same response pair to receive different gradients depending on batch composition. We identify a small set of structural mechanisms that give rise to genuinely different objectives; most remaining differences are reparameterizations.
[510] Principled Confidence Estimation for Deep Computed Tomography
Matteo Gätzner, Johannes Kirschner
Main category: cs.LG
TL;DR: A framework for confidence estimation in CT reconstruction with theoretical coverage guarantees, applicable to both classical and deep learning methods, showing deep methods yield tighter confidence regions while maintaining coverage.
Details
Motivation: Need for reliable uncertainty quantification in medical CT reconstruction, especially for deep learning methods where hallucinations can occur, requiring principled confidence estimation with theoretical guarantees.
Method: Uses sequential likelihood mixing framework to establish confidence regions for CT reconstruction under realistic Beer-Lambert law forward model with Poisson noise, applicable to U-Nets, ensembles, and diffusion models.
Result: Deep reconstruction methods produce substantially tighter confidence regions than classical methods while maintaining theoretical coverage guarantees, enabling detection of hallucinations and interpretable uncertainty visualization.
Conclusion: Deep models can serve as both powerful estimators and reliable tools for uncertainty-aware medical imaging when equipped with principled confidence estimation frameworks.
Abstract: We present a principled framework for confidence estimation in computed tomography (CT) reconstruction. Based on the sequential likelihood mixing framework (Kirschner et al., 2025), we establish confidence regions with theoretical coverage guarantees for deep-learning-based CT reconstructions. We consider a realistic forward model following the Beer-Lambert law, i.e., a log-linear forward model with Poisson noise, closely reflecting clinical and scientific imaging conditions. The framework is general and applies to both classical algorithms and deep learning reconstruction methods, including U-Nets, U-Net ensembles, and generative Diffusion models. Empirically, we demonstrate that deep reconstruction methods yield substantially tighter confidence regions than classical reconstructions, without sacrificing theoretical coverage guarantees. Our approach allows the detection of hallucinations in reconstructed images and provides interpretable visualizations of confidence regions. This establishes deep models not only as powerful estimators, but also as reliable tools for uncertainty-aware medical imaging.
[511] Clifford Kolmogorov-Arnold Networks
Matthias Wolff, Francesco Alesiani, Christof Duhme, Xiaoyi Jiang
Main category: cs.LG
TL;DR: ClKAN is a novel neural network architecture for function approximation in Clifford algebra spaces, addressing exponential scaling with randomized quasi Monte Carlo grids and introducing batch normalization for variable domain inputs.
Details
Motivation: The paper aims to develop efficient function approximation methods in Clifford algebra spaces, which are important for scientific discovery and engineering applications but face exponential scaling challenges in higher dimensions.
Method: Proposes Clifford Kolmogorov-Arnold Network (ClKAN) architecture using Randomized Quasi Monte Carlo grid generation to handle exponential scaling in higher dimensional algebras, with new batch normalization strategies for variable domain inputs.
Result: Validated in synthetic and physics-inspired tasks, demonstrating effective function approximation in Clifford algebra spaces with improved efficiency over traditional approaches.
Conclusion: ClKAN provides a flexible and efficient solution for function approximation in Clifford algebra spaces, enabling practical applications in scientific discovery and engineering domains.
Abstract: We introduce Clifford Kolmogorov-Arnold Network (ClKAN), a flexible and efficient architecture for function approximation in arbitrary Clifford algebra spaces. We propose the use of Randomized Quasi Monte Carlo grid generation as a solution to the exponential scaling associated with higher dimensional algebras. Our ClKAN also introduces new batch normalization strategies to deal with variable domain input. ClKAN finds application in scientific discovery and engineering, and is validated in synthetic and physics-inspired tasks.
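The randomized quasi Monte Carlo grids used to avoid full tensor-product grids can be generated with standard tooling. A minimal sketch using SciPy's scrambled Sobol sampler follows; the dimension 16 (the blade count of a 4-dimensional algebra) and the domain [-1, 1] are illustrative assumptions, not settings from the paper.

```python
import numpy as np
from scipy.stats import qmc

dim = 16          # e.g., number of blades of the Clifford algebra Cl(4): 2**4
n_points = 256    # grid size; a full tensor-product grid would be infeasible here

sampler = qmc.Sobol(d=dim, scramble=True, seed=0)  # randomized (scrambled) QMC
grid = sampler.random(n_points)                    # points in the unit hypercube [0, 1)^dim

# Rescale to a working domain, e.g. [-1, 1]^dim, where spline/KAN bases are defined.
grid = qmc.scale(grid, l_bounds=-np.ones(dim), u_bounds=np.ones(dim))
print(grid.shape)  # (256, 16)
```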
[512] Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers
Artem Riabinin, Andrey Veprikov, Arman Bolatov, Martin Takáč, Aleksandr Beznosikov
Main category: cs.LG
TL;DR: The paper introduces an adaptive learning rate scheduler for norm-constrained optimizers (like Muon and Lion) that automatically determines optimal warm-up duration based on theoretical convergence guarantees under a generalized smoothness assumption.
Details
Motivation: Current learning rate scheduling for norm-constrained optimizers often relies on heuristic warm-up and decay schedules that require manual tuning. The authors aim to develop a theoretically grounded, adaptive scheduler that eliminates the need for extensive hyperparameter search.
Method: The authors first establish a generalized smoothness assumption where local curvature decreases with the suboptimality gap, which they empirically verify. Under this assumption, they derive convergence guarantees that naturally lead to warm-up followed by decay. Based on this theory, they develop a practical scheduler that automatically adapts warm-up duration using only standard hyperparameters.
Result: The method was evaluated on large language model pretraining with LLaMA architectures. The adaptive warm-up selection consistently outperformed or matched the best manually tuned warm-up schedules across all setups, without requiring additional hyperparameter search.
Conclusion: The paper presents a theoretically grounded, adaptive learning rate scheduler for norm-constrained optimizers that automatically determines optimal warm-up duration, eliminating the need for manual tuning while maintaining or improving performance.
Abstract: We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup
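The warm-up-then-decay shape that the theory recovers can be written as a simple schedule. The sketch below is a generic linear-warm-up/cosine-decay schedule for illustration only; the paper's contribution is choosing the warm-up length adaptively, which is not reproduced here, and all constants are assumptions.

```python
import math

def warmup_then_decay(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then cosine decay to zero (illustrative only)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_then_decay(s, total_steps=1000, warmup_steps=100, peak_lr=3e-4)
            for s in range(1000)]
print(max(schedule), schedule[-1])
```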
[513] Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
Main category: cs.LG
TL;DR: Diamond Maps are stochastic flow map models designed for efficient reward alignment in generative models, enabling adaptation to arbitrary preferences and constraints at inference time without costly retraining.
Details
Motivation: Current flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle. The authors argue that efficient reward alignment should be a built-in property of generative models rather than an afterthought.
Method: Proposes “Diamond Maps”, stochastic flow map models that amortize many simulation steps into a single-step sampler while preserving the stochasticity needed for optimal reward alignment. This enables efficient and consistent estimation of value functions for scalable search, sequential Monte Carlo, and guidance.
Result: Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods for adapting generative models to arbitrary preferences.
Conclusion: Diamond Maps provide a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time, making reward alignment an inherent capability rather than a costly post-training process.
Abstract: Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose “Diamond Maps”, stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.
[514] Group-Adaptive Adversarial Learning for Robust Fake News Detection Against Malicious Comments
Zhao Tong, Chunlin Gong, Yimeng Gu, Haichao Shi, Qiang Liu, Shu Wu, Xiao-Yu Zhang
Main category: cs.LG
TL;DR: AdComment: Adaptive adversarial training framework for fake news detection that enhances robustness against diverse malicious comment attacks using LLM-generated perturbations and dynamic resampling.
Details
Motivation: Existing fake news detectors achieve good performance on benchmarks but are vulnerable to malicious comments designed to induce misclassification. Current systems fail to generalize across diverse and novel comment attack patterns, necessitating detection systems that prioritize both accuracy and structural robustness.
Method: Proposes AdComment framework with: 1) Categorization of adversarial comments into Fact Distortion, Logical Confusion, and Emotional Manipulation based on cognitive psychology; 2) Use of LLMs to synthesize diverse, category-specific perturbations; 3) InfoDirichlet Resampling (IDR) mechanism that dynamically adjusts malicious comment proportions during training to steer optimization toward the model’s most susceptible regions.
Result: Achieves state-of-the-art performance on three benchmark datasets, improving F1 scores by 17.9%, 14.5% and 9.0% respectively compared to existing methods.
Conclusion: AdComment effectively enhances fake news detection robustness against diverse malicious comment attacks through adaptive adversarial training and dynamic resampling, addressing the vulnerability of current detectors to evolving adversarial threats.
Abstract: Online fake news profoundly distorts public judgment and erodes trust in social platforms. While existing detectors achieve competitive performance on benchmark datasets, they remain notably vulnerable to malicious comments designed specifically to induce misclassification. This evolving threat landscape necessitates detection systems that simultaneously prioritize predictive accuracy and structural robustness. However, current detectors often fail to generalize across diverse and novel comment attack patterns. To bridge this gap, we propose AdComment, an adaptive adversarial training framework for robustness enhancement against diverse malicious comments. Based on cognitive psychology, we categorize adversarial comments into Fact Distortion, Logical Confusion, and Emotional Manipulation, and leverage LLMs to synthesize diverse, category-specific perturbations. Central to our framework is an InfoDirichlet Resampling (IDR) mechanism that dynamically adjusts malicious comment proportions during training, thereby steering optimization toward the model’s most susceptible regions. Experimental results demonstrate that our approach achieves state-of-the-art performance on three benchmark datasets, improving the F1 scores by 17.9%, 14.5% and 9.0%, respectively.
[515] Synthesizing Realistic Test Data without Breaking Privacy
Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller
Main category: cs.LG
TL;DR: A privacy-preserving synthetic data generation approach using fuzzing and discriminator models to create datasets with original statistical properties without direct access to original data.
Details
Motivation: Need for synthetic datasets that replicate statistical distributions of original data while preserving confidentiality, addressing limitations of GANs which are vulnerable to membership inference attacks and dataset reconstruction attacks.
Method: Uses a test generator (fuzzer) to produce data from input specifications preserving original data constraints, with a discriminator model evaluating closeness to original data. Evolves samples using discriminator feedback to generate privacy-preserving data with the same statistical distributions.
Result: Evaluated on four datasets previously used to benchmark state-of-the-art techniques. The approach shows potential for generating synthetic datasets with high utility while preserving privacy.
Conclusion: Proposed method enables generation of privacy-preserving synthetic data that maintains statistical properties and utility of original datasets without direct data exposure.
Abstract: There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, while only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining “good samples” with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions as the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
[516] Optimism Stabilizes Thompson Sampling for Adaptive Inference
Shunxing Yan, Han Zhong
Main category: cs.LG
TL;DR: Thompson sampling with optimism modifications enables stable asymptotic inference in multi-armed bandits by ensuring arm pull counts concentrate around deterministic scales.
Details
Motivation: Thompson sampling is widely used for stochastic multi-armed bandits but has subtle inferential properties under adaptive data collection. Classical asymptotic theory fails because arm-specific sample sizes are random and coupled with rewards through action-selection rules.
Method: Studies optimism as a mechanism for restoring stability in K-armed Gaussian bandits. Analyzes two approaches: 1) variance-inflated Thompson sampling, and 2) an alternative optimistic modification that keeps posterior variance unchanged but adds an explicit mean bonus to the posterior mean.
Result: Proves variance-inflated TS is stable for any K≥2, resolving an open question from prior work. Establishes same stability conclusion for alternative optimistic modification. Both approaches enable asymptotically valid inference while incurring only mild additional regret cost.
Conclusion: Suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, addressing the fundamental challenge of adaptive data collection in bandit algorithms.
Abstract: Thompson sampling (TS) is widely used for stochastic multi-armed bandits, yet its inferential properties under adaptive data collection are subtle. Classical asymptotic theory for sample means can fail because arm-specific sample sizes are random and coupled with the rewards through the action-selection rule. We study this phenomenon in the $K$-armed Gaussian bandit and identify optimism as a key mechanism for restoring stability, a sufficient condition for valid asymptotic inference requiring each arm’s pull count to concentrate around a deterministic scale. First, we prove that variance-inflated TS (Halder et al., 2025) is stable for any $K \ge 2$, including the challenging regime where multiple arms are optimal. This resolves the open question raised by Halder et al. (2025) through extending their results from the two-armed setting to the general $K$-armed setting. Second, we analyze an alternative optimistic modification that keeps the posterior variance unchanged but adds an explicit mean bonus to the posterior mean, and establish the same stability conclusion. In summary, suitably implemented optimism stabilizes Thompson sampling and enables asymptotically valid inference in multi-armed bandits, while incurring only a mild additional regret cost.
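Variance inflation itself is a one-line change to Gaussian Thompson sampling. The sketch below is a generic simulation of the idea (known unit reward noise, an arbitrarily chosen inflation factor, and a toy instance with two optimal arms), not the construction or proof from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.5, 0.5, 0.2])   # two optimal arms: the hard regime
K, T, inflation = len(true_means), 5000, 2.0

counts = np.ones(K)                      # one forced pull per arm to initialize
sums = rng.normal(true_means, 1.0)

for _ in range(T):
    post_mean = sums / counts
    post_std = np.sqrt(inflation / counts)   # inflated posterior std (plain TS: inflation = 1)
    arm = int(np.argmax(rng.normal(post_mean, post_std)))
    reward = rng.normal(true_means[arm], 1.0)
    counts[arm] += 1
    sums[arm] += reward

print("pull counts:", counts.astype(int))    # stability: no arm's count collapses erratically
```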
[517] Visualizing the loss landscapes of physics-informed neural networks
Conor Rowan, Finn Murphy-Blanchard
Main category: cs.LG
TL;DR: This paper applies loss landscape visualization techniques from traditional ML to physics-informed neural networks, finding that physics loss landscapes share similar properties with data-driven classification problems and are often smooth and convex near solutions.
Details
Motivation: To extend loss landscape studies beyond image classification to physics-informed machine learning, where losses are defined by differential operators rather than large datasets, and to compare different physics loss formulations.
Method: Comprehensive review of loss landscape literature, then empirical investigation using surveyed techniques to analyze landscapes of Deep Ritz and squared residual physics loss formulations in physics-informed neural networks.
Result: Physics-informed neural network loss landscapes share many properties with data-driven classification problems; both Deep Ritz and strong form losses produce similar landscapes that appear smooth, well-conditioned, and convex near solutions.
Conclusion: The loss landscape perspective is valuable for scientific ML, physics loss formulations produce surprisingly similar landscapes, challenging intuitions about complexity in physics-informed networks.
Abstract: Training a neural network requires navigating a high-dimensional, non-convex loss surface to find parameters that minimize this loss. In many ways, it is surprising that optimizers such as stochastic gradient descent and ADAM can reliably locate minima which perform well on both the training and test data. To understand the success of training, a “loss landscape” community has emerged to study the geometry of the loss function and the dynamics of optimization, often using visualization techniques. However, these loss landscape studies have mostly been limited to machine learning for image classification. In the newer field of physics-informed machine learning, little work has been conducted to visualize the landscapes of losses defined not by regression to large data sets, but by differential operators acting on state fields discretized by neural networks. In this work, we provide a comprehensive review of the loss landscape literature, as well as a discussion of the few existing physics-informed works which investigate the loss landscape. We then use a number of the techniques we survey to empirically investigate the landscapes defined by the Deep Ritz and squared residual forms of the physics loss function. We find that the loss landscapes of physics-informed neural networks have many of the same properties as the data-driven classification problems studied in the literature. Unexpectedly, we find that the two formulations of the physics loss often give rise to similar landscapes, which appear smooth, well-conditioned, and convex in the vicinity of the solution. The purpose of this work is to introduce the loss landscape perspective to the scientific machine learning community, compare the Deep Ritz and the strong form losses, and to challenge prevailing intuitions about the complexity of the loss landscapes of physics-informed networks.
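A common recipe from the loss-landscape literature the authors survey is to plot the loss along random directions around trained parameters. The sketch below applies that recipe to a toy stand-in objective; the toy_loss function and parameter values are placeholders, and the filter normalization and Deep Ritz / strong-form losses from the paper are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_loss(theta):
    """Stand-in for a physics residual loss; replace with the real objective."""
    return float(np.sum((theta - 1.0) ** 2) + 0.1 * np.sum(np.sin(5 * theta) ** 2))

theta_star = np.ones(50) + 0.01 * rng.normal(size=50)   # "trained" parameters
direction = rng.normal(size=theta_star.shape)
direction /= np.linalg.norm(direction)                  # unit direction in parameter space

alphas = np.linspace(-1.0, 1.0, 41)
slice_values = [toy_loss(theta_star + a * direction) for a in alphas]

for a, v in zip(alphas[::10], slice_values[::10]):
    print(f"alpha={a:+.2f}  loss={v:.4f}")
```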
[518] Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering
Miranda Muqing Miao, Young-Min Cho, Lyle Ungar
Main category: cs.LG
TL;DR: CORAL is an inference-time steering method that uses regularized MLP probes on model activations to improve both accuracy and calibration in multiple-choice QA without retraining.
Details
Motivation: LLMs suffer from miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can help, but retraining is expensive, and existing inference-time methods optimize proxies for correctness rather than correctness itself.
Method: CORAL uses weight-decay MLP probes to capture distributed correctness signals from the model's internal activations; it is a regularized steering method applied at inference time, requiring no model retraining.
Result: CORAL consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average across three 7B-parameter models. Gains transfer to four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA) with 14% accuracy improvements and 49% ECE improvements.
Conclusion: Distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
Abstract: Large language models (LLMs) exhibit persistent miscalibration, especially after instruction tuning and preference alignment. Modified training objectives can improve calibration, but retraining is expensive. Inference-time steering offers a lightweight alternative, yet most existing methods optimize proxies for correctness rather than correctness itself. We introduce CORAL (Correctness-Optimized Residual Activation Lens), a regularized inference-time steering method that captures distributed correctness signals from model internal activations using weight-decay MLP probes. We evaluate CORAL across three 7B-parameter models and find that it consistently improves accuracy by 10% and expected calibration error (ECE) by 50% on average. We additionally demonstrate that these gains transfer without retraining to the complete published test sets of four held-out benchmarks (ARC-Challenge, HellaSwag, Math-MC, OpenBookQA), averaging 14% accuracy improvements and 49% ECE improvements. Our results support the hypothesis that distributed information in model internals can be extracted using regularized probes when individual neurons are insufficient. CORAL thus provides a compute-efficient, transferable, and calibration-aware approach to improve MCQA performance during inference.
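The probing component can be approximated with off-the-shelf tooling: a small MLP with L2 weight decay trained to predict answer correctness from residual-stream activations. The sketch below uses random features as placeholders for real activations and scikit-learn's MLPClassifier; the hidden size, decay strength, and synthetic labels are assumptions, and CORAL's inference-time steering step is not shown.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for residual-stream activations (n_examples, hidden_dim) and
# per-example correctness labels; in practice these come from the LLM.
activations = rng.normal(size=(2000, 256))
correct = (activations[:, :8].sum(axis=1) + 0.5 * rng.normal(size=2000)) > 0

X_train, X_test, y_train, y_test = train_test_split(
    activations, correct, test_size=0.25, random_state=0)

# alpha is the L2 (weight decay) regularization strength of the probe.
probe = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-2, max_iter=500, random_state=0)
probe.fit(X_train, y_train)
print("held-out probe accuracy:", probe.score(X_test, y_test))
```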
[519] GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
Zhichao Wang
Main category: cs.LG
TL;DR: GIFT is a novel RL framework for aligning LLMs that minimizes discrepancy between implicit and explicit reward models through joint normalization, converting reward maximization into a convex MSE loss problem.
Details
Motivation: Existing RL methods for LLM alignment like PPO and GRPO directly maximize cumulative rewards but face challenges with complex optimization landscapes, hyperparameter sensitivity, and training instability. Offline methods like DPO and UNA lack exploration capability. There's a need for a method that combines the benefits of on-policy exploration with stable, convex optimization.
Method: GIFT combines three key ideas: (1) online multi-response generation and normalization from GRPO, (2) implicit reward formulation from DPO, and (3) implicit-explicit reward alignment principle from UNA. It jointly normalizes implicit and explicit rewards to eliminate intractable terms, transforming the reward maximization objective into a simple mean squared error (MSE) loss between normalized reward functions.
Result: GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient. It requires fewer hyperparameters than GRPO, converges faster, generalizes better with reduced training overfitting, and retains on-policy exploration capability unlike offline methods.
Conclusion: GIFT provides a stable, convex, and analytically differentiable formulation for LLM alignment that combines the benefits of on-policy exploration with efficient optimization, outperforming existing methods in both performance and computational efficiency.
Abstract: I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
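Read literally from the abstract, the GIFT loss compares group-normalized explicit rewards with group-normalized implicit (DPO-style) rewards via an MSE. The sketch below is a simplified numpy rendering under that reading; the shapes, the beta value, and the use of per-group standardization are assumptions, not the paper's exact formulation.

```python
import numpy as np

def standardize(x, eps=1e-8):
    return (x - x.mean()) / (x.std() + eps)

def gift_loss(logp_policy, logp_ref, explicit_rewards, beta=0.1):
    """MSE between jointly normalized implicit and explicit rewards for one
    group of sampled responses to the same prompt (simplified sketch)."""
    implicit = beta * (logp_policy - logp_ref)       # DPO-style implicit reward
    return float(np.mean((standardize(implicit) - standardize(explicit_rewards)) ** 2))

# One prompt, four sampled responses (toy numbers).
logp_policy = np.array([-12.0, -15.5, -11.2, -14.0])   # sequence log-probs under the policy
logp_ref    = np.array([-12.5, -15.0, -12.0, -14.2])   # under the frozen reference model
rewards     = np.array([1.0, 0.0, 1.0, 0.0])           # explicit rewards (e.g., correctness)

print("GIFT-style loss:", gift_loss(logp_policy, logp_ref, rewards))
```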
[520] Exact Recovery in the Data Block Model
Amir R. Asadi, Akbar Davoodi, Ramin Javadi, Farzad Parvaresh
Main category: cs.LG
TL;DR: The paper studies exact community recovery in networks with node attributes using the Data Block Model, introducing Chernoff-TV divergence to characterize sharp recovery thresholds and providing efficient algorithms achieving these bounds.
Details
Motivation: Real-world networks often contain additional node attributes beyond connectivity, but classical community detection methods focus only on graph structure. The authors aim to understand how side information from node data can improve exact community recovery.
Method: Extends the stochastic block model to the Data Block Model with node-associated data, introduces Chernoff-TV divergence to analyze information-theoretic limits, develops efficient algorithms achieving the threshold, and provides matching converse results.
Result: Establishes sharp exact recovery threshold for DBM, shows efficient algorithm achieves this threshold, proves impossibility below threshold, and demonstrates benefits of vertex data through simulations.
Conclusion: Node attributes provide valuable side information for community detection, with Chernoff-TV divergence characterizing fundamental limits; incorporating data enables exact recovery in regimes where graph-only methods fail.
Abstract: Community detection in networks is a fundamental problem in machine learning and statistical inference, with applications in social networks, biological systems, and communication networks. The stochastic block model (SBM) serves as a canonical framework for studying community structure, and exact recovery, identifying the true communities with high probability, is a central theoretical question. While classical results characterize the phase transition for exact recovery based solely on graph connectivity, many real-world networks contain additional data, such as node attributes or labels. In this work, we study exact recovery in the Data Block Model (DBM), an SBM augmented with node-associated data, as formalized by Asadi, Abbe, and Verdú (2017). We introduce the Chernoff–TV divergence and use it to characterize a sharp exact recovery threshold for the DBM. We further provide an efficient algorithm that achieves this threshold, along with a matching converse result showing impossibility below the threshold. Finally, simulations validate our findings and demonstrate the benefits of incorporating vertex data as side information in community detection.
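To make the model concrete, the sketch below simulates a toy two-community Data Block Model: a symmetric SBM for edges plus a Gaussian node attribute whose mean depends on the community. The specific edge probabilities and the Gaussian attribute channel are illustrative choices, not the general formulation of Asadi, Abbe, and Verdú.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
labels = rng.integers(0, 2, size=n)                 # two hidden communities

# SBM edge probabilities (the theory works in a logarithmic degree regime; fixed here).
p_in, p_out = 0.10, 0.02
same = labels[:, None] == labels[None, :]
probs = np.where(same, p_in, p_out)
adj = np.triu(rng.random((n, n)) < probs, k=1)
adj = adj | adj.T                                   # undirected adjacency matrix

# Node-associated data: a 1-D Gaussian whose mean is -1 or +1 depending on the community.
node_data = rng.normal(loc=2 * labels - 1, scale=1.5)

print("edges:", int(adj.sum() // 2),
      "mean attribute per community:",
      node_data[labels == 0].mean(), node_data[labels == 1].mean())
```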
[521] CARL: Focusing Agentic Reinforcement Learning on Critical Actions
Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua
Main category: cs.LG
TL;DR: CARL is a reinforcement learning algorithm that focuses training on critical actions in long-horizon agentic reasoning tasks, using entropy as a proxy for action criticality to improve efficiency and performance.
Details
Motivation: Conventional reinforcement learning algorithms assume all actions contribute equally to outcomes, which is suboptimal for multi-step agentic reasoning where only a small fraction of actions are critical to final success.
Method: CARL uses entropy as a heuristic proxy to identify critical actions, then focuses training by assigning rewards to high-criticality actions while excluding low-criticality actions from model updates to avoid noisy credit assignment and redundant computation.
Result: Extensive experiments show CARL achieves both stronger performance and higher efficiency across diverse evaluation settings compared to conventional approaches.
Conclusion: Focusing training on critical actions through entropy-based criticality assessment enables more efficient and effective reinforcement learning for long-horizon agentic reasoning tasks.
Abstract: Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each action holds equal contribution, which deviates significantly from reality. Our analysis reveals that only a small fraction of actions are critical in determining the final outcome. Building on this insight, we propose CARL, a critical-action-focused reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for action criticality and achieves focused training by assigning rewards to high-criticality actions while excluding low-criticality actions from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
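The core mechanism can be phrased as masking: compute the policy's entropy at each action, treat high-entropy actions as critical, and zero out the contribution of the rest. The sketch below illustrates that masking on toy per-step action distributions; the median threshold and the reward assignment details are assumptions, not CARL's exact recipe.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=-1)

rng = np.random.default_rng(0)

# Toy per-step action distributions of one trajectory (T steps, A candidate actions).
T, A = 6, 5
logits = rng.normal(size=(T, A)) * np.array([0.2, 3.0, 0.3, 2.5, 0.1, 3.5])[:, None]
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

step_entropy = entropy(probs)
critical = step_entropy >= np.quantile(step_entropy, 0.5)   # assumed thresholding rule

trajectory_advantage = 1.0                                    # e.g., a group-relative advantage
step_advantages = np.where(critical, trajectory_advantage, 0.0)  # low-criticality steps excluded

print("per-step entropy:", np.round(step_entropy, 3))
print("advantages used in the update:", step_advantages)
```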
[522] CFRecs: Counterfactual Recommendations on Real Estate User Listing Interaction Graphs
Seyedmasoud Mousavi, Ruomeng Xu, Xiaojing Zhu
Main category: cs.LG
TL;DR: CFRecs is a counterfactual graph learning framework that transforms graph-based explanations into actionable recommendations for real estate platforms like Zillow.
Details
Motivation: While graph neural networks are widely used for learning from graph-structured data, counterfactual explanations can improve model interpretability. However, existing counterfactual graph learning focuses on explanations rather than actionable insights for practical applications like recommender systems.
Method: CFRecs employs a two-stage architecture with a graph neural network (GNN) and graph variational auto-encoder (Graph-VAE) to propose minimal yet high-impact changes in graph structure and node attributes. It optimizes for sparsity of changes and validity of predictions to generate actionable recommendations.
Result: Experimental results on Zillow’s user-listing interaction data demonstrate CFRecs’ effectiveness in providing actionable recommendations for home buyers and sellers, offering a fresh perspective on recommendations using counterfactual reasoning in graphs.
Conclusion: CFRecs successfully transforms counterfactual graph explanations into practical, actionable insights for recommender systems, particularly in competitive markets like real estate.
Abstract: Graph-structured data is ubiquitous and powerful in representing complex relationships in many online platforms. While graph neural networks (GNNs) are widely used to learn from such data, counterfactual graph learning has emerged as a promising approach to improve model interpretability. Counterfactual explanation research focuses on identifying a counterfactual graph that is similar to the original but leads to different predictions. These explanations optimize two objectives simultaneously: the sparsity of changes in the counterfactual graph and the validity of its predictions. Building on these qualitative optimization goals, this paper introduces CFRecs, a novel framework that transforms counterfactual explanations into actionable insights. CFRecs employs a two-stage architecture consisting of a graph neural network (GNN) and a graph variational auto-encoder (Graph-VAE) to strategically propose minimal yet high-impact changes in graph structure and node attributes to drive desirable outcomes in recommender systems. We apply CFRecs to Zillow’s graph-structured data to deliver actionable recommendations for both home buyers and sellers with the goal of helping them navigate the competitive housing market and achieve their homeownership goals. Experimental results on Zillow’s user-listing interaction data demonstrate the effectiveness of CFRecs, which also provides a fresh perspective on recommendations using counterfactual reasoning in graphs.
[523] Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks
Minyoung Kim
Main category: cs.LG
TL;DR: Proposes a novel scalable variational inference method for Bayesian neural networks using score matching and proximal penalty, enabling large-scale applications including Vision Transformers.
Details
Motivation: Bayesian neural networks offer advantages like uncertainty quantification and robustness, but variational inference methods struggle with large-scale networks. Existing score-based VI methods have computational limitations for large BNNs.
Method: Combines a score matching loss with a proximal penalty term in iterative optimization, avoids reparametrized sampling, and allows noisy unbiased mini-batch scores through stochastic gradients, enabling scalability to large networks.
Result: Method scales to large-scale neural networks including Vision Transformers, allows richer variational density families, and shows effectiveness on visual recognition and time-series forecasting benchmarks.
Conclusion: Proposed score-based VI method provides scalable Bayesian inference for large neural networks, overcoming limitations of existing approaches while maintaining benefits of Bayesian modeling.
Abstract: Bayesian (deep) neural networks (BNN) are often more attractive than the mainstream point-estimate vanilla deep learning in various aspects including uncertainty quantification, robustness to noise, resistance to overfitting, and more. The variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is a dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Although there have been quite a few score-based variational inference methods proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method where the learning objective combines the score matching loss and the proximal penalty term in iterations, which helps our method avoid the reparametrized sampling, and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks including Vision Transformers, and allows for richer variational density families. On several benchmarks including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.
[524] Escaping Local Minima Provably in Non-convex Matrix Sensing: A Deterministic Framework via Simulated Lifting
Tianqi Shen, Jinji Yang, Junze He, Kunhan Gao, Ziye Ma
Main category: cs.LG
TL;DR: A deterministic framework called Simulated Oracle Direction (SOD) that escapes spurious local minima in low-rank matrix sensing by simulating over-parameterized escape directions without actual tensor lifting.
Details
Motivation: Low-rank matrix sensing has challenging nonconvex landscapes with many spurious local minima. While over-parameterization via tensor lifting can convert local minima into saddle points, actual lifting is computationally intractable. The goal is to achieve similar benefits without the computational cost.
Method: Proposes Simulated Oracle Direction (SOD) mechanism that simulates the landscape and escape directions of the over-parametrized space without actually lifting the problem. Designs a mathematical framework to project over-parametrized escape directions onto the original parameter space to guarantee a strict decrease from local minima.
Result: Numerical experiments show the framework reliably escapes local minima and facilitates convergence to global optima with minimal computational cost compared to explicit tensor over-parameterization.
Conclusion: The first deterministic framework guaranteed to escape spurious local minima without random perturbations or heuristic estimates. It has implications for nonconvex optimization beyond matrix sensing by showing how simulated over-parameterization can tame challenging optimization landscapes.
Abstract: Low-rank matrix sensing is a fundamental yet challenging nonconvex problem whose optimization landscape typically contains numerous spurious local minima, making it difficult for gradient-based optimizers to converge to the global optimum. Recent work has shown that over-parameterization via tensor lifting can convert such local minima into strict saddle points, an insight that also partially explains why massive scaling can improve generalization and performance in modern machine learning. Motivated by this observation, we propose a Simulated Oracle Direction (SOD) escape mechanism that simulates the landscape and escape direction of the over-parametrized space, without resorting to actually lifting the problem, since that would be computationally intractable. In essence, we designed a mathematical framework to project over-parametrized escape directions onto the original parameter space to guarantee a strict decrease of objective value from existing local minima. To the best of our knowledge, this represents the first deterministic framework that can escape spurious local minima with a guarantee, without relying on random perturbations or heuristic estimates. Numerical experiments demonstrate that our framework reliably escapes local minima and facilitates convergence to global optima, while incurring minimal computational cost when compared to explicit tensor over-parameterization. We believe this framework has non-trivial implications for nonconvex optimization beyond matrix sensing, by showcasing how simulated over-parameterization can be leveraged to tame challenging optimization landscapes.
[525] Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing
Anxin Guo, Jingwei Li
Main category: cs.LG
TL;DR: The paper formalizes LLM hallucination as a membership testing problem, showing hallucinations are information-theoretically optimal under limited capacity, not just training artifacts.
Details
Motivation: To understand why LLMs hallucinate random facts with high confidence, even when such facts lack inferable patterns and the models have perfect training data.
Method: Formalizes memorization as a membership testing problem, unifies Bloom filter metrics with LLM log-loss, analyzes the sparse-facts regime, and establishes a rate-distortion theorem showing optimal memory efficiency requires KL divergence minimization.
Result: Establishes that hallucination is information-theoretically optimal under limited capacity - optimal strategy is to assign high confidence to some non-facts rather than abstain or forget.
Conclusion: Hallucinations persist as natural consequence of lossy compression, not just training artifacts, validated empirically on synthetic data.
Abstract: Large language models often hallucinate with high confidence on “random facts” that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination: even with optimal training, perfect data, and a simplified “closed world” setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on synthetic data, showing that hallucinations persist as a natural consequence of lossy compression.
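The Bloom-filter analogy in the abstract is easy to demonstrate: with a fixed bit budget, a space-efficient membership tester must answer "yes" for some non-members, the discrete analogue of confident hallucination. The sketch below is a minimal Bloom filter with arbitrarily chosen sizes, not the paper's rate-distortion construction.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=2048, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def contains(self, item):          # never false negatives, sometimes false positives
        return all(self.bits[pos] for pos in self._positions(item))

facts = [f"fact-{i}" for i in range(500)]
bf = BloomFilter()
for f in facts:
    bf.add(f)

non_facts = [f"non-fact-{i}" for i in range(5000)]
false_positives = sum(bf.contains(x) for x in non_facts)
print(f"confidently 'remembered' non-facts: {false_positives} / {len(non_facts)}")
```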
[526] ContextBench: A Benchmark for Context Retrieval in Coding Agents
Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Sarro Federica, Zhaoyang Chu, He Ye
Main category: cs.LG
TL;DR: ContextBench is a process-oriented evaluation framework for coding agents that measures context retrieval performance during issue resolution, revealing that sophisticated agent scaffolding provides only marginal gains and LLMs favor recall over precision.
Details
Motivation: Existing evaluations of LLM-based coding agents focus mainly on final task success, providing limited insight into how agents retrieve and use code context during problem solving. There's a need for process-oriented evaluation that examines intermediate steps.
Method: Created ContextBench with 1,136 issue-resolution tasks from 66 repositories across 8 programming languages, each augmented with human-annotated gold contexts. Implemented an automated evaluation framework tracking agent trajectories and measuring context recall, precision, and efficiency throughout issue resolution.
Result: Evaluation of 4 frontier LLMs and 5 coding agents showed: 1) sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson”), 2) LLMs consistently favor recall over precision, and 3) substantial gaps exist between explored and utilized context.
Conclusion: ContextBench provides intermediate gold-context metrics that unbox the issue-resolution process, offering valuable intermediate signals for guiding LLM reasoning in software tasks. It augments existing end-to-end benchmarks with process-oriented evaluation.
Abstract: LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks. Data and code are available at: https://cioutn.github.io/context-bench/.
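The recall and precision metrics are straightforward set comparisons between the context an agent touched and the annotated gold context. The sketch below computes them at file granularity with made-up file names; ContextBench's exact matching granularity and its efficiency metric are not reproduced.

```python
def context_metrics(retrieved_files, gold_files):
    """File-level context recall and precision for one issue (simplified sketch)."""
    retrieved, gold = set(retrieved_files), set(gold_files)
    hit = retrieved & gold
    recall = len(hit) / len(gold) if gold else 1.0
    precision = len(hit) / len(retrieved) if retrieved else 0.0
    return recall, precision

gold = ["src/parser.py", "src/lexer.py"]
trajectory = ["README.md", "src/parser.py", "tests/test_parser.py", "src/lexer.py", "setup.py"]

recall, precision = context_metrics(trajectory, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")   # recall=1.00 precision=0.40
```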
[527] The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks
Donald Ye
Main category: cs.LG
TL;DR: Gradient magnitude in Transformers doesn’t reliably indicate causal importance - removing low-gradient components can destroy generalization while removing high-gradient ones can be harmless or harmful unpredictably.
Details
Motivation: The paper investigates a paradox where gradient magnitude doesn't align with causal importance in neural networks, challenging the common assumption that high-gradient components are important for model performance.
Method: Formalized the Gradient-Causal Gap in Transformers trained on algorithmic tasks, measured correlation between gradient magnitude and causal importance across tasks of varying complexity, and conducted pruning experiments to test the relationship.
Result: Gradient-causal alignment collapses with task complexity (ρ=0.73 for reversal vs ρ=0.32 for sorting, sometimes inverted ρ=-0.11). Removing low-gradient “Hidden Heroes” consistently harms OOD accuracy (-32%), while removing high-gradient “Gradient Bloats” is unpredictable - harmless in most seeds but catastrophic in others.
Conclusion: Gradient magnitude is not just inaccurate but unpredictably so for indicating causal importance, making gradient-based pruning unreliable for preserving model capabilities due to the Gradient-Causal Gap phenomenon.
Abstract: Removing “important” high-gradient components from a neural network can improve generalization, while removing “unimportant” low-gradient components can destroy it. We demonstrate this paradox by formalizing the Gradient-Causal Gap in Transformers trained on algorithmic tasks. While gradient magnitude and causal importance align on simple tasks (ρ=0.73 for reversal), this relationship collapses as task complexity increases (ρ=0.32 for sorting), sometimes becoming inverted (ρ=-0.11). Pruning experiments reveal that gradient magnitude is not merely inaccurate but unpredictably so. Removing low-gradient “Hidden Heroes” consistently devastates OOD accuracy (-32%). Removing high-gradient “Gradient Bloats” is a coin flip: harmless in most seeds (indicating optimization noise), catastrophic in others (indicating overfitting circuits). This unpredictability means gradient-based pruning cannot reliably preserve model capabilities.
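The gap itself is measured by correlating two per-component scores: gradient magnitude and the accuracy drop when the component is ablated. The sketch below shows the bookkeeping with placeholder random scores; the Transformer, the algorithmic tasks, and the actual ablation procedure from the paper are not included.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_heads = 48

# Placeholders: in the paper these come from backprop and from ablating each head.
gradient_magnitude = rng.gamma(2.0, 1.0, size=n_heads)
causal_importance = 0.3 * gradient_magnitude + rng.gamma(2.0, 1.0, size=n_heads)

rho, pval = spearmanr(gradient_magnitude, causal_importance)
print(f"gradient-causal alignment: rho={rho:.2f} (p={pval:.3f})")

# "Hidden Heroes": low gradient magnitude but high causal importance.
low_grad = gradient_magnitude < np.quantile(gradient_magnitude, 0.25)
high_causal = causal_importance > np.quantile(causal_importance, 0.75)
print("hidden-hero heads:", np.flatnonzero(low_grad & high_causal))
```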
[528] Chunky Post-Training: Data Driven Failures of Generalization
Seoirse Murray, Allison Qi, Timothy Qian, John Schulman, Collin Burns, Sara Price
Main category: cs.LG
TL;DR: SURF and TURF are tools for detecting and tracing unintended behaviors in LLMs caused by spurious correlations learned from post-training data chunks.
Details
Motivation: LLM post-training uses diverse datasets that encode incidental patterns alongside intended behaviors, creating spurious correlations that lead to surprising model behaviors like rejecting true facts in specific formats.
Method: SURF, a black-box pipeline for surfacing unintended behaviors at runtime; TURF, a tool for tracing those failures back to specific post-training data chunks.
Result: Applied to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), showing chunky post-training produces miscalibrated behaviors from imbalanced or underspecified data chunks.
Conclusion: Chunky post-training creates unintended behaviors through spurious correlations in post-training data, requiring tools like SURF/TURF for detection and tracing.
Abstract: LLM post-training involves many diverse datasets, each targeting a specific behavior. But these datasets encode incidental patterns alongside intended ones: correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process. These patterns are often invisible to developers yet salient to models, producing behaviors that surprise their creators, such as rejecting true facts presented in a particular question format. We call this chunky post-training: the model learns spurious correlations as a result of distinct chunks of post-training data. We introduce SURF, a black-box pipeline which surfaces these unintended behaviors at run time, and TURF, a tool that traces these failures back to specific post-training data. Applying these tools to frontier models (Claude 4.5, GPT-5.1, Grok 4.1, Gemini 3) and open models (Tülu 3), we show that chunky post-training produces miscalibrated behaviors, which often result from imbalanced or underspecified chunks of post-training data.
[529] Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning
Wenquan Lu, Hai Huang, Randall Balestriero
Main category: cs.LG
TL;DR: Prompt augmentation enables stable long-horizon RL training for math reasoning LLMs by using diverse reasoning templates to prevent entropy collapse, achieving SOTA results on math benchmarks.
Details
Motivation: Existing RL methods for improving LLM math reasoning suffer from entropy collapse during training, forcing short training horizons and limiting exploration. Most approaches also use fixed reasoning prompts, reducing diversity.
Method: Introduces a prompt augmentation strategy that instructs models to generate reasoning traces under diverse templates and formats, increasing rollout diversity and enabling stable scaling of training duration without KL regularization.
Result: Qwen2.5-Math-1.5B model trained with prompt augmentation on MATH Level 3-5 dataset achieves SOTA performance: 45.2% per-benchmark accuracy and 51.8% per-question accuracy on AIME24, AMC, MATH500, Minerva, and OlympiadBench.
Conclusion: Prompt augmentation enables stable long-horizon RL training for math reasoning LLMs, allowing models to tolerate low-entropy regimes without collapse and achieving superior performance on mathematical reasoning benchmarks.
Abstract: Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 45.2 per-benchmark accuracy and 51.8 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt-augmentation-GRPO.
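Mechanically, prompt augmentation just varies the reasoning template attached to each problem when rollouts are sampled. The sketch below shows that sampling step with a handful of hypothetical templates; the actual template pool and the GRPO training loop from the paper are not reproduced.

```python
import random

# Hypothetical reasoning templates; the paper's actual pool is not shown here.
TEMPLATES = [
    "Solve the following problem. Show your reasoning step by step, then give the final answer.\n{problem}",
    "{problem}\n\nThink carefully, write your derivation, and put the final answer in \\boxed{{}}.",
    "You are a careful mathematician. Work through this problem and state the answer clearly:\n{problem}",
    "First restate the problem in your own words, then solve it.\n\nProblem: {problem}",
]

def augmented_prompt(problem, rng=random):
    """Pick a different template per rollout to diversify reasoning traces."""
    return rng.choice(TEMPLATES).format(problem=problem)

problem = "Find the sum of all positive integers n < 100 divisible by 7."
for _ in range(3):
    print(augmented_prompt(problem)[:60], "...")
```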
[530] Approximation of Log-Partition Function in Policy Mirror Descent Induces Implicit Regularization for LLM Post-Training
Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
Main category: cs.LG
TL;DR: PMD-mean is a practical reinforcement learning algorithm for LLMs that approximates partition function estimation with mean reward, implicitly applying adaptive mixed KL-χ² regularization for more stable policy updates.
Details
Motivation: Policy mirror descent (PMD) is theoretically principled but requires reliable partition function estimation, which is challenging in LLMs with vast action spaces and limited rollouts. The authors aim to develop a practical approximation that maintains stability while being computationally feasible.
Method: PMD-mean approximates the log-partition term in PMD updates with the mean reward under the sampling policy and performs regression in log-policy space. The method implicitly optimizes mirror descent subproblems with an adaptive mixed KL-χ² regularizer that constrains large probability changes.
Result: Experiments on math reasoning tasks show PMD-mean achieves superior performance with improved stability and time efficiency compared to standard approaches. The method produces more conservative updates when expected rewards are low, enhancing robustness against finite-sample estimation errors.
Conclusion: PMD-mean provides a practical and principled RL algorithm for LLMs that balances theoretical soundness with computational feasibility, offering improved stability and efficiency for policy optimization in large action spaces.
Abstract: Policy mirror descent (PMD) provides a principled framework for reinforcement learning (RL) by iteratively solving KL-regularized policy improvement subproblems. While this approach has been adopted in training advanced LLMs such as Kimi K1.5/K2, the ideal closed-form PMD updates require reliable partition function estimation, a significant challenge when working with limited rollouts in the vast action spaces of LLMs. We investigate a practical algorithm, termed PMD-mean, that approximates the log-partition term with the mean reward under the sampling policy and performs regression in log-policy space. Specifically, we characterize the population solution of PMD-mean and demonstrate that it implicitly optimizes mirror descent subproblems with an adaptive mixed KL–$χ^2$ regularizer. This additional $χ^2$ regularization constrains large probability changes, producing more conservative updates when expected rewards are low and enhancing robustness against finite-sample estimation errors. Experiments on math reasoning tasks show that PMD-mean achieves superior performance with improved stability and time efficiency. These findings deepen our understanding of PMD-mean and illuminate pathways toward principled improvements in RL algorithms for LLMs. Code is available at https://github.com/horizon-rl/OpenKimi.
[531] Tuning Out-of-Distribution (OOD) Detectors Without Given OOD Data
Sudeepta Mondal, Xinyi Mary Xie, Ruxiao Duan, Alex Wong, Ganesh Sundaramoorthi
Main category: cs.LG
TL;DR: Paper introduces OOD detector tuning without requiring separate OOD datasets, showing current methods have high variance based on adhoc dataset choices and proposing a new approach using only training data.
Details
Motivation: Current OOD detectors require separate OOD datasets for tuning, which may be unavailable or unrepresentative of actual unknown unknowns, leading to performance variance based on arbitrary dataset choices.
Method: Proposes a new generic approach to OOD detector tuning that uses only the training data used to train the neural network, without requiring any extra OOD datasets.
Result: The approach improves over baseline methods consistently across higher-parameter OOD detector families, while being comparable across lower-parameter families.
Conclusion: Formalizes the problem of OOD detector tuning without OOD datasets and provides a practical solution that reduces dependency on arbitrary dataset choices.
Abstract: Existing out-of-distribution (OOD) detectors are often tuned by a separate dataset deemed OOD with respect to the training distribution of a neural network (NN). OOD detectors process the activations of NN layers and score the output, where parameters of the detectors are determined by fitting to an in-distribution (training) set and the aforementioned dataset chosen ad hoc. At detector training time, this ad hoc dataset may be unavailable or difficult to obtain, and even when it’s available, it may not be representative of actual OOD data, which is often “unknown unknowns.” Current benchmarks may specify some left-out set from test OOD sets. We show that there can be significant variance in performance of detectors based on the ad hoc dataset chosen in the current literature, and thus even if such a dataset can be collected, the performance of the detector may be highly dependent on the choice. In this paper, we introduce and formalize the often neglected problem of tuning OOD detectors without a given “OOD” dataset. To this end, we present strong baselines as an attempt to approach this problem. Furthermore, we propose a new generic approach to OOD detector tuning that does not require any extra data other than those used to train the NN. We show that our approach improves over baseline methods consistently across higher-parameter OOD detector families, while being comparable across lower-parameter families.
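One simple instance of tuning with training data only, in the spirit of the baselines discussed here (though not necessarily the paper's proposed approach), is to calibrate the detection threshold on a held-out slice of in-distribution scores, e.g. so that 95% of training data is accepted. The score distributions and the 95% target below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def tune_threshold(in_distribution_scores, target_tpr=0.95):
    """Choose a threshold so that `target_tpr` of in-distribution data is kept.

    Only training (in-distribution) scores are needed; no OOD dataset is used.
    Convention: higher score = more in-distribution.
    """
    return np.quantile(in_distribution_scores, 1.0 - target_tpr)

# Placeholder detector scores on a held-out slice of the training set.
id_scores = rng.normal(loc=2.0, scale=1.0, size=5000)
threshold = tune_threshold(id_scores)

new_scores = rng.normal(loc=-0.5, scale=1.0, size=10)   # scores on incoming data
flags = new_scores < threshold                           # True = flagged as OOD
print("threshold:", round(float(threshold), 3), "flagged:", int(flags.sum()), "/", len(flags))
```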
[532] Dimensionality Reduction on Riemannian Manifolds in Data Analysis
Alaa El Ichi, Khalide Jbilou
Main category: cs.LG
TL;DR: Riemannian geometry-based dimensionality reduction methods that respect manifold structure, focusing on Principal Geodesic Analysis as nonlinear PCA for manifold-valued data, with extensions to discriminant analysis and other techniques.
Details
Motivation: To develop dimensionality reduction methods that respect the underlying manifold structure of data, moving beyond Euclidean assumptions to handle data constrained to curved spaces like hyperspheres and symmetric positive definite manifolds.
Method: Investigates Riemannian geometry-based approaches including Principal Geodesic Analysis (PGA) as a nonlinear generalization of PCA, extends discriminant analysis through Riemannian adaptations, and explores manifold learning techniques using geodesic distances, tangent space representations, and intrinsic statistical measures.
Result: Experimental results on representative datasets show Riemannian methods provide improved representation quality and classification performance compared to Euclidean counterparts, especially for data on curved spaces.
Conclusion: Riemannian geometry-aware dimensionality reduction is important for modern machine learning, offering theoretical foundations and practical advantages for handling manifold-structured data.
Abstract: In this work, we investigate Riemannian geometry based dimensionality reduction methods that respect the underlying manifold structure of the data. In particular, we focus on Principal Geodesic Analysis (PGA) as a nonlinear generalization of PCA for manifold valued data, and extend discriminant analysis through Riemannian adaptations of other known dimensionality reduction methods. These approaches exploit geodesic distances, tangent space representations, and intrinsic statistical measures to achieve more faithful low dimensional embeddings. We also discuss related manifold learning techniques and highlight their theoretical foundations and practical advantages. Experimental results on representative datasets demonstrate that Riemannian methods provide improved representation quality and classification performance compared to their Euclidean counterparts, especially for data constrained to curved spaces such as hyperspheres and symmetric positive definite manifolds. This study underscores the importance of geometry aware dimensionality reduction in modern machine learning and data science applications.
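Principal Geodesic Analysis is easiest to see on a concrete manifold. The sketch below, on the unit sphere with NumPy, computes the Fréchet mean via the log/exp maps and then runs ordinary PCA in the tangent space at that mean; the paper's broader setting (other manifolds, Riemannian discriminant analysis) is not covered, and the toy data are synthetic.

```python
import numpy as np

def log_map(p, x):
    """Sphere log map: tangent vector at p pointing toward x."""
    c = np.clip(np.dot(p, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-8:
        return np.zeros_like(p)
    return theta / np.sin(theta) * (x - c * p)

def exp_map(p, v):
    """Sphere exp map: move from p along tangent vector v."""
    n = np.linalg.norm(v)
    if n < 1e-8:
        return p
    return np.cos(n) * p + np.sin(n) * v / n

def principal_geodesic_analysis(X, iters=50):
    """PGA sketch on the unit sphere: Frechet mean, then PCA in its tangent space."""
    mu = X[0].copy()
    for _ in range(iters):                      # fixed-point iteration for the Frechet mean
        mu = exp_map(mu, np.mean([log_map(mu, x) for x in X], axis=0))
    V = np.stack([log_map(mu, x) for x in X])   # lift data to the tangent space at mu
    _, s, vt = np.linalg.svd(V, full_matrices=False)
    return mu, vt, s ** 2 / len(X)              # mean, principal geodesic directions, variances

# toy data: noisy points around the north pole of S^2
rng = np.random.default_rng(0)
pts = rng.normal([0.0, 0.0, 1.0], 0.1, size=(100, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
mean, directions, variances = principal_geodesic_analysis(pts)
```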
[533] Orthogonal Model Merging
Sihan Yang, Kexuan Shi, Weiyang Liu
Main category: cs.LG
TL;DR: OrthoMerge is a model merging method that performs merging on the Riemannian manifold of orthogonal groups to preserve geometric structure of pretrained weights, addressing limitations of linear arithmetic merging in Euclidean space.
Details
Motivation: Current model merging methods use linear arithmetic in Euclidean space, which destroys intrinsic geometric properties of pretrained weights like hyperspherical energy. There’s a need for merging methods that preserve these geometric structures.
Method: OrthoMerge performs merging on the Riemannian manifold formed by orthogonal groups. It maps task-specific orthogonal matrices from Orthogonal Finetuning (OFT) to Lie algebra for principled integration. For non-OFT models, uses Orthogonal-Residual Decoupling to extract orthogonal components via orthogonal Procrustes problem, merging them on the orthogonal group manifold while processing residuals with standard additive merging.
Result: Extensive empirical results show OrthoMerge effectively mitigates catastrophic forgetting and maintains model performance across diverse tasks compared to traditional merging methods.
Conclusion: OrthoMerge provides a geometric-aware merging approach that preserves intrinsic weight properties, offering improved performance and reduced forgetting in model merging scenarios.
Abstract: Merging finetuned Large Language Models (LLMs) has become increasingly important for integrating diverse capabilities into a single unified model. However, prevailing model merging methods rely on linear arithmetic in Euclidean space, which often destroys the intrinsic geometric properties of pretrained weights, such as hyperspherical energy. To address this, we propose Orthogonal Model Merging (OrthoMerge), a method that performs merging operations on the Riemannian manifold formed by the orthogonal group to preserve the geometric structure of the model’s weights. By mapping task-specific orthogonal matrices learned by Orthogonal Finetuning (OFT) to the Lie algebra, OrthoMerge enables a principled yet efficient integration that takes into account both the direction and intensity of adaptations. In addition to directly leveraging orthogonal matrices obtained by OFT, we further extend this approach to general models finetuned with non-OFT methods (i.e., low-rank finetuning, full finetuning) via an Orthogonal-Residual Decoupling strategy. This technique extracts the orthogonal components of expert models by solving the orthogonal Procrustes problem, which are then merged on the manifold of the orthogonal group, while the remaining linear residuals are processed through standard additive merging. Extensive empirical results demonstrate the effectiveness of OrthoMerge in mitigating catastrophic forgetting and maintaining model performance across diverse tasks.
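A rough picture of the two ingredients, as a sketch rather than the authors' code: the orthogonal part of each expert is extracted with the orthogonal Procrustes problem, those rotations are averaged in the Lie algebra via the matrix logarithm and mapped back with the matrix exponential, and the leftover residuals are merged additively. The toy weights and the uniform merging coefficients are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import expm, logm, orthogonal_procrustes

def orthogonal_part(w_expert, w_base):
    """Nearest rotation R with R @ w_base approximating w_expert, plus the additive residual."""
    Rt, _ = orthogonal_procrustes(w_base.T, w_expert.T)   # minimises ||w_base.T @ Rt - w_expert.T||_F
    R = Rt.T
    return R, w_expert - R @ w_base

def merge_on_orthogonal_group(rotations, weights):
    """Average rotations in the Lie algebra (matrix log), then map back with expm."""
    skew = sum(w * logm(R).real for R, w in zip(rotations, weights))
    return expm(skew)

# toy example: two "experts" derived from the same base weight matrix
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 8))
experts = [expm(0.1 * (a - a.T)) @ w_base + 0.01 * rng.normal(size=(8, 8))
           for a in (rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))]

parts = [orthogonal_part(w, w_base) for w in experts]
R_merged = merge_on_orthogonal_group([R for R, _ in parts], [0.5, 0.5])
residual = 0.5 * sum(res for _, res in parts)             # residuals merged additively
w_merged = R_merged @ w_base + residual
```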
[534] $f$-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song
Main category: cs.LG
TL;DR: A unified divergence-based framework for LLM alignment that extends preference alignment to general settings like RL with verifiable rewards, proposing f-GRPO and f-HAL methods with theoretical guarantees.
Details
Motivation: To extend the divergence-based perspective of preference alignment to general alignment settings where only environmental rewards are available (like RLVR), creating a unified framework for LLM alignment.
Method: Proposes f-Group Relative Policy Optimization (f-GRPO) for on-policy RL and f-Hybrid Alignment Loss (f-HAL) for hybrid on/off-policy objectives based on variational representations of f-divergences.
Result: Empirical validation on RLVR (Math Reasoning) and PA tasks (Safety Alignment) shows superior performance and flexibility compared to current methods, with theoretical guarantees for reward improvement.
Conclusion: The divergence-based framework provides a unified approach to LLM alignment that works across different settings (preference alignment and RL with verifiable rewards) with strong theoretical foundations and empirical performance.
Abstract: Recent research shows that Preference Alignment (PA) objectives act as divergence estimators between aligned (chosen) and unaligned (rejected) response distributions. In this work, we extend this divergence-based perspective to general alignment settings, such as reinforcement learning with verifiable rewards (RLVR), where only environmental rewards are available. Within this unified framework, we propose $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy reinforcement learning algorithms, and $f$-Hybrid Alignment Loss ($f$-HAL), a class of hybrid on/off-policy objectives, for general LLM alignment based on the variational representation of $f$-divergences. We provide theoretical guarantees that these classes of objectives improve the average reward after alignment. Empirically, we validate our framework on both RLVR (Math Reasoning) and PA tasks (Safety Alignment), demonstrating superior performance and flexibility compared to current methods.
[535] Breaking Symmetry Bottlenecks in GNN Readouts
Mouad Talhi, Arne Wolf, Anthea Monod
Main category: cs.LG
TL;DR: GNN readout functions (like sum/mean pooling) fundamentally lose graph structure information due to symmetry constraints, but new projector-based invariant readouts can preserve this information while maintaining permutation invariance.
Details
Motivation: Current GNNs have expressivity limitations in distinguishing non-isomorphic graphs. While these limitations are often attributed to message passing, this paper identifies an independent bottleneck at the readout stage where linear permutation-invariant readouts (like sum/mean pooling) erase crucial symmetry-aware information.
Method: Using finite-dimensional representation theory, the authors prove that all linear permutation-invariant readouts factor through the Reynolds operator, projecting node embeddings onto the fixed subspace and erasing non-trivial symmetry-aware components. To overcome this, they introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics.
Result: The new readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks. The readout design is shown to be a decisive factor in GNN expressivity that was previously under-appreciated.
Conclusion: Readout functions in GNNs create a fundamental expressivity bottleneck independent of message passing limitations. The proposed projector-based invariant readouts overcome this by preserving symmetry-aware information while maintaining permutation invariance, significantly improving GNN expressivity.
Abstract: Graph neural networks (GNNs) are widely used for learning on structured data, yet their ability to distinguish non-isomorphic graphs is fundamentally limited. These limitations are usually attributed to message passing; in this work we show that an independent bottleneck arises at the readout stage. Using finite-dimensional representation theory, we prove that all linear permutation-invariant readouts, including sum and mean pooling, factor through the Reynolds (group-averaging) operator and therefore project node embeddings onto the fixed subspace of the permutation action, erasing all non-trivial symmetry-aware components regardless of encoder expressivity. This yields both a new expressivity barrier and an interpretable characterization of what global pooling preserves or destroys. To overcome this collapse, we introduce projector-based invariant readouts that decompose node representations into symmetry-aware channels and summarize them with nonlinear invariant statistics, preserving permutation invariance while retaining information provably invisible to averaging. Empirically, swapping only the readout enables fixed encoders to separate WL-hard graph pairs and improves performance across multiple benchmarks, demonstrating that readout design is a decisive and under-appreciated factor in GNN expressivity.
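The contrast between an averaging readout and a nonlinear invariant readout can be shown in a few lines. The sketch below is not the paper's projector construction; it simply illustrates one way to summarize the component that averaging erases (the deviation from the mean) with permutation-invariant statistics.

```python
import torch

def mean_readout(h):
    """Standard linear invariant readout: averaging projects node embeddings onto the
    fixed subspace of the permutation action (the Reynolds operator); everything
    orthogonal to that subspace is discarded."""
    return h.mean(dim=0)

def invariant_moment_readout(h, max_order=4):
    """Sketch of a nonlinear invariant readout: keep the mean channel, but also
    summarize the component erased by averaging with permutation-invariant
    central moments per feature channel."""
    mu = h.mean(dim=0)
    dev = h - mu                                      # the symmetry-aware residual channel
    moments = [(dev ** k).mean(dim=0) for k in range(2, max_order + 1)]
    return torch.cat([mu, *moments])                  # still permutation invariant

# toy node embeddings for a 6-node graph with 16-dim features
h = torch.randn(6, 16)
z_mean = mean_readout(h)                 # 16-dim, averaging-only view
z_rich = invariant_moment_readout(h)     # 64-dim, retains higher-order structure
```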
[536] Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces
Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Nikolay Malkin
Main category: cs.LG
TL;DR: The paper introduces off-policy training techniques for discrete diffusion samplers and generalizes them to Schrödinger bridge problems, with applications to data-free posterior sampling in discrete latent spaces.
Details
Motivation: While diffusion samplers have been successful for continuous-space sampling, their application to discrete spaces remains under-explored and doesn’t fully leverage techniques used in continuous-space sampling. The paper aims to bridge this gap.
Method: Proposes off-policy training techniques for discrete diffusion samplers, generalizes them to data-to-energy Schrödinger bridge training for discrete domains, and applies these to data-free posterior sampling in discrete latent spaces of image generative models.
Result: The off-policy techniques improve discrete sampler performance on established and new synthetic benchmarks. The method successfully generalizes to Schrödinger bridge problems and enables data-free posterior sampling in discrete latent spaces.
Conclusion: The paper successfully bridges the gap between continuous and discrete diffusion sampling by introducing off-policy training techniques and generalizing to Schrödinger bridge problems, with practical applications in image generative models.
Abstract: Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.
[537] A Hybrid Data-Driven Algorithm for Real-Time Friction Force Estimation in Hydraulic Cylinders
Mohamad Amin Jamshidi, Mehrbod Zarifi, Zolfa Anvari, Hamed Ghafarirad, Mohammad Zareinejad
Main category: cs.LG
TL;DR: Hybrid LSTM-Random Forest algorithm for real-time friction force estimation in hydraulic cylinders, achieving <10% error with 1.51ms computational cost.
Details
Motivation: Hydraulic cylinders require accurate friction models for precision control, but existing analytical models (like LuGre) lack adaptability to varying operating conditions and computational efficiency for real-time applications.
Method: Data-driven hybrid algorithm combining Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation, using experimental training data from hydraulic test setup.
Result: Achieves consistent model error <10% across diverse operating conditions and load variations, with computational cost of 1.51 milliseconds per estimation, outperforming traditional LuGre model.
Conclusion: The hybrid LSTM-Random Forest approach provides superior precision and real-time computational efficiency compared to analytical friction models, making it suitable for practical hydraulic system applications.
Abstract: Hydraulic systems are widely utilized in industrial applications due to their high force generation, precise control, and ability to function in harsh environments. Hydraulic cylinders, as actuators in these systems, apply force and position through the displacement of hydraulic fluid, but their operation is significantly influenced by friction force. Achieving precision in hydraulic cylinders requires an accurate friction model under various operating conditions. Existing analytical models, often derived from experimental tests, necessitate the identification or estimation of influencing factors but are limited in adaptability and computational efficiency. This research introduces a data-driven, hybrid algorithm based on Long Short-Term Memory (LSTM) networks and Random Forests for nonlinear friction force estimation. The algorithm effectively combines feature detection and estimation processes using training data acquired from an experimental hydraulic test setup. It achieves a consistent and stable model error of less than 10% across diverse operating conditions and external load variations, ensuring robust performance in complex situations. The computational cost of the algorithm is 1.51 milliseconds per estimation, making it suitable for real-time applications. The proposed method addresses the limitations of analytical models by delivering high precision and computational efficiency. The algorithm’s performance is validated through detailed analysis and experimental results, including direct comparisons with the LuGre model. The comparison highlights that while the LuGre model offers a theoretical foundation for friction modeling, its performance is limited by its inability to dynamically adjust to varying operational conditions of the hydraulic cylinder, further emphasizing the advantages of the proposed hybrid approach in real-time applications.
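Structurally, the hybrid estimator chains a temporal feature extractor into a tree ensemble. The sketch below shows that wiring with PyTorch and scikit-learn on synthetic data; the actual pipeline (how the LSTM is trained, which sensor signals are used, and the feature-detection stage) follows the paper and is not reproduced here.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

# toy sequences: three sensor channels over 50 time steps, synthetic friction target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50, 3)).astype("float32")
y = X[:, -10:, 1].mean(axis=1) + 0.1 * rng.normal(size=200)

# stage 1: LSTM as a temporal feature detector (left untrained here for brevity)
lstm = nn.LSTM(input_size=3, hidden_size=16, batch_first=True)
with torch.no_grad():
    _, (h_n, _) = lstm(torch.from_numpy(X))
features = h_n.squeeze(0).numpy()                  # one 16-dim feature vector per sequence

# stage 2: Random Forest maps the learned features to a friction-force estimate
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(features[:150], y[:150])
pred = rf.predict(features[150:])
```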
[538] Layer-wise LoRA fine-tuning: a similarity metric approach
Keith Ando Ogawa, Bruno Lopes Yamamoto, Lucas Lauton de Alcantara, Lucas Pellicer, Rosimeire Pereira Costa, Edson Bollis, Anna Helena Reali Costa, Artur Jordao
Main category: cs.LG
TL;DR: Layer-wise LoRA: A parameter-efficient fine-tuning method that selectively applies LoRA to only the most relevant layers based on representation changes, reducing trainable parameters by up to 50% while maintaining performance.
Details
Motivation: As LLMs continue to scale, even parameter-efficient fine-tuning methods like LoRA (which reduces trainable parameters by 99%) may become insufficient. The authors argue that not all layers contribute equally to model adaptation, suggesting that selectively fine-tuning only the most relevant layers could further reduce computational costs.
Method: The method systematically selects only a few layers to fine-tune using LoRA or its variants. It identifies the most relevant layers by measuring their contribution to changes in internal representations using Centered Kernel Alignment (CKA). This layer-wise selection approach is orthogonal and compatible with existing LoRA techniques.
Result: The method reduces trainable parameters in LoRA-based techniques by up to 50% while maintaining predictive performance. On encoder-only architectures, there’s negligible performance drop on GLUE benchmark. On decoder-only architectures, there’s small drop or even improvements on mathematical problem-solving and coding tasks. The approach also works effectively for multimodal models.
Conclusion: Layer-wise LoRA provides an effective way to further reduce parameter-efficient fine-tuning costs by selectively targeting only the most relevant layers, demonstrating competitive performance across various model architectures and tasks while cutting trainable parameters by half.
Abstract: Pre-training Large Language Models (LLMs) on web-scale datasets becomes fundamental for advancing general-purpose AI. In contrast, enhancing their predictive performance on downstream tasks typically involves adapting their knowledge through fine-tuning. Parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA), aim to reduce the computational cost of this process by freezing the pre-trained model and updating a smaller number of parameters. In comparison to full fine-tuning, these methods achieve over 99% reduction in trainable parameter count, depending on the configuration. Unfortunately, such a reduction may prove insufficient as LLMs continue to grow in scale. In this work, we address the previous problem by systematically selecting only a few layers to fine-tune using LoRA or its variants. We argue that not all layers contribute equally to the model adaptation. Leveraging this, we identify the most relevant layers to fine-tune by measuring their contribution to changes in internal representations. Our method is orthogonal to and readily compatible with existing low-rank adaptation techniques. We reduce the trainable parameters in LoRA-based techniques by up to 50%, while maintaining the predictive performance across different models and tasks. Specifically, on encoder-only architectures, this reduction in trainable parameters leads to a negligible predictive performance drop on the GLUE benchmark. On decoder-only architectures, we achieve a small drop or even improvements in the predictive performance on mathematical problem-solving capabilities and coding tasks. Finally, this effectiveness extends to multimodal models, for which we also observe competitive results relative to fine-tuning with LoRA modules in all layers. Code is available at: https://github.com/c2d-usp/Layer-wise-LoRA-with-CKA
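The layer-selection criterion rests on linear CKA between activation matrices. Below is a sketch that assumes the relevant comparison is between a layer's representations before and after adaptation, with the most-changed (lowest-CKA) layers taken as the candidates worth fine-tuning; the paper's exact criterion may differ.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices (n x d)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def select_layers(acts_before, acts_after, k=4):
    """Rank layers by how much their representations change (low CKA = large change)
    and return the k most-changed layers as LoRA targets."""
    scores = [linear_cka(b, a) for b, a in zip(acts_before, acts_after)]
    return sorted(np.argsort(scores)[:k].tolist()), scores

# toy example: 12 "layers", 256 samples, 64-dim activations; later layers drift more
rng = np.random.default_rng(0)
acts_before = [rng.normal(size=(256, 64)) for _ in range(12)]
acts_after = [a + rng.normal(scale=s, size=a.shape)
              for s, a in zip(np.linspace(0.1, 2.0, 12), acts_before)]
layers, cka_scores = select_layers(acts_before, acts_after, k=4)
```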
[539] Orthogonal Self-Attention
Leo Zhang, James Martens
Main category: cs.LG
TL;DR: OSA is a novel orthogonal self-attention mechanism that prevents rank collapse and Jacobian instability in skipless Transformers by parametrizing attention matrices as orthogonal via matrix exponential of skew-symmetric query-key values.
Details
Motivation: Softmax Self-Attention (SSA) in skipless Transformer architectures suffers from instability issues including rank collapse and poorly-conditioned Jacobians, which hinder training of Transformers without skip connections and normalization layers.
Method: Orthogonal Self-Attention (OSA) parametrizes attention matrices to be orthogonal by mapping a skew-symmetric matrix (formed from query-key values) through the matrix exponential. This is implemented efficiently by exploiting low-rank structure of query-key values, achieving linear computational complexity and memory cost with sequence length.
Result: OSA enables stable training of non-causal Transformers without skip connections and normalization layers. The authors derive an initialization scheme that ensures the Jacobian of OSA is well-conditioned, and demonstrate practical implementation with linear scaling.
Conclusion: OSA provides a theoretically-grounded solution to stability issues in skipless Transformers, enabling more efficient training of Transformer architectures without traditional skip connections and normalization layers through orthogonal attention matrices.
Abstract: Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA due to inducing rank collapse and poorly-conditioned Jacobians. In this work, we design a novel attention mechanism: Orthogonal Self-Attention (OSA), which aims to bypass these issues with SSA, in order to allow for (non-causal) Transformers without skip connections and normalisation layers to be more easily trained. In particular, OSA parametrises the attention matrix to be orthogonal via mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented, by exploiting the low-rank structure of our query-key values, resulting in the computational complexity and memory cost of OSA scaling linearly with sequence length. Furthermore, we derive an initialisation scheme for which we prove ensures that the Jacobian of OSA is well-conditioned.
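The core construction is compact enough to sketch: form a skew-symmetric matrix from query-key scores and push it through the matrix exponential, which always yields an orthogonal matrix. The specific parametrization below, and the omission of the low-rank trick that gives linear scaling, are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def orthogonal_self_attention(x, w_q, w_k, w_v):
    """Sketch of an orthogonal attention matrix built from query-key scores."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    skew = scores - scores.transpose(-2, -1)       # S = QK^T - (QK^T)^T is skew-symmetric
    attn = torch.matrix_exp(skew)                  # exp of a skew-symmetric matrix is orthogonal
    return attn @ v

# toy usage: sequence of 10 tokens, model dimension 32
x = torch.randn(10, 32)
w_q, w_k, w_v = (torch.randn(32, 32) * 0.05 for _ in range(3))
out = orthogonal_self_attention(x, w_q, w_k, w_v)
```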
[540] On Computation and Reinforcement Learning
Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach
Main category: cs.LG
TL;DR: Compute-bounded RL policies can solve harder problems and generalize better to longer-horizon tasks by using more computational resources, even with fixed parameters.
Details
Motivation: The paper addresses how computational resources affect RL policy learning, noting that standard RL frameworks conflate compute and parameters, and seeks to understand if policies can benefit from additional compute without increasing parameters.
Method: Formalizes compute-bounded policies, proposes a minimal architecture that can use variable compute (building on algorithmic learning and model-free planning), and tests on 31 tasks spanning online and offline RL.
Result: The architecture achieves stronger performance with more compute and better generalization on longer-horizon tasks compared to standard feedforward or deep residual networks with more parameters.
Conclusion: Compute is a distinct resource from parameters in RL, and policies can benefit from additional compute to solve harder problems and generalize better to longer-horizon tasks.
Abstract: How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed amount of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ stronger generalization on longer-horizon test tasks compared to standard feedforward networks or deep residual networks using up to 5 times more parameters.
[541] Mechanisms of AI Protein Folding in ESMFold
Kevin Lu, Jannik Brinkmann, Stefan Huber, Aaron Mueller, Yonatan Belinkov, David Bau, Chris Wendler
Main category: cs.LG
TL;DR: ESMFold’s protein folding mechanism involves two computational stages: early blocks initialize pairwise biochemical signals, while late blocks develop pairwise spatial features like distance and contact information.
Details
Motivation: To understand how protein structure prediction models like ESMFold actually fold proteins by investigating their internal mechanisms, specifically focusing on a beta hairpin structural motif.
Method: Used counterfactual interventions on model latents to trace how ESMFold folds a beta hairpin, analyzing the computational stages in the folding trunk through interpretable representations.
Result: Identified two distinct computational stages: 1) Early blocks initialize pairwise biochemical signals (residue identities and associated features like charge flow), 2) Late blocks develop pairwise spatial features (distance and contact information accumulation).
Conclusion: ESMFold’s structural decision mechanisms can be localized, traced through interpretable representations, and manipulated with strong causal effects, providing insights into how protein structure prediction models work internally.
Abstract: How do protein structure prediction models fold proteins? We investigate this question by tracing how ESMFold folds a beta hairpin, a prevalent structural motif. Through counterfactual interventions on model latents, we identify two computational stages in the folding trunk. In the first stage, early blocks initialize pairwise biochemical signals: residue identities and associated biochemical features such as charge flow from sequence representations into pairwise representations. In the second stage, late blocks develop pairwise spatial features: distance and contact information accumulate in the pairwise representation. We demonstrate that the mechanisms underlying structural decisions of ESMFold can be localized, traced through interpretable representations, and manipulated with strong causal effects.
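Counterfactual intervention on latents follows a generic activation-patching recipe: cache a latent from one run, splice it into another, and measure the change in the output. The toy stack below illustrates only the mechanics; ESMFold's folding trunk operates on sequence and pairwise representations rather than a plain MLP stack, so names and shapes here are assumptions.

```python
import torch
import torch.nn as nn

# a stand-in "folding trunk": a stack of blocks whose latents we can intervene on
blocks = nn.ModuleList([nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(6)])

def run(x, patch_block=None, patch_value=None):
    """Run the stack, optionally replacing the latent after one block with a cached
    latent from another input (a counterfactual intervention)."""
    h, cache = x, []
    for i, blk in enumerate(blocks):
        h = blk(h)
        if patch_block == i and patch_value is not None:
            h = patch_value                      # overwrite the latent at this block
        cache.append(h)
    return h, cache

x_clean, x_corrupt = torch.randn(1, 16), torch.randn(1, 16)
_, clean_cache = run(x_clean)

# patch the clean latent of block 3 into the corrupted run and measure the effect
out_corrupt, _ = run(x_corrupt)
out_patched, _ = run(x_corrupt, patch_block=3, patch_value=clean_cache[3])
effect = (out_patched - out_corrupt).norm().item()   # large effect: block 3 matters causally
```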
[542] A Differential and Pointwise Control Approach to Reinforcement Learning
Minh Nguyen, Chandrajit Bajaj
Main category: cs.LG
TL;DR: Differential RL reformulates reinforcement learning from continuous-time control perspective via differential dual formulation, embedding physics priors for consistent trajectories without explicit constraints.
Details
Motivation: RL in continuous state-action spaces faces challenges in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. Current methods struggle with physics constraints and data efficiency.
Method: Introduces Differential RL framework via differential dual formulation inducing Hamiltonian structure. Develops Differential Policy Optimization (dfPO) - a pointwise, stage-wise algorithm refining local movement operators along trajectories for improved sample efficiency and dynamic alignment.
Result: Establishes pointwise convergence guarantees (not available in standard RL) and theoretical regret bound of O(K^{5/6}). Empirically outperforms standard RL baselines on scientific computing tasks including surface modeling, grid control, and molecular dynamics under low-data and physics-constrained conditions.
Conclusion: Differential RL provides physics-consistent, sample-efficient framework for continuous control problems in scientific computing, addressing key limitations of traditional RL through differential formulation and Hamiltonian structure.
Abstract: Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (dfPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $\mathcal{O}(K^{5/6})$. Empirically, dfPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.
[543] Curiosity is Knowledge: Self-Consistent Learning and No-Regret Optimization with Active Inference
Yingke Li, Anjali Parashar, Enlu Zhou, Chuchu Fan
Main category: cs.LG
TL;DR: Active inference agents using Expected Free Energy achieve both consistent learning and no-regret optimization when curiosity is sufficiently high, connecting AIF to Bayesian experimental design and optimization.
Details
Motivation: Active inference balances exploration and exploitation via Expected Free Energy, but the theoretical conditions under which this balance yields both coherent learning and efficient decision-making have been unclear.
Method: Theoretical analysis establishing conditions for Expected Free Energy-minimizing agents, characterizing dependence on initial uncertainty, identifiability, and objective alignment within a Bayesian framework.
Result: Single requirement of sufficient curiosity simultaneously ensures Bayesian posterior consistency (self-consistent learning) and bounded cumulative regret (no-regret optimization).
Conclusion: Theoretical guarantees connect active inference to classical Bayesian experimental design and optimization, providing practical guidelines for tuning epistemic-pragmatic trade-off in hybrid learning-optimization problems.
Abstract: Active inference (AIF) unifies exploration and exploitation by minimizing the Expected Free Energy (EFE), balancing epistemic value (information gain) and pragmatic value (task performance) through a curiosity coefficient. Yet it has been unclear when this balance yields both coherent learning and efficient decision-making: insufficient curiosity can drive myopic exploitation and prevent uncertainty resolution, while excessive curiosity can induce unnecessary exploration and regret. We establish the first theoretical guarantee for EFE-minimizing agents, showing that a single requirement–sufficient curiosity–simultaneously ensures self-consistent learning (Bayesian posterior consistency) and no-regret optimization (bounded cumulative regret). Our analysis characterizes how this mechanism depends on initial uncertainty, identifiability, and objective alignment, thereby connecting AIF to classical Bayesian experimental design and Bayesian optimization within one theoretical framework. We further translate these theories into practical design guidelines for tuning the epistemic-pragmatic trade-off in hybrid learning-optimization problems, validated through real-world experiments.
[544] The Use of AI-Robotic Systems for Scientific Discovery
Alexander H. Gower, Konstantin Korovin, Daniel Brunnsåker, Filip Kronström, Gabriel K. Reder, Ievgeniia A. Tiukova, Ronald S. Reiserer, John P. Wikswo, Ross D. King
Main category: cs.LG
TL;DR: Robot scientists combine AI and lab robotics to automate the entire scientific method, from theory induction to experimental design and implementation, with applications in systems biology.
Details
Motivation: To automate the complete scientific process by creating coupled AI-robotics systems that can autonomously generate hypotheses, design experiments, and conduct real-world testing, advancing scientific discovery.
Method: Develops robot scientists as integrated systems of AI and laboratory robotics, maps scientific activities to machine learning paradigms (particularly active learning), and implements micro-fluidic systems with computer-controlled bioreactors and interpretable models using controlled vocabularies and logic.
Result: Demonstrates the concept through previous robot scientist implementations and introduces Genesis, a next-generation system for systems biology research featuring 1000 computer-controlled micro-bioreactors.
Conclusion: Robot scientists represent a promising approach to automating scientific discovery by combining AI reasoning with physical experimentation capabilities, with the scientific method showing strong analogy to active learning paradigms.
Abstract: The process of developing theories and models and testing them with experiments is fundamental to the scientific method. Automating the entire scientific method then requires not only automation of the induction of theories from data, but also experimentation from design to implementation. This is the idea behind a robot scientist – a coupled system of AI and laboratory robotics that has agency to test hypotheses with real-world experiments. In this chapter we explore some of the fundamentals of robot scientists in the philosophy of science. We also map the activities of a robot scientist to machine learning paradigms, and argue that the scientific method shares an analogy with active learning. We demonstrate these concepts using examples from previous robot scientists, and also from Genesis: a next generation robot scientist designed for research in systems biology, comprising a micro-fluidic system with 1000 computer-controlled micro-bioreactors and interpretable models based in controlled vocabularies and logic.
[545] AP-OOD: Attention Pooling for Out-of-Distribution Detection
Claus Hofmann, Christian Huber, Bernhard Lehner, Daniel Klotz, Sepp Hochreiter, Werner Zellinger
Main category: cs.LG
TL;DR: AP-OOD: A novel semi-supervised OOD detection method for natural language that uses token-level information beyond simple averaging, achieving state-of-the-art performance on text tasks.
Details
Motivation: Current OOD detection methods for language models struggle to effectively leverage and aggregate token embeddings to compute OOD scores. There’s a need for methods that go beyond simple average-based aggregation and can utilize token-level information while being flexible enough to work in both unsupervised and supervised settings.
Method: AP-OOD is a semi-supervised approach that flexibly interpolates between unsupervised and supervised settings. It exploits token-level information from language models rather than using simple average-based aggregation of token embeddings. The method can use limited auxiliary outlier data when available.
Result: AP-OOD achieves state-of-the-art performance in OOD detection for text. In unsupervised settings, it reduces FPR95 from 27.84% to 4.67% on XSUM summarization and from 77.08% to 70.37% on WMT15 En-Fr translation.
Conclusion: AP-OOD demonstrates that leveraging token-level information in language models significantly improves OOD detection performance for natural language tasks, offering a flexible semi-supervised approach that can work with or without auxiliary outlier data.
Abstract: Out-of-distribution (OOD) detection, which maps high-dimensional data into a scalar OOD score, is critical for the reliable deployment of machine learning models. A key challenge in recent research is how to effectively leverage and aggregate token embeddings from language models to obtain the OOD score. In this work, we propose AP-OOD, a novel OOD detection method for natural language that goes beyond simple average-based aggregation by exploiting token-level information. AP-OOD is a semi-supervised approach that flexibly interpolates between unsupervised and supervised settings, enabling the use of limited auxiliary outlier data. Empirically, AP-OOD sets a new state of the art in OOD detection for text: in the unsupervised setting, it reduces the FPR95 (false positive rate at 95% true positives) from 27.84% to 4.67% on XSUM summarization, and from 77.08% to 70.37% on WMT15 En-Fr translation.
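The step beyond mean pooling can be sketched as a learnable query attending over token embeddings, with the pooled vector scored against in-distribution statistics. The module below is purely illustrative: the actual scoring rule, the semi-supervised objective, and the use of auxiliary outliers are as described in the paper, not here.

```python
import torch
import torch.nn as nn

class AttentionPoolingScorer(nn.Module):
    """Sketch of attention pooling for OOD scoring (not the AP-OOD implementation)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def pool(self, tokens):                          # tokens: (seq_len, dim)
        weights = torch.softmax(tokens @ self.query / tokens.shape[-1] ** 0.5, dim=0)
        return weights @ tokens                      # weighted sum instead of a plain mean

    def score(self, tokens, id_mean, id_cov_inv):
        z = self.pool(tokens)
        d = z - id_mean
        return (d @ id_cov_inv @ d).item()           # Mahalanobis-style OOD score

# toy usage with 128-dim token embeddings
scorer = AttentionPoolingScorer(dim=128)
id_mean, id_cov_inv = torch.zeros(128), torch.eye(128)
print(scorer.score(torch.randn(20, 128), id_mean, id_cov_inv))
```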
[546] Can vision language models learn intuitive physics from interaction?
Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz
Main category: cs.LG
TL;DR: Models trained via interaction (RL) don’t develop generalizable physical intuitions, failing to transfer between related tasks despite shared visual/physical properties.
Details
Motivation: Vision-language models lack physical world intuitions. While supervised fine-tuning helps on specific tasks, it doesn’t teach robust, generalizable physical rules. Cognitive science suggests interaction is needed for learning physical dynamics.
Method: Train models using reinforcement learning through interaction with environments. Compare models trained via interaction vs. other methods on physical reasoning tasks.
Result: Interaction-based training improves within-task performance but fails to produce generalizable physical intuitions. Models don’t reliably transfer between related tasks even when they share visual statistics and physical principles.
Conclusion: Simply adding interaction (via RL) isn’t sufficient for models to learn generalizable physical rules. New approaches are needed to develop robust physical intuitions that transfer across contexts.
Abstract: Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with the environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.
[547] WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting
Jiecheng Lu, Xu Han, Yan Sun, Shihao Yang
Main category: cs.LG
TL;DR: WAVE attention mechanism combines AR and MA components to enhance time series forecasting by capturing both long-range and local temporal patterns, achieving state-of-the-art results.
Details
Motivation: The paper addresses limitations in existing attention mechanisms for time series forecasting, particularly their ability to capture both long-range dependencies and local temporal patterns effectively. The authors note that decoder-only autoregressive Transformers have been overlooked for TSF tasks despite their potential.
Method: Proposes WAVE attention with AR and MA components inspired by ARMA models from statistics. Uses indirect MA weight generation to incorporate MA terms while maintaining efficiency of underlying attention models. Applies appropriate tokenization and training methods to decoder-only autoregressive Transformers for TSF tasks.
Result: WAVE attention consistently improves performance of various AR attentions on time series forecasting tasks, achieving state-of-the-art results. The decoder-only autoregressive Transformer with proper methods achieves comparable results to best baselines.
Conclusion: The ARMA structure in attention mechanisms effectively enhances temporal pattern modeling for time series forecasting, with WAVE attention providing a flexible framework that adapts to various attention mechanisms while maintaining efficiency.
Abstract: We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
[548] Solving Prior Distribution Mismatch in Diffusion Models via Optimal Transport
Zhanpeng Wang, Shenghao Li, Jiameng Che, Chen Wang, Shangling Jui, Na Lei, Zhongxuan Luo
Main category: cs.LG
TL;DR: A framework using Optimal Transport to eliminate prior error in Diffusion Models by matching forward terminal and reverse initial distributions.
Details
Motivation: Diffusion Models suffer from prior error due to mismatch between forward terminal and reverse initial distributions, causing sampling trajectory deviations, degraded generation quality, and constrained sampling efficiency.
Method: Proposes an Optimal Transport-based framework that constructs an OT map from reverse initial to forward terminal distribution for precise matching, quantifies prior error bound using Wasserstein distance, and leverages asymptotic consistency between dynamic OT and probability flow.
Result: The method completely eliminates prior error both theoretically and practically, providing a universal and rigorous solution for optimizing DM performance.
Conclusion: The OT-based framework effectively solves the prior error problem in Diffusion Models, enhancing generation quality and sampling efficiency through rigorous distribution matching.
Abstract: Diffusion Models (DMs) have achieved remarkable progress in generative modeling. However, the mismatch between the forward terminal distribution and reverse initial distribution introduces prior error, leading to deviations of sampling trajectories from the true distribution and severely limiting model performance. This issue further triggers cascading problems, including non-zero Signal-to-Noise Ratio, accumulated denoising errors, degraded generation quality, and constrained sampling efficiency. To address this issue, this paper proposes a prior error elimination framework based on Optimal Transport (OT). Specifically, an OT map from the reverse initial distribution to the forward terminal distribution is constructed to achieve precise matching of the two distributions. Meanwhile, the upper bound of the prior error is quantified using the Wasserstein distance, proving that the prior error can be effectively eliminated via the OT map. Additionally, by deriving the asymptotic consistency between dynamic OT and probability flow, this method is revealed to be highly compatible with the intrinsic mechanism of the diffusion process. Experimental results demonstrate that the proposed method completely eliminates the prior error both theoretically and practically, providing a universal and rigorous solution for optimizing the performance of DMs.
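On finite samples, matching the two distributions reduces to an assignment problem. The sketch below builds an empirical OT map between draws from the reverse initial distribution and the forward terminal distribution via SciPy's linear_sum_assignment under squared Euclidean cost; the paper's actual construction (a continuous OT map, the Wasserstein bound, and the dynamic-OT analysis) goes well beyond this toy.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def discrete_ot_map(source, target):
    """Empirical OT map between two equal-sized point clouds (squared Euclidean cost)."""
    cost = ((source[:, None, :] - target[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return cols                                          # index of the matched target sample

rng = np.random.default_rng(0)
reverse_init = rng.normal(size=(256, 2))                 # samples from N(0, I)
forward_terminal = rng.normal(loc=0.3, size=(256, 2))    # slightly mismatched terminal samples

match = discrete_ot_map(reverse_init, forward_terminal)
corrected_init = forward_terminal[match]                 # start reverse sampling from matched points
```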
[549] Multiple Invertible and Partial-Equivariant Function for Latent Vector Transformation to Enhance Disentanglement in VAEs
Hee-Jun Jung, Jaehyoung Jeong, Kangil Kim
Main category: cs.LG
TL;DR: MIPE-Transformation improves VAE disentanglement via invertible partial-equivariant transformations and exponential-family prior conversion
Details
Motivation: Disentanglement learning is crucial for understanding and reusing learned representations in VAEs, but effectively exploiting equivariance for disentanglement remains challenging despite previous exploration.
Method: Proposes MIPE-Transformation with two components: 1) IPE-Transformation for invertible latent-to-transformed-latent mapping while preserving partial input-to-latent equivariance, and 2) EF-Conversion to extend Gaussian prior to approximate exponential family via learnable conversion.
Result: Experiments on 3D Cars, 3D Shapes, and dSprites datasets show MIPE-Transformation improves disentanglement performance of state-of-the-art VAEs.
Conclusion: The proposed method effectively enhances disentanglement in VAEs through novel transformations and prior extensions.
Abstract: Disentanglement learning is central to understanding and reusing learned representations in variational autoencoders (VAEs). Although equivariance has been explored in this context, effectively exploiting it for disentanglement remains challenging. In this paper, we propose a novel method, called Multiple Invertible and Partial-Equivariant Transformation (MIPE-Transformation), which integrates two main parts: (1) Invertible and Partial-Equivariant Transformation (IPE-Transformation), guaranteeing an invertible latent-to-transformed-latent mapping while preserving partial input-to-latent equivariance in the transformed latent space; and (2) Exponential-Family Conversion (EF-Conversion) to extend the standard Gaussian prior to an approximate exponential family via a learnable conversion. In experiments on the 3D Cars, 3D Shapes, and dSprites datasets, MIPE-Transformation improves the disentanglement performance of state-of-the-art VAEs.
[550] EigenLoRAx: Recycling Adapters to Find Principal Subspaces for Resource-Efficient Adaptation and Inference
Prakhar Kaushik, Ankit Vaidya, Shravan Chaudhari, Alan Yuille
Main category: cs.LG
TL;DR: EigenLoRAx is a parameter-efficient finetuning method that recycles existing LoRA adapters to create a principal subspace for rapid adaptation to new tasks with minimal parameters.
Details
Motivation: Address environmental concerns and accessibility equity issues in large models by leveraging abundant publicly available LoRA adapters to create more efficient adaptation methods.
Method: Recycles pretrained adapters to create a principal subspace aligned with shared domain knowledge, augmented with orthogonal basis vectors for low-resource scenarios. Learns only lightweight coefficients on principal components instead of finetuning entire adapters.
Result: Requires significantly fewer parameters and memory, improves efficiency for both training and inference, demonstrates strong performance across diverse domains and tasks.
Conclusion: Offers scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.
Abstract: The rapid growth of large models has raised concerns about their environmental impact and equity in accessibility due to significant computational costs. Low-Rank Adapters (LoRA) offer a lightweight solution for finetuning large models, resulting in an abundance of publicly available adapters tailored to diverse domains. We ask: Can these pretrained adapters be leveraged to further streamline adaptation to new tasks while addressing these challenges? We introduce EigenLoRAx, a parameter-efficient finetuning method that recycles existing adapters to create a principal subspace aligned with their shared domain knowledge, which can be further augmented with orthogonal basis vectors in low-resource scenarios. This enables rapid adaptation to new tasks by learning only lightweight coefficients on the principal components of the subspace, eliminating the need to finetune entire adapters. EigenLoRAx requires significantly fewer parameters and memory, improving efficiency for both training and inference. Our method demonstrates strong performance across diverse domains and tasks, offering a scalable solution for edge-based applications, personalization, and equitable deployment of large models in resource-constrained environments.
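The recycling step can be sketched as an SVD over flattened adapter updates: the top singular vectors form the shared principal subspace, and a new task learns only coefficients on that basis. The shapes, the flattening choice, and the toy adapters below are assumptions for illustration, not the released implementation.

```python
import torch

def principal_subspace(adapter_deltas, rank=8):
    """Stack flattened adapter weight updates and keep the top singular vectors."""
    stacked = torch.stack([d.flatten() for d in adapter_deltas])   # (num_adapters, dim)
    _, _, vt = torch.linalg.svd(stacked, full_matrices=False)
    return vt[:rank]                                               # (rank, dim) shared basis

def adapt_with_coefficients(basis, coeffs, shape):
    """New-task update = learned coefficients times the fixed principal components."""
    return (coeffs @ basis).reshape(shape)

# toy example: 20 existing low-rank adapters for a 64x64 weight, reused for a new task
deltas = [torch.randn(64, 32) @ torch.randn(32, 64) * 0.01 for _ in range(20)]
basis = principal_subspace(deltas, rank=8)
coeffs = torch.zeros(8, requires_grad=True)       # only these coefficients are trained
delta_new = adapt_with_coefficients(basis, coeffs, (64, 64))
```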
[551] Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting
Jiecheng Lu, Shihao Yang
Main category: cs.LG
TL;DR: SAMoVAR is a linear Transformer variant for multivariate time series forecasting that aligns multi-layer linear attention with vector autoregressive (VAR) structure for improved performance, interpretability, and efficiency.
Details
Motivation: Existing multi-layer Transformers for time series forecasting have structural mismatches with autoregressive objectives, impairing interpretability and generalization. The paper aims to bridge the gap between linear attention mechanisms and VAR models to create more interpretable and effective forecasting architectures.
Method: The authors show that a single linear attention layer can be interpreted as a dynamic VAR structure. They then rearrange MLP, attention, and input-output flow to align multi-layer linear attention as a VAR model, proposing SAMoVAR which integrates interpretable dynamic VAR weights for multivariate time series forecasting.
Result: SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to state-of-the-art time series forecasting models by aligning Transformer architecture with autoregressive objectives.
Conclusion: By structurally aligning linear attention Transformers with VAR models, SAMoVAR provides a more interpretable and effective approach to multivariate time series forecasting while maintaining computational efficiency.
Abstract: Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
[552] ExplainReduce: Generating global explanations from many local explanations
Lauri Seppäläinen, Mudong Guo, Kai Puolamäki
Main category: cs.LG
TL;DR: ExplainReduce: A method to reduce many local XAI explanations into a small proxy set of simple models that can act as a generative global explanation for black-box models.
Details
Motivation: Most non-linear ML models are black-boxes, and while XAI tools like LIME, SHAP, and SLISEMAP provide local explanations, there’s a need to aggregate these into more manageable global explanations.
Method: Proposes ExplainReduce, which formulates the reduction of local explanations to a small proxy set as an optimization problem, solved efficiently using greedy heuristics.
Result: Show that as few as five explanations can faithfully emulate the black-box model, and the reduction procedure is competitive with other model aggregation methods.
Conclusion: ExplainReduce provides an effective way to create compact global explanations from numerous local XAI explanations, making complex models more interpretable.
Abstract: Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small “proxy set” of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics. We show that, for many problems, as few as five explanations can faithfully emulate the closed-box model and that our reduction procedure is competitive with other model aggregation methods.
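One plausible greedy heuristic can be sketched as weighted set cover: given which instances each local explanation approximates faithfully, repeatedly pick the explanation that covers the most still-uncovered instances. The fidelity matrix below is synthetic and the selection criterion is simplified relative to the optimisation problem in the paper.

```python
import numpy as np

def explain_reduce(fidelity, k=5):
    """Greedy sketch: entry (i, j) of `fidelity` says whether local explanation i
    approximates the closed-box model well on instance j; pick k explanations
    that together cover as many instances as possible."""
    covered = np.zeros(fidelity.shape[1], dtype=bool)
    chosen = []
    for _ in range(k):
        gains = (fidelity & ~covered).sum(axis=1)    # new instances each candidate would add
        best = int(np.argmax(gains))
        chosen.append(best)
        covered |= fidelity[best]
    return chosen, covered.mean()

# toy fidelity matrix: 200 local explanations x 1000 instances
rng = np.random.default_rng(0)
fidelity = rng.random((200, 1000)) < 0.05            # each local model fits ~5% of instances
proxy_set, coverage = explain_reduce(fidelity, k=5)
```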
[553] MaxSup: Overcoming Representation Collapse in Label Smoothing
Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
Main category: cs.LG
TL;DR: MaxSup regularization addresses Label Smoothing’s issues of overconfidence in misclassified samples and representation collapse by uniformly penalizing top-1 logits instead of ground-truth logits.
Details
Motivation: Label Smoothing has two critical issues: it induces overconfidence in misclassified samples and compacts feature representations into overly tight clusters, diluting intra-class diversity. The paper aims to understand the root causes and propose a better alternative.
Method: The authors analytically decompose LS-induced loss to identify two key terms: a regularization term for correct predictions and an error-amplification term for misclassifications. They propose Max Suppression (MaxSup) which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit.
Result: Through feature-space analyses, MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm MaxSup is a more robust alternative to LS.
Conclusion: MaxSup effectively addresses the shortcomings of Label Smoothing by providing more balanced regularization that prevents overconfidence in misclassifications and maintains better feature representation diversity.
Abstract: Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
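The change relative to Label Smoothing is essentially a one-line edit to the loss. The sketch below penalizes the top-1 logit (relative to the mean logit) on every sample instead of the ground-truth logit; the coefficient and the exact form of the penalty are assumptions based on the abstract, so treat it as illustrative rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, eps=0.1):
    """Standard label smoothing, for reference."""
    return F.cross_entropy(logits, target, label_smoothing=eps)

def maxsup_loss(logits, target, alpha=0.1):
    """Sketch of Max Suppression: keep ordinary cross-entropy, but regularize the
    top-1 logit instead of the ground-truth logit, so the penalty also bites on
    misclassified samples."""
    ce = F.cross_entropy(logits, target)
    penalty = (logits.max(dim=1).values - logits.mean(dim=1)).mean()
    return ce + alpha * penalty

# toy batch: 4 samples, 10 classes
logits = torch.randn(4, 10, requires_grad=True)
target = torch.tensor([1, 3, 5, 7])
print(float(label_smoothing_loss(logits, target)), float(maxsup_loss(logits, target)))
maxsup_loss(logits, target).backward()
```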
[554] Relational Graph Transformer
Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I. Kanatsoulis, Rishi Puri, Matthias Fey, Jure Leskovec
Main category: cs.LG
TL;DR: RelGT: A Graph Transformer architecture designed specifically for relational tables that outperforms GNN baselines by up to 18% on relational data tasks.
Details
Motivation: Graph Neural Networks have limitations in capturing complex structural patterns and long-range dependencies in relational data. Graph Transformers show promise but face challenges when applied to relational entity graphs: traditional positional encodings don’t generalize to massive heterogeneous graphs, existing architectures can’t model temporal dynamics and schema constraints, and tokenization schemes lose critical structural information.
Method: Introduces Relational Graph Transformer (RelGT) with multi-element tokenization that decomposes each node into five components: features, type, hop distance, time, and local structure. Combines local attention over sampled subgraphs with global attention to learnable centroids, enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation.
Result: Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
Conclusion: RelGT successfully addresses the unique challenges of applying Graph Transformers to relational data and demonstrates superior performance over traditional GNN approaches for relational deep learning tasks.
Abstract: Relational Deep Learning (RDL) is a promising approach for building state-of-the-art predictive models on multi-table relational data by representing it as a heterogeneous temporal graph. However, commonly used Graph Neural Network models suffer from fundamental limitations in capturing complex structural patterns and long-range dependencies that are inherent in relational data. While Graph Transformers have emerged as powerful alternatives to GNNs on general graphs, applying them to relational entity graphs presents unique challenges: (i) Traditional positional encodings fail to generalize to massive, heterogeneous graphs; (ii) existing architectures cannot model the temporal dynamics and schema constraints of relational data; (iii) existing tokenization schemes lose critical structural information. Here we introduce the Relational Graph Transformer (RelGT), the first graph transformer architecture designed specifically for relational tables. RelGT employs a novel multi-element tokenization strategy that decomposes each node into five components (features, type, hop distance, time, and local structure), enabling efficient encoding of heterogeneity, temporality, and topology without expensive precomputation. Our architecture combines local attention over sampled subgraphs with global attention to learnable centroids, incorporating both local and database-wide representations. Across 21 tasks from the RelBench benchmark, RelGT consistently matches or outperforms GNN baselines by up to 18%, establishing Graph Transformers as a powerful architecture for Relational Deep Learning.
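As a rough sketch of the multi-element tokenization described above, the module below builds one token per sampled node by summing embeddings of the five components. The component encoders, dimensions, and the assumption that the local-structure code is a discrete index are illustrative choices, not the RelGT design.

```python
import torch
import torch.nn as nn

class MultiElementTokenizer(nn.Module):
    """Illustrative sketch (not the authors' code): one token per node,
    combining features, type, hop distance, time, and local structure."""
    def __init__(self, feat_dim, n_types, max_hops, n_struct, d_model):
        super().__init__()
        self.feat = nn.Linear(feat_dim, d_model)          # raw node features
        self.type = nn.Embedding(n_types, d_model)        # table / node type
        self.hop = nn.Embedding(max_hops + 1, d_model)    # hop distance from the seed node
        self.time = nn.Linear(1, d_model)                 # relative timestamp
        self.struct = nn.Embedding(n_struct, d_model)     # local-structure code (assumed discrete)

    def forward(self, feats, types, hops, times, structs):
        return (self.feat(feats) + self.type(types) + self.hop(hops)
                + self.time(times.unsqueeze(-1)) + self.struct(structs))

tok = MultiElementTokenizer(feat_dim=16, n_types=5, max_hops=3, n_struct=32, d_model=64)
tokens = tok(torch.randn(10, 16), torch.randint(0, 5, (10,)),
             torch.randint(0, 4, (10,)), torch.rand(10), torch.randint(0, 32, (10,)))
print(tokens.shape)  # (10, 64): one token per sampled node
```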
[555] Guided Diffusion Sampling on Function Spaces with Applications to PDEs
Jiachen Yao, Abbas Mammadov, Julius Berner, Gavin Kerrigan, Jong Chul Ye, Kamyar Azizzadenesheli, Anima Anandkumar
Main category: cs.LG
TL;DR: FunDPS: A function-space diffusion framework for PDE inverse problems that recovers whole solutions from sparse/noisy measurements using neural operators and gradient-based guidance.
Details
Motivation: Addressing the challenge of recovering complete solutions from extremely sparse or noisy measurements in PDE-based inverse problems, where traditional methods struggle with data scarcity and discretization dependencies.Method: Trains an unconditional, discretization-agnostic denoising model using neural operator architectures, then refines samples via gradient-based guidance to satisfy sparse observation data. Extends Tweedie’s formula to infinite-dimensional Banach spaces for theoretical foundation.
Result: Achieves 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines with only 3% observation across five PDE tasks, while reducing sampling steps by 4x. Demonstrates strong cross-resolution generalizability.
Conclusion: First diffusion-based framework operating independently of discretization, offering practical and flexible solution for forward and inverse problems in PDE contexts with minimal supervision and severe data scarcity.
Abstract: We propose a general framework for conditional sampling in PDE-based inverse problems, targeting the recovery of whole solutions from extremely sparse or noisy measurements. This is accomplished by a function-space diffusion model and plug-and-play guidance for conditioning. Our method first trains an unconditional, discretization-agnostic denoising model using neural operator architectures. At inference, we refine the samples to satisfy sparse observation data via a gradient-based guidance mechanism. Through rigorous mathematical analysis, we extend Tweedie’s formula to infinite-dimensional Banach spaces, providing the theoretical foundation for our posterior sampling approach. Our method (FunDPS) accurately captures posterior distributions in function spaces under minimal supervision and severe data scarcity. Across five PDE tasks with only 3% observation, our method achieves an average 32% accuracy improvement over state-of-the-art fixed-resolution diffusion baselines while reducing sampling steps by 4x. Furthermore, multi-resolution fine-tuning ensures strong cross-resolution generalizability. To the best of our knowledge, this is the first diffusion-based framework to operate independently of discretization, offering a practical and flexible solution for forward and inverse problems in the context of PDEs. Code is available at https://github.com/neuraloperator/FunDPS
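The guidance mechanism above follows the familiar diffusion-posterior-sampling pattern: estimate the clean solution, measure its mismatch with the sparse observations, and step against that gradient. Below is a minimal tensor-based sketch of one such step; the paper's function-space formulation with neural operators differs in detail, and `denoiser`, the guidance scale, and the toy data are assumptions.

```python
import torch

def guided_step(x_t, denoiser, t, obs_values, obs_mask, guidance_scale=1.0):
    """One gradient-guidance step in the spirit of FunDPS (illustrative only).

    `denoiser` maps a noisy field x_t at time t to an estimate of the clean
    solution x0; the sample is then nudged so that x0 agrees with the sparse
    observations.
    """
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)
    data_misfit = ((x0_hat - obs_values)[obs_mask] ** 2).sum()
    grad = torch.autograd.grad(data_misfit, x_t)[0]
    return x_t.detach() - guidance_scale * grad

# Toy usage with a stand-in "denoiser" (a neural operator in the paper)
denoiser = lambda x, t: x * (1.0 - t)
x = torch.randn(1, 64, 64)
mask = torch.rand(1, 64, 64) < 0.03            # ~3% observed points, as in the summary
obs = torch.zeros(1, 64, 64)
x_next = guided_step(x, denoiser, t=0.5, obs_values=obs, obs_mask=mask)
```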
[556] Generalizable Trajectory Prediction via Inverse Reinforcement Learning with Mamba-Graph Architecture
Wenyun Li, Wenjie Huang, Zejian Deng, Chen Sun
Main category: cs.LG
TL;DR: Novel IRL framework for driving behavior modeling using Mamba blocks and graph attention networks to infer diverse reward functions for robust trajectory prediction across scenarios.
Details
Motivation: Accurate driving behavior modeling is challenging in complex traffic scenarios, and existing methods struggle with cross-scenario adaptability and generalization to unseen situations.Method: Inverse Reinforcement Learning framework that infers diverse reward functions, integrates Mamba blocks for efficient long-sequence dependency modeling, and uses graph attention networks to encode spatial interactions among traffic agents.
Result: Outperforms popular approaches in prediction accuracy, achieves 2.3× higher generalization to unseen scenarios compared to baselines, and demonstrates competitive adaptability in Out-of-Distribution settings.
Conclusion: The proposed IRL framework with Mamba and graph attention networks effectively captures human-like decision-making and enables robust cross-scenario adaptability for trajectory prediction.
Abstract: Accurate driving behavior modeling is fundamental to safe and efficient trajectory prediction, yet remains challenging in complex traffic scenarios. This paper presents a novel Inverse Reinforcement Learning (IRL) framework that captures human-like decision-making by inferring diverse reward functions, enabling robust cross-scenario adaptability. The learned reward function is utilized to maximize the likelihood of output by integrating Mamba blocks for efficient long-sequence dependency modeling with graph attention networks to encode spatial interactions among traffic agents. Comprehensive evaluations on urban intersections and roundabouts demonstrate that the proposed method not only outperforms various popular approaches in terms of prediction accuracy but also achieves 2.3 times higher generalization performance to unseen scenarios compared to other baselines, achieving adaptability in Out-of-Distribution settings that is competitive with fine-tuning.
[557] Dual Perspectives on Non-Contrastive Self-Supervised Learning
Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel
Main category: cs.LG
TL;DR: Analysis of stop gradient and exponential moving average procedures in self-supervised learning, showing they avoid representation collapse through optimization and dynamical systems perspectives.
Details
Motivation: To understand why stop gradient and exponential moving average procedures work in non-contrastive self-supervised learning to prevent representation collapse, despite not optimizing the original objective function.Method: Uses optimization theory and dynamical systems analysis to examine these procedures, particularly in linear settings. Shows that without these procedures, collapse always occurs, while with them, equilibria are asymptotically stable.
Result: Theoretical analysis proves that stop gradient and exponential moving average procedures avoid collapse by creating asymptotically stable equilibria, unlike the original objective which always leads to collapse. Empirical experiments with real and synthetic data support the findings.
Conclusion: Stop gradient and exponential moving average are essential mechanisms in non-contrastive self-supervised learning that prevent representation collapse through their dynamical properties, even though they don’t optimize the original objective.
Abstract: The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following Tian et al. (2021), but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
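Since the paper's subject is the stop-gradient and EMA mechanisms themselves, a minimal sketch of both may be useful; this toy loop omits the predictor head and other details that practical non-contrastive methods typically add.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 16))
target = copy.deepcopy(encoder)           # EMA copy, never updated by gradients
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
tau = 0.99                                # EMA decay

for _ in range(10):
    x1, x2 = torch.randn(64, 32), torch.randn(64, 32)   # two "views" (toy data)
    z1 = F.normalize(encoder(x1), dim=-1)
    with torch.no_grad():                  # stop gradient on the target branch
        z2 = F.normalize(target(x2), dim=-1)
    loss = -(z1 * z2).sum(dim=-1).mean()   # negative cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                  # EMA update of the target encoder
        for p_t, p_o in zip(target.parameters(), encoder.parameters()):
            p_t.mul_(tau).add_(p_o, alpha=1 - tau)
```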
[558] Learning to summarize user information for personalized reinforcement learning from human feedback
Hyunji Nam, Yanming Wan, Mickel Liu, Peter Ahnn, Jianxun Lian, Natasha Jaques
Main category: cs.LG
TL;DR: PLUS framework uses RL to learn personalized user preference summaries that condition reward models for personalized LLM responses, improving accuracy and enabling zero-shot personalization with SOTA models.
Details
Motivation: Current RLHF approaches assume uniform user preferences, but real-world LLM assistants need personalization to align with diverse user preferences and goals.Method: PLUS uses reinforcement learning to train both a user-summarization model and reward model simultaneously. The summarization model produces text-based summaries of each user’s preferences, characteristics, and past conversations, which then condition the reward model for personalized predictions.
Result: Achieves 11-77% improvement in reward model accuracy, 25% improvement over best personalized RLHF techniques, and 72% win rate for PLUS-summary-conditioned GPT-4 responses vs 28% for default GPT-4o.
Conclusion: PLUS enables effective personalization of LLM responses through interpretable user preference summaries, supporting robust performance with new users/topics and zero-shot personalization with proprietary models.
Abstract: As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align to different users’ preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model, meaning it assumes that everyone’s preferences are the same. We present a novel framework, Preference Learning Using Summarization (PLUS), that uses reinforcement learning (RL) to learn to produce text-based summaries of each user’s preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. Both the user-summarization model and reward model are trained simultaneously, creating an online co-adaptation loop. We show that in contrast to the standard Bradley-Terry model, summaries produced by PLUS capture diverse aspects of user preferences, achieving an 11-77% improvement in reward model accuracy. Key strengths of PLUS are: (1) robust performance with new users and conversation topics, achieving a 25% improvement over the best personalized reward model technique used for RLHF; (2) zero-shot personalization with state-of-the-art proprietary models like GPT-4 (e.g., PLUS-summary-conditioned responses achieved a 72% win rate compared to 28% for default GPT-4o); (3) learning from flexible user contexts beyond preference labels, and (4) interpretable representation of users, enabling greater transparency and user control in pluralistic LLM alignment.
[559] Amortized Sampling with Transferable Normalizing Flows
Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov
Main category: cs.LG
TL;DR: Prose is a 285M parameter transferable normalizing flow for zero-shot sampling of peptide conformations, enabling amortized molecular sampling across different sequences.
Details
Motivation: Traditional molecular sampling methods like MD and MCMC lack amortization - computational cost must be paid for each system. Learned samplers have shown limited transferability across different molecular systems.Method: Prose uses a 285M parameter all-atom transferable normalizing flow trained on peptide MD trajectories (up to 8 residues). It enables zero-shot uncorrelated proposal sampling for arbitrary peptide systems with efficient likelihood evaluation.
Result: Prose achieves previously intractable transferability across sequence length while maintaining efficient likelihood evaluation. It works well as a proposal for various sampling algorithms, with importance sampling-based fine-tuning achieving competitive performance to established methods.
Conclusion: Deep learning enables scalable and transferable molecular samplers. Prose demonstrates zero-shot transfer across peptide systems and opens new possibilities for amortized sampling in computational chemistry.
Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest towards overcoming this limitation through learning sampling algorithms. Despite performing competitively with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We demonstrate that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 285 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based fine-tuning procedure to achieve competitive performance to established methods such as sequential Monte Carlo. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and objectives.
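The importance-sampling-based fine-tuning mentioned in the results relies on the flow providing exact likelihoods for its proposals. A generic self-normalized importance sampling sketch is shown below; the callables are hypothetical stand-ins, not the Prose interface.

```python
import torch

def importance_estimate(flow_sample, flow_log_prob, target_log_prob, n=4096):
    """Self-normalized importance sampling with a flow as proposal (sketch).

    `flow_sample` / `flow_log_prob` stand in for a transferable flow's sampling
    and exact-likelihood interfaces, and `target_log_prob` for the unnormalized
    Boltzmann log-density of the system; all three are hypothetical callables.
    """
    x = flow_sample(n)
    log_w = target_log_prob(x) - flow_log_prob(x)   # importance log-weights
    w = torch.softmax(log_w, dim=0)                 # self-normalization
    ess = 1.0 / (w ** 2).sum()                      # effective sample size
    return x, w, ess

# Toy usage: standard-normal proposal targeting a shifted Gaussian.
proposal = torch.distributions.Normal(0.0, 1.0)
target = torch.distributions.Normal(0.5, 1.0)
x, w, ess = importance_estimate(lambda n: proposal.sample((n,)),
                                proposal.log_prob, target.log_prob)
print(float((w * x).sum()), float(ess))             # weighted mean is close to 0.5
```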
[560] Calibration and Transformation-Free Weight-Only LLMs Quantization via Dynamic Grouping
Xinzhe Zheng, Zhen-Qun Yang, Zishan Liu, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
Main category: cs.LG
TL;DR: MSB is a calibration-free, transformation-free post-training quantization method that generalizes binary quantization to multi-bit settings for efficient LLM deployment under memory constraints.
Details
Motivation: LLMs are difficult to deploy under tight memory and compute constraints. Existing low-bit post-training quantization methods typically rely on calibration data, auxiliary transformations, and GPU tools, which the authors aim to eliminate.Method: MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities from per-tensor to block-wise configurations (64-element groups per row) without calibration or intermediate transforms.
Result: On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup.
Conclusion: MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations, making LLM deployment more efficient under memory constraints.
Abstract: Large Language Models (LLMs) deliver strong performance but are difficult to deploy under tight memory and compute constraints. Low-bit post-training quantization (PTQ) is a promising direction; however, it typically relies on calibration data, auxiliary transformations, and GPU tools. To address these limitations, we propose MSB (Multi Scale Binary), a calibration-free and transformation-free PTQ method that generalizes binary quantization to multi-bit settings. MSB optimizes a dynamic grouping criterion that minimizes within-group variance, yielding group-wise multiscale levels that can be applied consistently across granularities from per-tensor to block-wise configurations with 64-element groups per row, without calibration or intermediate transforms. We implement the optimization in a CPU-based solver for the quantization step and evaluate using standard bfloat16 execution without low-bit packing. On Llama 3.2 3B, MSB achieves 8.43 perplexity on WikiText-2 under 4-bit weight-only block-wise quantization, compared to 7.81 in full precision and 12.23 with GPTQ in its default setup. Overall, MSB provides a new optimization perspective for low-bit PTQ while simplifying the pipeline by removing calibration and transformations.
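The summary specifies MSB's criterion (minimize within-group variance to obtain group-wise levels) but not its solver. As one illustrative way to realize that criterion on a single 64-element block, the sketch below runs a 1-D Lloyd-style iteration; this is an assumption for illustration, not the MSB algorithm.

```python
import numpy as np

def quantize_block(w, n_levels=16, iters=20):
    """Quantize one weight block with levels chosen to reduce within-group
    variance (a 1-D Lloyd/k-means iteration). Illustrative only: the entry
    above describes MSB's criterion but not its exact solver.
    """
    levels = np.quantile(w, np.linspace(0, 1, n_levels))        # initial levels
    for _ in range(iters):
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = w[idx == k].mean()                  # recenter each group
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx], idx, levels

rng = np.random.default_rng(0)
block = rng.normal(size=64).astype(np.float32)                  # one 64-element block
w_q, idx, levels = quantize_block(block, n_levels=16)           # 16 levels, roughly 4-bit
print(float(np.mean((block - w_q) ** 2)))                       # quantization MSE
```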
[561] On Entropy Control in LLM-RL Algorithms
Han Shen
Main category: cs.LG
TL;DR: AEnt: A new entropy control method for LLM-RL that addresses issues with conventional entropy regularization in large language model reinforcement learning by using clamped entropy bonus with automatic coefficient adjustment.
Details
Motivation: Conventional entropy regularization (used in PPO, SAC, A3C) works well in robotic and game RL but gives weak to no gains in LLM-RL training due to LLM's extremely large response space and sparsity of optimal outputs.Method: Proposes AEnt with clamped entropy bonus evaluated on re-normalized policy over smaller token space, encouraging exploration within compact response set. Automatically adjusts entropy coefficient based on clamped entropy value to control entropy-induced bias while leveraging benefits.
Result: AEnt outperforms baselines consistently across multiple math-reasoning benchmarks with different base models and datasets.
Conclusion: AEnt effectively addresses entropy control issues in LLM-RL by adapting entropy regularization to handle large response spaces and sparse optimal outputs, demonstrating superior performance in reasoning tasks.
Abstract: For RL algorithms, appropriate entropy control is crucial to their effectiveness. To control the policy entropy, a commonly used method is entropy regularization, which is adopted in various popular RL algorithms including PPO, SAC and A3C. Although entropy regularization has conventionally proven effective in robotics and game RL, studies have found that it gives weak to no gains in LLM-RL training. In this work, we study the issues of the entropy bonus in the LLM-RL setting. Specifically, we first argue that the conventional entropy regularization suffers from the LLM’s extremely large response space and the sparsity of the optimal outputs. As a remedy, we propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. The clamped entropy is evaluated with the re-normalized policy defined on a certain smaller token space, which encourages exploration within a more compact response set. In addition, the algorithm automatically adjusts the entropy coefficient according to the clamped entropy value, effectively controlling the entropy-induced bias while leveraging the entropy’s benefits. AEnt is tested in math-reasoning tasks under different base models and datasets, and it is observed that AEnt outperforms the baselines consistently across multiple benchmarks.
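A minimal sketch of the two ingredients named in the method, a clamped entropy bonus computed on a policy re-normalized over a smaller (here, top-k) token set and a simple automatic coefficient adjustment, is given below; the token space, clamp value, and update rule are assumptions, not the AEnt specification.

```python
import torch
import torch.nn.functional as F

def clamped_topk_entropy(logits, k=50, clamp_max=2.0):
    """Entropy of the policy re-normalized over its top-k tokens, clamped.
    Sketch of the AEnt ingredients described above; the paper's exact token
    space, clamp, and coefficient rule may differ.
    """
    topk = logits.topk(k, dim=-1).values
    p = F.softmax(topk, dim=-1)                       # re-normalized over top-k tokens
    ent = -(p * p.clamp_min(1e-12).log()).sum(-1)     # per-position entropy
    return ent.clamp(max=clamp_max).mean()

def update_coef(coef, clamped_ent, target_ent=1.0, lr=0.01):
    # Simple automatic adjustment: raise the bonus when entropy is below target.
    return max(0.0, coef + lr * (target_ent - float(clamped_ent)))

logits = torch.randn(4, 32, 1000)                     # (batch, seq, vocab) toy logits
ent = clamped_topk_entropy(logits)
coef = update_coef(0.01, ent)
bonus = -coef * ent                                   # term added to the policy loss
```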
[562] Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
Main category: cs.LG
TL;DR: SAM’s generalization improves with smaller micro-batch sizes due to implicit variance-based sharpness regularization; proposed RW-SAM mimics this effect while remaining parallelizable.
Details
Motivation: To understand why SAM performance improves monotonically as micro-batch size decreases (m-sharpness phenomenon), which is critical for distributed training but lacks rigorous explanation.Method: Used extended Stochastic Differential Equation (SDE) framework to analyze stochastic gradient noise, characterizing SAM variants (n-SAM and m-SAM). Proposed Reweighted SAM (RW-SAM) with sharpness-weighted sampling to mimic m-SAM benefits while maintaining parallelizability.
Result: Analysis revealed that stochastic perturbations induce implicit variance-based sharpness regularization whose strength increases as micro-batch size decreases. RW-SAM successfully mimics generalization benefits of m-SAM while remaining parallelizable.
Conclusion: The m-sharpness phenomenon in SAM is explained by implicit variance-based regularization, and RW-SAM provides a practical solution to achieve similar generalization benefits in distributed settings.
Abstract: Sharpness-aware minimization (SAM) has emerged as a highly effective technique to improve model generalization, but its underlying principles are not fully understood. We investigate m-sharpness, where SAM performance improves monotonically as the micro-batch size for computing perturbations decreases, a phenomenon critical for distributed training yet lacking rigorous explanation. We leverage an extended Stochastic Differential Equation (SDE) framework and analyze stochastic gradient noise (SGN) to characterize the dynamics of SAM variants, including n-SAM and m-SAM. Our analysis reveals that stochastic perturbations induce an implicit variance-based sharpness regularization whose strength increases as m decreases. Motivated by this insight, we propose Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate our theory and method.
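For readers unfamiliar with the m-sharpness setup, the sketch below shows an m-SAM style update in which the SAM perturbation is computed separately on each micro-batch and the gradients at the perturbed weights are accumulated. It illustrates the object the paper analyzes, not the proposed RW-SAM; the micro-batch size and rho are arbitrary.

```python
import torch
import torch.nn as nn

def m_sam_step(model, loss_fn, x, y, opt, rho=0.05, m=4):
    """One m-SAM style update (sketch): per-micro-batch SAM perturbation,
    gradients at the perturbed weights accumulated, single optimizer step."""
    opt.zero_grad()
    for xb, yb in zip(x.split(m), y.split(m)):
        # 1) gradient on the micro-batch to build the perturbation
        loss = loss_fn(model(xb), yb)
        grads = torch.autograd.grad(loss, model.parameters())
        norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
        eps = [rho * g / norm for g in grads]
        # 2) gradient at the perturbed weights, accumulated into .grad
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.add_(e)
        loss_fn(model(xb), yb).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                p.sub_(e)
    opt.step()

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
m_sam_step(model, nn.CrossEntropyLoss(), torch.randn(16, 10), torch.randint(0, 2, (16,)), opt)
```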
[563] TensLoRA: Tensor Alternatives for Low-Rank Adaptation
Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau
Main category: cs.LG
TL;DR: TensLoRA: A unified tensor-based framework for low-rank adaptation that aggregates LoRA updates into higher-order tensors, enabling mode-specific compression rates for vision and language tasks.
Details
Motivation: Current LoRA methods treat attention projection matrices independently for each Query, Key, and Value projection across layers, lacking a systematic framework for joint tensor-based adaptations that could enable more efficient parameter allocation based on modality and task requirements.Method: Introduces TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors, modeling a broad family of tensor-based low-rank adaptations. The formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to modality and task.
Result: Experiments on vision and language benchmarks show that the tensor construction directly impacts performance, sometimes outperforming standard LoRA under similar parameter counts.
Conclusion: TensLoRA provides a systematic framework for tensor-based low-rank adaptation that offers flexibility in parameter allocation across modalities and tasks, potentially improving efficiency over standard LoRA approaches.
Abstract: Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are treated as independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes outperforming standard LoRA under similar parameter counts.
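To make the idea of aggregating LoRA updates into a higher-order tensor concrete, the sketch below models one layer's Q/K/V updates as a single third-order tensor with a shared CP factorization. This is one member of the family TensLoRA describes, with factor shapes and initialization chosen here for illustration, not the paper's construction.

```python
import torch
import torch.nn as nn

class TensorLoRAQKV(nn.Module):
    """Sketch of a tensor-based joint adaptation: the Q/K/V updates of one
    attention layer form a third-order tensor with a shared CP factorization
    (projection factor P, output factor U, input factor V)."""
    def __init__(self, d_in, d_out, rank=8):
        super().__init__()
        self.P = nn.Parameter(torch.randn(3, rank) * 0.01)    # Q/K/V mode
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        self.V = nn.Parameter(torch.zeros(d_in, rank))        # zero init: no update at start

    def deltas(self):
        # (3, d_out, d_in): one low-rank update per projection, sharing U and V
        return torch.einsum('pr,or,ir->poi', self.P, self.U, self.V)

adapter = TensorLoRAQKV(d_in=64, d_out=64, rank=8)
dq, dk, dv = adapter.deltas()           # added to the frozen W_q, W_k, W_v
print(dq.shape)                          # torch.Size([64, 64])
```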
[564] SurvDiff: A Diffusion Model for Generating Synthetic Data in Survival Analysis
Marie Brockschmidt, Maresa Schröder, Stefan Feuerriegel
Main category: cs.LG
TL;DR: SurvDiff: A diffusion model for generating synthetic survival analysis data that jointly models covariates, event times, and censoring mechanisms with survival-tailored loss functions.
Details
Motivation: Survival analysis in clinical research faces challenges with incomplete event information due to censoring, making synthetic data generation difficult. Existing methods struggle to faithfully reproduce both event-time distributions and censoring mechanisms crucial for clinical research.Method: Proposes SurvDiff, an end-to-end diffusion model that jointly generates mixed-type covariates, event times, and right-censoring. Uses a survival-tailored loss function that encodes time-to-event structure and optimizes for downstream survival tasks.
Result: SurvDiff consistently outperforms state-of-the-art generative baselines across multiple medical datasets in both distributional fidelity and survival model evaluation metrics.
Conclusion: SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data, effectively capturing the data-generating mechanism while preserving censoring mechanisms.
Abstract: Survival analysis is a cornerstone of clinical research, modeling time-to-event outcomes such as metastasis, disease relapse, or patient death. Unlike standard tabular data, survival data often come with incomplete event information due to dropout or loss to follow-up. This poses unique challenges for synthetic data generation, where it is crucial for clinical research to faithfully reproduce both the event-time distribution and the censoring mechanism. In this paper, we propose SurvDiff, an end-to-end diffusion model specifically designed for generating synthetic data in survival analysis. SurvDiff is tailored to capture the data-generating mechanism by jointly generating mixed-type covariates, event times, and right-censoring, guided by a survival-tailored loss function. The loss encodes the time-to-event structure and directly optimizes for downstream survival tasks, which ensures that SurvDiff (i) reproduces realistic event-time distributions and (ii) preserves the censoring mechanism. We show that SurvDiff consistently outperforms state-of-the-art generative baselines in both distributional fidelity and survival model evaluation metrics across multiple medical datasets. To the best of our knowledge, SurvDiff is the first end-to-end diffusion model explicitly designed for generating synthetic survival data.
[565] VAO: Validation-Aligned Optimization for Cross-Task Generative Auto-Bidding
Yiqin Lv, Zhiyu Mou, Miao Xu, Jinghao Chen, Qi Wang, Yixiu Mao, Yun Qu, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng, Xiangyang Ji
Main category: cs.LG
TL;DR: VAO is a validation-aligned optimization method for generative auto-bidding that addresses data scarcity by adaptively reweighting cross-task data contributions based on validation performance, mitigating gradient bias from distribution shifts.
Details
Motivation: Generative auto-bidding suffers from data scarcity in small-scale settings with limited advertiser participation. Cross-task data sharing can help but introduces gradient bias due to distribution shifts across tasks, and existing methods aren't applicable to generative auto-bidding.Method: Propose Validation-Aligned Optimization (VAO) that adaptively reweights cross-task data contributions based on validation performance feedback. VAO aligns training dynamics to prioritize updates that improve generalization on the target task. Build a unified generative autobidding framework using a single model across multiple tasks with all available data.
Result: Extensive experiments on standard auto-bidding benchmarks validate the effectiveness of the approach.
Conclusion: VAO provides a principled data-sharing method for generative auto-bidding that effectively leverages auxiliary data while mitigating gradient bias from distribution shifts across tasks.
Abstract: Generative auto-bidding has demonstrated strong performance in online advertising, yet it often suffers from data scarcity in small-scale settings with limited advertiser participation. While cross-task data sharing is a natural remedy to mitigate this issue, naive approaches often introduce gradient bias due to distribution shifts across different tasks, and existing methods are not readily applicable to generative auto-bidding. In this paper, we propose Validation-Aligned Optimization (VAO), a principled data-sharing method that adaptively reweights cross-task data contributions based on validation performance feedback. Notably, VAO aligns training dynamics to prioritize updates that improve generalization on the target task, effectively leveraging auxiliary data and mitigating gradient bias. Building on VAO, we introduce a unified generative autobidding framework that generalizes across multiple tasks using a single model and all available task data. Extensive experiments on standard auto-bidding benchmarks validate the effectiveness of our approach.
[566] Bandits with Single-Peaked Preferences and Limited Resources
Omer Ben-Porat, Gur Keinan, Rotem Torkan
Main category: cs.LG
TL;DR: Online stochastic matching algorithm for budget-constrained matching with single-peaked user preferences, achieving efficient regret bounds.
Details
Motivation: Online matching with budget constraints is NP-hard without structural assumptions, making efficient online learning infeasible. The paper aims to overcome this by leveraging single-peaked preferences structure from social choice theory.Method: Develops efficient offline algorithm for budgeted matching with single-peaked preferences, then leverages it into online algorithm with PQ tree-based order approximation. Also creates UCB-like algorithm when structure is known.
Result: Achieves regret bound of Õ(UKT^{2/3}) with unknown structure and Õ(U√(TK)) with known single-peaked structure, both efficient algorithms.
Conclusion: Single-peaked preferences enable efficient online matching algorithms with provable regret bounds, overcoming computational hardness of general matching problems.
Abstract: We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences – a well-established structure in social choice theory, where users’ preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
[567] Auto-Rubric: Learning From Implicit Weights to Explicit Rubrics for Reward Modeling
Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding
Main category: cs.LG
TL;DR: Training-free framework for explicit reward modeling using natural language rubrics instead of neural weights, achieving state-of-the-art performance with minimal data
Details
Motivation: To address the opacity and data-hungry nature of conventional neural reward models by shifting from implicit weight-based parameterization to explicit natural language rubrics for better interpretability and data efficiencyMethod: Iterative rubric learning with verification-driven refinement and information-theoretic compression to create hierarchical rubric structures (high-level dimensions with concrete verification checks)
Result: Using only 70 preference pairs, rubric-guided judges outperform fully trained reward models on diverse benchmarks, achieving 80.91% on RewardBench2 with Qwen3-8B
Conclusion: Alignment signals are highly compressible and can be effectively captured through explicit symbolic search in natural language rubrics rather than continuous weight optimization
Abstract: Conventional reward modeling relies on gradient descent over neural weights, creating opaque, data-hungry “black boxes.” We propose a paradigm shift from implicit to explicit reward parameterization, recasting optimization from continuous weight spaces to the discrete space of natural language rubrics. We introduce a training-free framework based on iterative rubric learning: it locally induces discriminative criteria via verification-driven refinement, and globally compresses the candidate criteria pool into a compact core set by maximizing an information-theoretic coding rate objective. We organize the compressed core set into a hierarchical rubric structure – high-level evaluation dimensions supported by concrete verification checks – serving as an interpretable, portable reward function. Empirically, our approach challenges prevailing data scaling assumptions: using only 70 preference pairs, our rubric-guided judges outperform fully trained reward models on diverse benchmarks. For instance, Qwen3-8B equipped with our learned rubrics achieves 80.91% on RewardBench2, surpassing the specialized Skywork-Reward-V2-Qwen3-8B (78.20%). These results demonstrate that alignment signals are highly compressible and can be effectively captured through explicit symbolic search.
[568] Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
Joongkyu Lee, Seouh-won Yi, Min-hwan Oh
Main category: cs.LG
TL;DR: Online preference-based RL algorithm (M-AUPO) using ranking feedback with Plackett-Luce model achieves improved sample efficiency with larger action subsets, avoiding exponential dependence on unknown parameters.
Details
Motivation: Existing PbRL theoretical work focuses on pairwise comparisons, while recent works using multiple comparisons fail to show improved performance with richer ranking feedback despite its availability in applications like LLM alignment.Method: Proposes M-AUPO algorithm using Plackett-Luce model for ranking feedback over action subsets, selecting actions by maximizing average uncertainty within offered subsets.
Result: Achieves suboptimality gap of Õ(d/T √∑(1/|S_t|)) where |S_t| is subset size, showing direct improvement with larger subsets, and establishes near-matching lower bound Ω(d/(K√T)).
Conclusion: First theoretical result in PbRL with ranking feedback demonstrating explicit sample efficiency improvement with subset size, addressing limitations of previous works.
Abstract: We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged, motivated by PbRL’s recent empirical success, particularly in aligning large language models (LLMs), most existing studies focus only on pairwise comparisons. A few recent works (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024) have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve, and can even deteriorate, as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett-Luce (PL) model for ranking feedback over action subsets and propose M-AUPO, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that M-AUPO achieves a suboptimality gap of $\tilde{O}\left( \frac{d}{T} \sqrt{ \sum_{t=1}^T \frac{1}{|S_t|}} \right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter’s norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\Omega\left( \frac{d}{K \sqrt{T}} \right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.
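The ranking feedback model assumed above is Plackett-Luce, whose log-likelihood for a ranked subset is straightforward to write down. The sketch below implements that likelihood; the M-AUPO selection rule itself is not shown.

```python
import torch

def plackett_luce_log_likelihood(scores, ranking):
    """Log-likelihood of a full ranking under the Plackett-Luce model.

    `scores` are the latent utilities of the offered subset and `ranking`
    lists item indices from most to least preferred. A sketch of the feedback
    model the paper assumes, not of its algorithm.
    """
    s = scores[ranking]                                  # utilities in ranked order
    # P(ranking) = prod_j exp(s_j) / sum_{k >= j} exp(s_k)
    rev_logsumexp = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (s - rev_logsumexp).sum()

scores = torch.tensor([1.2, 0.3, -0.5, 0.8])
ranking = torch.tensor([0, 3, 1, 2])                     # best-to-worst feedback
print(float(plackett_luce_log_likelihood(scores, ranking)))
```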
[569] Differentiable Constraint-Based Causal Discovery
Jincheng Zhou, Mengbo Wang, Anqi He, Yumeng Zhou, Hessam Olya, Murat Kocaoglu, Bruno Ribeiro
Main category: cs.LG
TL;DR: A novel causal discovery method using differentiable d-separation scores via soft logic percolation theory, enabling gradient-based optimization of conditional independence constraints.
Details
Motivation: Existing causal discovery methods have limitations: constraint-based methods struggle with small sample sizes, while score-based methods lack explicit conditional independence testing. There's a need for a third approach that combines the strengths of both.Method: Develops differentiable d-separation scores using percolation theory with soft logic, enabling gradient-based optimization of conditional independence constraints for causal discovery.
Result: The method demonstrates robust performance in low-sample regimes, outperforming traditional constraint-based and score-based baselines on real-world datasets.
Conclusion: The proposed approach offers a promising third avenue for causal discovery that combines the rigor of constraint-based methods with the optimization flexibility of score-based methods.
Abstract: Causal discovery from observational data is a fundamental task in artificial intelligence, with far-reaching implications for decision-making, predictions, and interventions. Despite significant advances, existing methods can be broadly categorized as constraint-based or score-based approaches. Constraint-based methods offer rigorous causal discovery but are often hindered by small sample sizes, while score-based methods provide flexible optimization but typically forgo explicit conditional independence testing. This work explores a third avenue: developing differentiable $d$-separation scores, obtained through a percolation theory using soft logic. This enables the implementation of a new type of causal discovery method: gradient-based optimization of conditional independence constraints. Empirical evaluations demonstrate the robust performance of our approach in low-sample regimes, surpassing traditional constraint-based and score-based baselines on a real-world dataset. Code and data of the proposed method are publicly available at https://github.com/PurdueMINDS/DAGPA.
[570] TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Vladyslav Moroshan, Julien Siems, Arber Zela, Timur Carstensen, Frank Hutter
Main category: cs.LG
TL;DR: TempoPFN is a univariate time series foundation model using linear RNNs pre-trained on synthetic data, achieving state-of-the-art zero-shot forecasting performance with efficient parallelizable training.
Details
Motivation: Existing foundation models for zero-shot time series forecasting struggle with efficient long-horizon prediction and reproducibility, with synthetic-only approaches underperforming on challenging benchmarks.Method: Uses a GatedDeltaProduct architecture with state-weaving based on linear Recurrent Neural Networks, pre-trained exclusively on synthetic data from a comprehensive pipeline including stochastic differential equations, Gaussian processes, and audio synthesis with novel augmentations.
Result: Achieves top-tier competitive performance on Gift-Eval, fev-bench and Chronos-ZS benchmarks, outperforming all existing synthetic-only approaches and surpassing most models trained on real-world data, while being more efficient through fully parallelizable training and inference.
Conclusion: TempoPFN provides a reproducible foundation for time series forecasting with synthetic-only pre-training, offering efficient parallelizable training and strong zero-shot performance while open-sourcing the complete data generation pipeline.
Abstract: Foundation models for zero-shot time series forecasting face challenges in efficient long-horizon prediction and reproducibility, with existing synthetic-only approaches underperforming on challenging benchmarks. This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators, including stochastic differential equations, Gaussian processes, and audio synthesis, with novel augmentations. In zero-shot evaluations on the Gift-Eval, fev-bench and Chronos-ZS benchmarks, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully parallelizable training and inference. We open-source our complete data generation pipeline and training code, providing a reproducible foundation for future research.
[571] Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie
Main category: cs.LG
TL;DR: A comprehensive review paper on aligning diffusion models with human preferences and intentions, covering fundamentals, techniques, benchmarks, and evaluation methods.
Details
Motivation: Diffusion models have become dominant in generative modeling but often produce outputs misaligned with human intentions, containing undesired properties or harmful content. Inspired by successful alignment techniques in large language models, researchers are now focusing on aligning diffusion models with human expectations and preferences.Method: This is a review paper that systematically examines: 1) Fundamentals of alignment, 2) Alignment techniques for diffusion models, 3) Preference benchmarks, and 4) Evaluation methods for diffusion models. It provides a comprehensive survey of existing research in this emerging field.
Result: The paper presents the first comprehensive review of alignment techniques for diffusion models, organizing the field into coherent categories and identifying key research directions. It serves as a foundational resource for researchers and engineers working on aligning generative models with human preferences.
Conclusion: Alignment of diffusion models is an important emerging research area that addresses critical issues of safety, ethics, and usability in generative AI. The review identifies current challenges and promising future directions for developing diffusion models that better align with human intentions and preferences.
Abstract: Diffusion models have emerged as the leading paradigm in generative modeling, excelling in various applications. Despite their success, these models often misalign with human intentions and generate results with undesired properties or even harmful content. Inspired by the success and popularity of alignment in tuning large language models, recent studies have investigated aligning diffusion models with human expectations and preferences. This work mainly reviews alignment of diffusion models, covering advancements in fundamentals of alignment, alignment techniques of diffusion models, preference benchmarks, and evaluation for diffusion models. Moreover, we discuss key perspectives on current challenges and promising future directions on solving the remaining challenges in alignment of diffusion models. To the best of our knowledge, our work is the first comprehensive review paper for researchers and engineers to comprehend, practice, and research alignment of diffusion models.
[572] Sparse Attention Post-Training for Mechanistic Interpretability
Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
Main category: cs.LG
TL;DR: A post-training method that makes transformer attention sparse without performance loss, achieving ~0.4% connectivity while preserving pretraining loss, revealing redundant computation and enabling interpretability.
Details
Motivation: Transformers have dense attention patterns that are computationally expensive and difficult to interpret. The authors aim to show that attention can be made much sparser without sacrificing performance, suggesting that current dense attention contains significant redundancy and that sparsity could serve as a structural prior for more interpretable models.Method: A simple post-training method using flexible sparsity regularization under a constrained-loss objective. The approach applies sparsity regularization to transformer attention mechanisms after initial training, preserving the original pretraining loss while dramatically reducing attention connectivity. The method also uses cross-layer transcoders to analyze attention attribution.
Result: Reduced attention connectivity to ~0.4% of its original edges while retaining the original pretraining loss on models up to 7B parameters. Found that local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components with up to 100x fewer connecting edges. Sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives.
Conclusion: Transformer attention can be made orders of magnitude sparser, suggesting much of its computation is redundant. Sparsity may serve as a guiding principle for more structured and interpretable models, offering a pathway to simplify complex attention patterns while maintaining performance.
Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.
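The exact regularizer is not spelled out in the summary. As a toy illustration of sparsifying attention connectivity under a constrained-loss objective, the sketch below attaches learnable per-edge gates to a single attention head and penalizes them while keeping the task loss inside a budget; the gating mechanism, penalty weights, and budget handling are all assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Toy single-head attention with a learnable per-edge gate whose L1 norm
    is penalized post-training (illustrative of sparsifying connectivity)."""
    def __init__(self, seq_len, d):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(seq_len, seq_len))   # edge gates
        self.d = d

    def forward(self, q, k, v):
        logits = q @ k.transpose(-1, -2) / self.d ** 0.5
        attn = F.softmax(logits, dim=-1) * torch.sigmoid(self.gate)   # gated edges
        return attn @ v, attn

def objective(task_loss, attn, loss_budget, lam=0.01, penalty=10.0):
    # Constrained-loss shape: sparsify the attention pattern while keeping the
    # task loss within a budget of its pretraining value.
    sparsity = attn.abs().mean()
    return task_loss + lam * sparsity + penalty * torch.relu(task_loss - loss_budget)

layer = GatedAttention(seq_len=16, d=32)
q = k = v = torch.randn(2, 16, 32)
out, attn = layer(q, k, v)
loss = objective(task_loss=out.pow(2).mean(), attn=attn, loss_budget=1.0)  # stand-in task loss
loss.backward()
```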
[573] Test-Time Iterative Error Correction for Efficient Diffusion Models
Yunshan Zhong, Weiqi Yan, Yuxin Zhang
Main category: cs.LG
TL;DR: IEC is a test-time error correction method that reduces exponential error accumulation in efficient diffusion models by iteratively refining outputs without retraining.
Details
Motivation: Efficient diffusion models for resource-constrained devices suffer from approximation errors that accumulate exponentially across timesteps, degrading generation quality. These errors are difficult to correct post-deployment since model modifications are typically infeasible.Method: Iterative Error Correction (IEC) - a test-time method that mitigates inference-time errors by iteratively refining the model’s output. It’s theoretically proven to reduce error propagation from exponential to linear growth without requiring retraining or architectural changes.
Result: Extensive experiments show IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures. It enables flexible trade-off between performance and efficiency.
Conclusion: IEC is a practical and generalizable solution for test-time enhancement of efficient diffusion models, addressing the fundamental problem of error accumulation in resource-constrained deployment scenarios.
Abstract: With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model’s output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models.
[574] WebSTAR: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering
Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, Xia Song
Main category: cs.LG
TL;DR: WebSTAR: A scalable data synthesis pipeline for computer use agents that filters noisy rollouts at step-level, creates reasoning-augmented trajectories, and trains models that outperform state-of-the-art on WebVoyager benchmark.
Details
Motivation: Training computer use agents (CUAs) is difficult due to high GUI interaction costs and scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations which don't scale, while synthetic data from strong CUAs contains too many incorrect/suboptimal actions for effective imitation learning.Method: Step-level filtering pipeline that evaluates individual actions in noisy rollouts to retain only correct steps, complemented by reasoning augmentation for improved planning. Created WebSTAR dataset (13.3K trajectories, 267K graded steps) from OpenAI’s computer-use-preview model. Also created WebSCORE dataset of graded step-level actions and trained StepRM, a 7B multimodal process reward model distilled from o4-mini.
Result: Qwen-2.5-VL-Instruct models trained on WebSTAR: 7B model surpasses SoTA open-source CUA UI-TARS-1.5-7B by >15% on WebVoyager with only supervised finetuning. StepRM matches o4-mini’s grading quality while being far more efficient to deploy at scale.
Conclusion: Step-level filtering is a key principle for scalable CUA training. WebSTAR and WebSCORE datasets plus StepRM reward model provide practical tools to advance robust and efficient computer use agents.
Abstract: Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions constituting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 267K graded, reasoning-rich steps synthesized from OpenAI’s computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal process reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight process reward model (StepRM) as practical tools to advance robust and efficient CUAs.
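Step-level filtering itself is simple to express: grade each step of a rollout and keep only the ones judged correct. The sketch below shows that loop with a hypothetical `grade_step` judge and trajectory format; it is not the WebSTAR pipeline.

```python
from typing import Callable

def filter_rollout(trajectory: list[dict], grade_step: Callable[[dict], float],
                   threshold: float = 0.5) -> list[dict]:
    """Keep only the steps of a noisy rollout that a judge grades as correct."""
    kept = []
    for step in trajectory:
        score = grade_step(step)                # e.g. a judge model scoring this action
        if score >= threshold:
            kept.append({**step, "grade": score})   # retain only high-quality steps
    return kept

rollout = [
    {"observation": "search page", "action": "type('weather tomorrow')"},
    {"observation": "results page", "action": "click('ad banner')"},     # suboptimal step
    {"observation": "results page", "action": "click('first result')"},
]
dummy_judge = lambda s: 0.0 if "ad" in s["action"] else 0.9
print(filter_rollout(rollout, dummy_judge))     # 2 of 3 steps survive filtering
```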
[575] Vector Quantization using Gaussian Variational Autoencoder
Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang
Main category: cs.LG
TL;DR: Gaussian Quant (GQ) converts trained Gaussian VAEs into VQ-VAEs without additional training by using random Gaussian noise as codebook and finding closest vectors to posterior means.
Details
Motivation: VQ-VAEs are difficult to train due to discretization challenges. The authors propose a simpler approach that leverages existing Gaussian VAE training and converts them to discrete representations without additional training overhead.Method: Two-stage approach: 1) Train a Gaussian VAE under target divergence constraints (TDC) to prepare for conversion, 2) Convert to VQ-VAE by generating random Gaussian noise as codebook and quantizing posterior means to nearest noise vectors.
Result: GQ outperforms previous VQ-VAE methods (VQGAN, FSQ, LFQ, BSQ) on both UNet and ViT architectures. TDC also improves previous Gaussian VAE discretization methods like TokenBridge.
Conclusion: Gaussian Quant provides an effective way to obtain discrete representations from continuous VAEs, simplifying VQ-VAE training while achieving better performance than existing methods.
Abstract: Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in the supplementary materials.
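The conversion step is concrete enough to sketch directly: draw a random Gaussian codebook and snap each posterior mean to its nearest codeword. The code below does exactly that on toy latents; the codebook size is arbitrary, and the TDC training constraint is not reproduced.

```python
import torch

def gaussian_quant(posterior_mean, codebook_size=1024, seed=0):
    """Convert continuous Gaussian-VAE latents to discrete tokens in the spirit
    of the GQ summary above: a random Gaussian codebook plus nearest-neighbor
    assignment. Minimal sketch, not the authors' implementation.
    """
    d = posterior_mean.shape[-1]
    g = torch.Generator().manual_seed(seed)
    codebook = torch.randn(codebook_size, d, generator=g)   # random Gaussian codebook
    dists = torch.cdist(posterior_mean, codebook)            # (N, K) pairwise distances
    indices = dists.argmin(dim=-1)                           # discrete tokens
    return indices, codebook[indices]

mu = torch.randn(16, 8)                       # posterior means from a trained Gaussian VAE
tokens, quantized = gaussian_quant(mu)
print(tokens.shape, float((mu - quantized).pow(2).mean()))   # tokens and quantization MSE
```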
[576] From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning
Moritz Lampert, Christopher Blöcker, Ingo Scholtes
Main category: cs.LG
TL;DR: The paper critiques traditional batch-oriented evaluation for dynamic link prediction, showing it causes information leakage and inconsistent tasks, and proposes reformulating it as link forecasting for fairer comparisons.
Details
Motivation: Traditional batch-oriented evaluation for dynamic link prediction is problematic because it groups edges into fixed-sized batches regardless of their temporal occurrence, leading to information loss or leakage and creating inconsistent time windows that skew model performance and hinder fair method comparisons.Method: The authors empirically demonstrate issues with batch-based evaluation and propose reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information in the data.
Result: The paper shows that traditional batch-based evaluation leads to skewed model performance and unfair comparisons between methods, which can be mitigated by the proposed link forecasting approach.
Conclusion: Dynamic link prediction should be evaluated using link forecasting rather than batch-oriented approaches to properly account for temporal information and enable fair method comparisons.
Abstract: Dynamic link prediction is an important problem considered in many recent works that propose approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on continuous-time and discrete-time temporal graph datasets, typically using a traditional batch-oriented evaluation setup. However, as we show in this work, a batch-oriented evaluation is often unsuitable and can cause several issues. Grouping edges into fixed-sized batches regardless of their occurrence time leads to information loss or leakage, depending on the temporal granularity of the data. Furthermore, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. In this work, we empirically show how traditional batch-based evaluation leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data.
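To make the batching issue concrete, a toy illustration (the event stream, batch size, and horizon are invented for demonstration): fixed-size batches over a bursty stream cover wall-clock windows of very different lengths, whereas fixed-horizon windows keep the forecasting task consistent.

```python
import numpy as np

rng = np.random.default_rng(0)
# Bursty toy event stream: a dense phase followed by a sparse phase.
inter_arrival = np.concatenate([rng.exponential(0.2, 500), rng.exponential(2.0, 500)])
timestamps = inter_arrival.cumsum()

# Fixed-SIZE batches (traditional evaluation): every batch holds 200 edges,
# but the duration each batch spans varies by roughly an order of magnitude.
batch_size = 200
durations = [b[-1] - b[0] for b in np.array_split(timestamps, len(timestamps) // batch_size)]
print("fixed-size batch durations:", np.round(durations, 1))

# Fixed-DURATION windows (link forecasting view): every window spans the same
# horizon; the edge count per window varies instead, as it should.
horizon = 50.0
counts, _ = np.histogram(timestamps, bins=np.arange(0.0, timestamps[-1] + horizon, horizon))
print("edges per fixed-duration window:", counts)
```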
[577] GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
Jiaying Zhang, Lei Shi, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He
Main category: cs.LG
TL;DR: GeoRA is a geometry-aware low-rank adaptation method designed specifically for Reinforcement Learning with Verifiable Rewards (RLVR) that addresses optimization instability and spectral collapse issues in existing parameter-efficient methods.
Details
Motivation: Existing parameter-efficient methods like PiSSA and MiLoRA are designed for Supervised Fine-Tuning but fail to account for the distinct optimization dynamics and geometric structures of RLVR, leading to spectral collapse and optimization instability when applied directly.Method: GeoRA exploits the anisotropic and compressible nature of RL update subspaces by initializing adapters via Singular Value Decomposition (SVD) within geometrically constrained subspaces while freezing residual components, preserving pre-trained geometric structure and enabling efficient GPU computation.
Result: GeoRA consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art results, and shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
Conclusion: GeoRA effectively addresses geometric misalignment issues in RLVR, providing a parameter-efficient adaptation method that maintains optimization stability while achieving superior performance on reasoning tasks.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
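A hedged sketch of the general idea of SVD-based adapter initialization with a frozen residual, in the spirit of the method described; the exact subspace constraint that makes GeoRA geometry-aware is not reproduced, and the rank and shapes are illustrative.

```python
import torch

def svd_lowrank_adapter_init(W, rank=8):
    """Sketch: initialize a low-rank adapter from the top singular directions
    of a pre-trained weight W and freeze the residual spectrum (similar in
    spirit to PiSSA-style initializations)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    # Adapter carries the top-r principal directions (trainable).
    B = U[:, :rank] * S[:rank].sqrt()             # (out, r)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]  # (r, in)
    # Residual keeps the remaining spectrum and stays frozen.
    W_res = (U[:, rank:] * S[rank:]) @ Vh[rank:]
    return W_res.detach(), torch.nn.Parameter(B), torch.nn.Parameter(A)

W = torch.randn(64, 32)
W_res, B, A = svd_lowrank_adapter_init(W, rank=4)
# Effective weight during fine-tuning: frozen residual plus trainable low-rank term.
print(torch.allclose(W_res + B @ A, W, atol=1e-5))  # True at initialization
```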
[578] A Policy Gradient-Based Sequence-to-Sequence Method for Time Series Prediction
Qi Sima, Xinze Zhang, Yukun Bao, Siyue Yang, Liang Shen
Main category: cs.LG
TL;DR: A reinforcement learning approach for sequence-to-sequence time series forecasting that learns adaptive input selection to mitigate exposure bias and error compounding in multi-step prediction.
Details
Motivation: Standard sequence-to-sequence models for time series prediction face a dilemma: using ground truth inputs during training creates exposure bias (mismatch with inference), while using model's own outputs causes error compounding. Current methods don't effectively address both issues simultaneously.Method: Proposes a reinforcement learning framework with policy gradient optimization. Auxiliary models generate plausible input candidates, and a trainable policy network dynamically selects the most beneficial inputs to maximize long-term prediction performance.
Result: Empirical evaluations on diverse time series datasets show the approach enhances both accuracy and stability in multi-step forecasting compared to conventional methods.
Conclusion: The reinforcement learning-based adaptive input selection strategy effectively addresses exposure bias and error compounding in sequence-to-sequence time series prediction, improving multi-step forecasting performance.
Abstract: Sequence-to-sequence architectures built upon recurrent neural networks have become a standard choice for multi-step-ahead time series prediction. In these models, the decoder produces future values conditioned on contextual inputs, typically either actual historical observations (ground truth) or previously generated predictions. During training, feeding ground-truth values helps stabilize learning but creates a mismatch between training and inference conditions, known as exposure bias, since such true values are inaccessible during real-world deployment. On the other hand, using the model’s own outputs as inputs at test time often causes errors to compound rapidly across prediction steps. To mitigate these limitations, we introduce a new training paradigm grounded in reinforcement learning: a policy gradient-based method to learn an adaptive input selection strategy for sequence-to-sequence prediction models. Auxiliary models first synthesize plausible input candidates for the decoder, and a trainable policy network optimized via policy gradients dynamically chooses the most beneficial inputs to maximize long-term prediction performance. Empirical evaluations on diverse time series datasets confirm that our approach enhances both accuracy and stability in multi-step forecasting compared to conventional methods.
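A heavily simplified sketch of the input-selection idea: a small policy chooses which candidate the decoder consumes at each step and is updated with REINFORCE using the negative multi-step error as reward. The networks, candidate generator, and reward are illustrative assumptions, not the paper's architecture.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
steps, hidden = 6, 16
decoder = nn.GRUCell(1, hidden)
readout = nn.Linear(hidden, 1)
policy  = nn.Linear(hidden + 2, 2)            # scores the two input candidates

def rollout(context, targets):
    h = torch.zeros(1, hidden)
    x = context.view(1, 1)
    log_probs, preds = [], []
    for t in range(steps):
        h = decoder(x, h)
        y_hat = readout(h)
        preds.append(y_hat)
        # Candidate inputs for the next step: the model's own output vs. an
        # auxiliary "plausible" candidate (here simply noisy ground truth).
        candidates = torch.stack([y_hat.view(-1), targets[t] + 0.1 * torch.randn(1)])
        logits = policy(torch.cat([h.view(-1), candidates.view(-1)]))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        x = candidates[a].view(1, 1)
    preds = torch.cat(preds).view(-1)
    reward = -(preds - targets).pow(2).mean()            # long-horizon reward
    policy_loss = -(reward.detach() * torch.stack(log_probs).sum())
    forecast_loss = (preds - targets).pow(2).mean()
    return policy_loss + forecast_loss

targets = torch.sin(torch.linspace(0, 1, steps))
loss = rollout(torch.tensor(0.0), targets)
loss.backward()
print(float(loss))
```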
[579] EEG Foundation Models: Progresses, Benchmarking, and Open Problems
Dingkun Liu, Yuheng Chen, Zhu Chen, Zhenyao Cui, Yaozhi Wen, Jiayu An, Jingwei Luo, Dongrui Wu
Main category: cs.LG
TL;DR: Comprehensive evaluation of 12 EEG foundation models across 13 datasets shows that linear probing is often insufficient, specialist models remain competitive, and larger models don’t necessarily improve generalization under current data regimes.
Details
Motivation: There's a lack of fair and comprehensive comparisons of existing EEG foundation models due to inconsistent pre-training objectives, preprocessing choices, and downstream evaluation protocols. This paper aims to fill this gap by providing systematic evaluation.Method: Reviewed 50 representative models and organized design choices into a unified taxonomic framework. Evaluated 12 open-source foundation models and specialist baselines across 13 EEG datasets spanning 9 BCI paradigms. Used cross-subject generalization (leave-one-subject-out) and rapid calibration (within-subject few-shot) protocols. Compared full-parameter fine-tuning vs linear probing, and examined model scale vs performance relationships.
Result: 1) Linear probing is frequently insufficient for optimal performance; 2) Specialist models trained from scratch remain competitive across many tasks; 3) Larger foundation models do not necessarily yield better generalization performance under current data regimes and training practices.
Conclusion: The study provides comprehensive benchmarking of EEG foundation models, revealing important insights about transfer learning strategies, model scale effects, and the continued relevance of specialized approaches in brain-computer interfaces.
Abstract: Electroencephalography (EEG) foundation models have recently emerged as a promising paradigm for brain-computer interfaces (BCIs), aiming to learn transferable neural representations from large-scale heterogeneous recordings. Despite rapid progress, fair and comprehensive comparisons of existing EEG foundation models are lacking, due to inconsistent pre-training objectives, preprocessing choices, and downstream evaluation protocols. This paper fills this gap. We first review 50 representative models and organize their design choices into a unified taxonomic framework including data standardization, model architectures, and self-supervised pre-training strategies. We then evaluate 12 open-source foundation models and competitive specialist baselines across 13 EEG datasets spanning nine BCI paradigms. Emphasizing real-world deployments, we consider both cross-subject generalization under a leave-one-subject-out protocol and rapid calibration under a within-subject few-shot setting. We further compare full-parameter fine-tuning with linear probing to assess the transferability of pre-trained representations, and examine the relationship between model scale and downstream performance. Our results indicate that: 1) linear probing is frequently insufficient; 2) specialist models trained from scratch remain competitive across many tasks; and 3) larger foundation models do not necessarily yield better generalization performance under current data regimes and training practices.
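A minimal sketch of the leave-one-subject-out linear-probing protocol described above, with random features standing in for frozen foundation-model embeddings; dataset sizes, labels, and the probe itself are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 64))            # stand-in for frozen EEG embeddings
y = rng.integers(0, 2, size=300)              # binary paradigm labels
subjects = np.repeat(np.arange(10), 30)       # 10 subjects, 30 trials each

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Linear probe: only a logistic regression head is trained on top of
    # frozen representations; full fine-tuning would retrain the encoder too.
    probe = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], probe.predict(X[test_idx])))
print("LOSO accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
```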
[580] GAMformer: Bridging Tabular Foundation Models and Interpretable Machine Learning
Andreas Mueller, Julien Siems, Harsha Nori, David Salinas, Arber Zela, Rich Caruana, Frank Hutter
Main category: cs.LG
TL;DR: GAMformer is a tabular foundation model for Generalized Additive Models that enables interpretable in-context learning for tabular data, trained exclusively on synthetic data to prevent leakage.
Details
Motivation: Existing tabular foundation models lack interpretability needed for safety-critical applications, while traditional GAMs are incompatible with in-context learning paradigms. There's a need to bridge foundation model power with interpretability requirements.Method: GAMformer estimates GAM shape functions in a single forward pass using in-context learning, representing a departure from conventional iterative approaches. It’s trained exclusively on synthetically generated tables to prevent data leakage.
Result: GAMformer performs comparably to other leading GAMs across various classification benchmarks, demonstrating effective interpretable tabular modeling.
Conclusion: GAMformer successfully bridges the gap between foundation model power and interpretability requirements for tabular data, enabling in-context learning with GAM interpretability.
Abstract: While interpretability is crucial for machine learning applications in safety-critical domains and for regulatory compliance, existing tabular foundation models like TabPFN lack transparency. Generalized Additive Models (GAMs) provide the needed interpretability through their additive structure, but traditional GAM methods rely on iterative learning algorithms (such as splines, boosted trees, or neural networks) that are fundamentally incompatible with the in-context learning paradigm of foundation models. In this paper, we introduce GAMformer, the first tabular foundation model for GAMs that bridges the gap between the power of foundation models and the interpretability requirements of critical real-world applications. GAMformer estimates GAM shape functions in a single forward pass using in-context learning, representing a significant departure from conventional iterative approaches. Building on previous research on tabular foundation models, we train GAMformer exclusively on synthetically generated tables to prevent data leakage. Our experiments demonstrate that GAMformer performs comparably to other leading GAMs across various classification benchmarks.
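For readers less familiar with GAMs, a small sketch of what an estimated shape-function representation looks like at prediction time, regardless of whether it was fit by splines, boosting, or (as here) a single forward pass: each feature contributes an additive term read off a per-feature lookup, and the prediction is their sum. The bin edges and shape values below are hand-made toy values.

```python
import numpy as np

def gam_predict(X, bin_edges, shape_values, intercept=0.0):
    """Additive prediction: logit = intercept + sum_j f_j(x_j), with each f_j
    stored as a binned lookup table (one common way to represent GAM shapes)."""
    logit = np.full(X.shape[0], intercept)
    for j in range(X.shape[1]):
        bins = np.clip(np.digitize(X[:, j], bin_edges[j]) - 1, 0, len(shape_values[j]) - 1)
        logit += shape_values[j][bins]           # additive per-feature contribution
    return 1.0 / (1.0 + np.exp(-logit))          # probability for binary classification

# Toy example with two features and hand-made shape functions.
edges = [np.linspace(-3, 3, 11)] * 2
shapes = [np.linspace(-1, 1, 10), np.sin(np.linspace(0, np.pi, 10))]
X = np.random.default_rng(0).standard_normal((5, 2))
print(gam_predict(X, edges, shapes))
```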
[581] Representation Geometry as a Diagnostic for Out-of-Distribution Robustness
Ali Zia, Farid Hazratian
Main category: cs.LG
TL;DR: Geometry-based diagnostic framework using embedding structure to predict model robustness under distribution shift without target labels
Details
Motivation: Current methods struggle to monitor and optimize model generalization under distribution shift without target-domain labels, as models with similar in-distribution accuracy can have very different OOD performance. There's a need for post-hoc diagnostic signals beyond training-time regularization and low-order representation statistics.Method: Proposes a geometry-based diagnostic framework that constructs class-conditional mutual k-nearest-neighbor graphs from in-distribution embeddings. Extracts two invariants: 1) global spectral complexity proxy based on reduced log-determinant of normalized Laplacian, and 2) local smoothness measure based on Ollivier-Ricci curvature.
Result: Across multiple architectures, training regimes, and corruption benchmarks, lower spectral complexity and higher mean curvature consistently predict stronger OOD accuracy across checkpoints. Controlled perturbations and topological analyses show these signals reflect meaningful representation structure rather than superficial embedding statistics.
Conclusion: Representation geometry enables interpretable, label-free robustness diagnosis and supports reliable unsupervised checkpoint selection under distribution shift.
Abstract: Robust generalization under distribution shift remains difficult to monitor and optimize in the absence of target-domain labels, as models with similar in-distribution accuracy can exhibit markedly different out-of-distribution (OOD) performance. While prior work has focused on training-time regularization and low-order representation statistics, little is known about whether the geometric structure of learned embeddings provides reliable post-hoc signals of robustness. We propose a geometry-based diagnostic framework that constructs class-conditional mutual k-nearest-neighbor graphs from in-distribution embeddings and extracts two complementary invariants: a global spectral complexity proxy based on the reduced log-determinant of the normalized Laplacian, and a local smoothness measure based on Ollivier–Ricci curvature. Across multiple architectures, training regimes, and corruption benchmarks, we find that lower spectral complexity and higher mean curvature consistently predict stronger OOD accuracy across checkpoints. Controlled perturbations and topological analyses further show that these signals reflect meaningful representation structure rather than superficial embedding statistics. Our results demonstrate that representation geometry enables interpretable, label-free robustness diagnosis and supports reliable unsupervised checkpoint selection under distribution shift.
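A hedged sketch of the global invariant only: build a mutual k-NN graph over embeddings and summarize its normalized Laplacian spectrum. The mean-log-eigenvalue summary below is a stand-in for the paper's reduced log-determinant, the curvature-based local measure is omitted, and the random embeddings stand in for class-conditional features.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def spectral_complexity(embeddings, k=10, eps=1e-6):
    """Toy spectral-complexity proxy: mutual k-NN graph, normalized Laplacian,
    mean log of its positive eigenvalues (a reduced log-determinant stand-in)."""
    A = kneighbors_graph(embeddings, n_neighbors=k, mode="connectivity").toarray()
    A = np.minimum(A, A.T)                        # mutual k-NN: keep edges present in both directions
    d = A.sum(1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, eps)), 0.0)
    L = np.eye(len(A)) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    evals = np.linalg.eigvalsh(L)
    evals = evals[evals > eps]                    # drop (near-)zero eigenvalues
    return float(np.mean(np.log(evals)))

Z = np.random.default_rng(0).standard_normal((200, 32))   # stand-in for one class's embeddings
print(spectral_complexity(Z, k=10))
```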
[582] Energy Guided smoothness to improve Robustness in Graph Classification
Farooq Ahmad Wani, Maria Sofia Bucarelli, Andrea Giuseppe Di Francesco, Oleksandr Pryymak, Fabrizio Silvestri
Main category: cs.LG
TL;DR: GNNs struggle with noisy labels; their robustness is linked to reducing Dirichlet Energy of node representations; two training strategies are introduced to enhance robustness without harming noise-free performance.
Details
Motivation: Graph Neural Networks (GNNs) are powerful for graph classification but real-world applications often contain noisy labels. The paper aims to understand GNN robustness to label noise and develop strategies to improve it.Method: 1) Study GNN failure modes under noisy labels; 2) Establish empirical/theoretical links between GNN robustness and reduction of total Dirichlet Energy; 3) Introduce two training strategies: removing negative eigenvalues from weight matrices (connected to Dirichlet Energy minimization) and extending a loss penalty that promotes learned smoothness.
Result: The paper demonstrates GNN failure modes in noisy label scenarios and shows that robustness is connected to smoothness inductive bias. The proposed training strategies enhance GNN robustness without negatively impacting performance in noise-free settings.
Conclusion: GNN robustness to label noise stems from their smoothness inductive bias. The proposed methods effectively improve robustness while maintaining performance in clean settings, supporting the hypothesis that smoothness is key to GNN robustness.
Abstract: Graph Neural Networks (GNNs) are powerful at solving graph classification tasks, yet applied problems often contain noisy labels. In this work, we study GNN robustness to label noise and demonstrate GNN failure modes when models struggle to generalise on low-order graphs, under low label coverage, or when a model is over-parameterized. We establish both empirical and theoretical links between GNN robustness and the reduction of the total Dirichlet Energy of learned node representations, which encapsulates the hypothesized GNN smoothness inductive bias. Finally, we introduce two training strategies to enhance GNN robustness: (1) by incorporating a novel inductive bias in the weight matrices through the removal of negative eigenvalues, connected to Dirichlet Energy minimization; (2) by extending to GNNs a loss penalty that promotes learned smoothness. Importantly, neither approach negatively impacts performance in noise-free settings, supporting our hypothesis that the source of GNNs' robustness is their smoothness inductive bias.
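Two small sketches of the quantities involved, under illustrative assumptions: the Dirichlet energy of node features on a graph, and the eigenvalue-clipping operation applied to a symmetrized weight matrix (how symmetry is arranged inside an actual GNN layer is not shown here).

```python
import numpy as np

def clip_negative_eigenvalues(W):
    """Project a (symmetrized) weight matrix onto the PSD cone by zeroing
    its negative eigenvalues."""
    W_sym = 0.5 * (W + W.T)
    evals, evecs = np.linalg.eigh(W_sym)
    return (evecs * np.maximum(evals, 0.0)) @ evecs.T

def dirichlet_energy(X, A):
    """Total Dirichlet energy of node representations X on a graph with
    adjacency A: trace(X^T L X) with combinatorial Laplacian L = D - A."""
    L = np.diag(A.sum(1)) - A
    return float(np.trace(X.T @ L @ X))

rng = np.random.default_rng(0)
A = (rng.random((20, 20)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T                                      # undirected toy graph
X = rng.standard_normal((20, 8))                 # toy node representations
print("Dirichlet energy:", dirichlet_energy(X, A))
print("min eigenvalue after clipping >= 0:",
      np.linalg.eigvalsh(clip_negative_eigenvalues(rng.standard_normal((8, 8)))).min() >= -1e-9)
```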
[583] Learning to Discover at Test Time
Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
Main category: cs.LG
TL;DR: TTT-Discover uses test-time reinforcement learning with LLMs to discover state-of-the-art solutions for scientific problems across domains like mathematics, GPU engineering, algorithms, and biology.
Details
Motivation: To develop AI systems that can discover new state-of-the-art solutions for scientific problems through test-time learning, moving beyond frozen LLM prompting to enable continual learning specifically tailored to each problem.Method: Uses reinforcement learning at test time where the LLM continues to train with experience specific to each problem. The approach prioritizes promising solutions through specialized learning objectives and search subroutines designed for single-problem optimization rather than generalization.
Result: Achieves new state-of-the-art results across multiple domains: (1) Erdős’ minimum overlap problem and autocorrelation inequality, (2) GPU kernel engineering (up to 2× faster), (3) past AtCoder algorithm competitions, and (4) single-cell analysis denoising. All results achieved with open models and reproducible code.
Conclusion: TTT-Discover demonstrates that test-time training with LLMs can effectively discover novel solutions to diverse scientific problems, outperforming previous methods while using open models and maintaining reproducibility.
Abstract: How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős’ minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
[584] Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning
Anthony Kobanda, Rémy Portelas, Odalric-Ambrym Maillard, Ludovic Denoyer
Main category: cs.LG
TL;DR: HiSPO: Hierarchical framework for continual reinforcement learning in navigation tasks that prevents forgetting while adapting to new tasks from offline data.
Details
Motivation: Address the challenge of continual RL where agents must adapt to new tasks while retaining previously acquired skills, particularly important for autonomous robotics and video game navigation with changing environments.Method: HiSPO uses hierarchical framework with distinct policy subspaces of neural networks to enable flexible adaptation to new tasks while preserving existing knowledge, designed specifically for continual learning in navigation from offline data.
Result: Demonstrated effectiveness in MuJoCo maze environments and complex video game-like navigation simulations, showing competitive performance and good adaptability with respect to continual learning metrics, particularly memory usage and efficiency.
Conclusion: HiSPO provides an effective hierarchical solution for continual reinforcement learning in navigation tasks, successfully balancing adaptation to new tasks with preservation of past knowledge.
Abstract: We consider a Continual Reinforcement Learning setup, where a learning agent must continuously adapt to new tasks while retaining previously acquired skill sets, with a focus on the challenge of avoiding forgetting past gathered knowledge and ensuring scalability with the growing number of tasks. Such issues prevail in autonomous robotics and video game simulations, notably for navigation tasks prone to topological or kinematic changes. To address these issues, we introduce HiSPO, a novel hierarchical framework designed specifically for continual learning in navigation settings from offline data. Our method leverages distinct policy subspaces of neural networks to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like navigation simulations, showcasing competitive performances and satisfying adaptability with respect to classical continual learning metrics, in particular regarding the memory usage and efficiency.
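A generic sketch of what a subspace of policies can look like: the effective policy weights are a convex combination of a few anchor networks, so per-task adaptation can be as cheap as learning the mixing weights. This is a standard subspace construction written for illustration, not HiSPO's hierarchical scheme; the network sizes and number of anchors are assumptions.

```python
import torch, torch.nn as nn

class PolicySubspace(nn.Module):
    """Policy whose weights are a convex combination of K anchor linear policies."""
    def __init__(self, obs_dim, act_dim, n_anchors=3):
        super().__init__()
        self.anchors = nn.ModuleList(nn.Linear(obs_dim, act_dim) for _ in range(n_anchors))
        self.mix_logits = nn.Parameter(torch.zeros(n_anchors))   # per-task parameters

    def forward(self, obs):
        w = torch.softmax(self.mix_logits, dim=0)                 # convex combination weights
        weight = sum(w[i] * a.weight for i, a in enumerate(self.anchors))
        bias = sum(w[i] * a.bias for i, a in enumerate(self.anchors))
        return torch.tanh(obs @ weight.T + bias)

policy = PolicySubspace(obs_dim=4, act_dim=2)
actions = policy(torch.randn(5, 4))
print(actions.shape)   # torch.Size([5, 2])
```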
[585] VFScale: Intrinsic Reasoning through Verifier-Free Test-time Scalable Diffusion Model
Tao Zhang, Jia-Shu Pan, Ruiqi Feng, Tailin Wu
Main category: cs.LG
TL;DR: VFScale introduces a verifier-free test-time scalable diffusion model for complex reasoning tasks like Maze and Sudoku, using intrinsic energy functions as verifiers and hybrid Monte Carlo Tree Search for efficient inference.
Details
Motivation: Current diffusion models lack test-time scaling for complex reasoning tasks, relying on external verifiers unlike human intrinsic reasoning, and suffer from inefficient search algorithms.Method: VFScale uses MRNCL loss and KL regularization to improve energy landscape (making energy function a reliable verifier), and integrates denoising with hybrid Monte Carlo Tree Search (hMCTS) for efficient inference.
Result: On Maze reasoning tasks, VFScale trained on 6×6 mazes solves 88% of much larger 15×15 mazes, while standard diffusion models completely fail.
Conclusion: VFScale enables scalable intrinsic reasoning in diffusion models without external verifiers, demonstrating strong generalization on complex reasoning tasks.
Abstract: Inspired by human SYSTEM 2 thinking, LLMs excel at complex reasoning tasks via extended Chain-of-Thought. However, similar test-time scaling for diffusion models to tackle complex reasoning remains largely unexplored. From existing work, two primary challenges emerge in this setting: (i) the dependence on an external verifier indicating a notable gap from intrinsic reasoning of human intelligence without any external feedback, and (ii) the lack of an efficient search algorithm. In this paper, we introduce the Verifier-free Test-time Scalable Diffusion Model (VFScale) to achieve scalable intrinsic reasoning, which equips number-of-sample test-time scaling with the intrinsic energy function of diffusion models as the verifier. Concretely, VFScale comprises two key innovations to address the aforementioned challenges. On the training side, VFScale consists of a novel MRNCL loss and a KL regularization to improve the energy landscape, ensuring that the learned energy function itself serves as a reliable verifier. On the inference side, VFScale integrates the denoising process with a novel hybrid Monte Carlo Tree Search (hMCTS) to improve search efficiency. On challenging reasoning tasks of Maze and Sudoku, we demonstrate the effectiveness of VFScale’s training objective and scalable inference method. In particular, trained with Maze sizes of up to $6\times6$, our VFScale solves 88% of Maze problems with much larger sizes of $15\times15$, while standard diffusion models completely fail. The code can be found at https://github.com/AI4Science-WestlakeU/VFScale.
[586] GenIAS: Generator for Instantiating Anomalies in time Series
Zahra Zamanzadeh Darban, Qizhou Wang, Geoffrey I. Webb, Shirui Pan, Charu C. Aggarwal, Mahsa Salehi
Main category: cs.LG
TL;DR: GenIAS is a synthetic anomaly generation method for time series anomaly detection that uses learnable perturbations in VAE latent space to create diverse, realistic anomalies for training better detection models.
Details
Motivation: Existing synthetic anomaly injection methods for time series anomaly detection rely on hand-crafted strategies that fail to capture diverse and complex anomalous patterns, especially in multivariate settings, limiting their effectiveness.Method: Proposes GenIAS which generates anomalies via learnable perturbations in the latent space of a variational autoencoder, using variational reparameterization to inject abnormal patterns across temporal segments at varying scales. Introduces joint learning of perturbation scale and compact latent representations via a tunable prior.
Result: Extensive experiments show GenIAS produces more diverse and realistic anomalies, and detection models trained with these anomalies outperform 17 baseline methods on 9 popular TSAD benchmarks.
Conclusion: GenIAS provides an effective approach for synthetic anomaly generation that improves time series anomaly detection performance by creating more realistic and diverse training anomalies.
Abstract: Synthetic anomaly injection is a recent and promising approach for time series anomaly detection (TSAD), but existing methods rely on ad hoc, hand-crafted strategies applied to raw time series that fail to capture diverse and complex anomalous patterns, particularly in multivariate settings. We propose a synthetic anomaly generation method named Generator for Instantiating Anomalies in Time Series (GenIAS), which generates realistic and diverse anomalies via a novel learnable perturbation in the latent space of a variational autoencoder. This enables abnormal patterns to be injected across different temporal segments at varying scales based on variational reparameterization. To generate anomalies that align with normal patterns while remaining distinguishable, we introduce a learning strategy that jointly learns the perturbation scale and compact latent representations via a tunable prior, which improves the distinguishability of generated anomalies, as supported by our theoretical analysis. Extensive experiments show that GenIAS produces more diverse and realistic anomalies, and that detection models trained with these anomalies outperform 17 baseline methods on 9 popular TSAD benchmarks.
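A minimal sketch of injecting anomalies in a VAE's latent space: encode a normal window, perturb the latent with a learnable scale via the reparameterization trick, and decode the perturbed latent as a synthetic anomaly. The encoder, decoder, and window size below are placeholders, and GenIAS's training objective is not reproduced.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
win, latent = 32, 8
encoder = nn.Sequential(nn.Linear(win, 64), nn.ReLU(), nn.Linear(64, 2 * latent))
decoder = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, win))
log_pert_scale = nn.Parameter(torch.zeros(latent))       # learnable perturbation scale (assumed form)

def generate_anomaly(x):
    mu, log_var = encoder(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)       # normal latent sample
    z_anom = z + torch.exp(log_pert_scale) * torch.randn_like(z)   # perturbed (anomalous) latent
    return decoder(z), decoder(z_anom)

x = torch.sin(torch.linspace(0, 6.28, win)).unsqueeze(0)           # one normal window
recon, anomaly = generate_anomaly(x)
print(recon.shape, anomaly.shape)   # torch.Size([1, 32]) twice
```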
[587] Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models
Antti Koskela, Tejas Kulkarni
Main category: cs.LG
TL;DR: Decentralized differential privacy framework for gossip-based averaging with node-level noise, showing privacy guarantees similar to centralized Gaussian mechanisms with O(T) sensitivity growth.
Details
Motivation: Achieving differential privacy in fully decentralized ML is challenging due to lack of central aggregator and varying trust assumptions among nodes. Need to analyze privacy leakage in decentralized gossip-based averaging algorithms.Method: Presents analytical framework based on linear systems formulation to characterize privacy leakage between nodes in graph. Analyzes DP guarantees of gossip-based averaging with additive node-level noise from arbitrary node views.
Result: DP guarantees are those of Gaussian mechanism with squared sensitivity asymptotically O(T) (where T is training rounds), similar to central aggregation. Excess risk of decentralized private learning for strongly convex losses is asymptotically similar to centralized private learning.
Conclusion: Decentralized private learning can achieve privacy guarantees comparable to centralized approaches, enabling privacy-preserving decentralized ML with formal DP analysis.
Abstract: Achieving differential privacy (DP) guarantees in fully decentralized machine learning is challenging due to the absence of a central aggregator and varying trust assumptions among nodes. We present a framework for DP analysis of decentralized gossip-based averaging algorithms with additive node-level noise, from arbitrary views of nodes in a graph. We present an analytical framework based on a linear systems formulation that accurately characterizes privacy leakage between nodes. Our main contribution is showing that the DP guarantees are those of a Gaussian mechanism, where the growth of the squared sensitivity is asymptotically $O(T)$, where $T$ is the number of training rounds, similarly as in the case of central aggregation. As an application of the sensitivity analysis, we show that the excess risk of decentralized private learning for strongly convex losses is asymptotically similar as in centralized private learning.
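A toy simulation of the setting being analyzed: gossip averaging on a ring with additive node-level Gaussian noise before each exchange. The graph, noise level, and number of rounds are invented for illustration, and the DP accounting itself (Gaussian mechanism with squared sensitivity growing as O(T)) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, sigma = 8, 50, 0.05
# Doubly-stochastic gossip matrix for a ring: each node averages with its neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i], W[i, (i - 1) % n], W[i, (i + 1) % n] = 0.5, 0.25, 0.25

x = rng.standard_normal(n)          # private local values
true_mean = x.mean()
for t in range(T):
    x = W @ (x + sigma * rng.standard_normal(n))   # noise added before each exchange
print("consensus error:", abs(x - true_mean).max())
```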
[588] Zenith: Scaling up Ranking Models for Billion-scale Livestreaming Recommendation
Ruifeng Zhang, Zexi Huang, Zikai Wang, Ke Sun, Bohang Zheng, Yuchen Jiang, Zhe Chen, Zhen Ouyang, Huimin Xie, Phil Shen, Junlin Zhang, Yuchao Zheng, Wentao Guo, Qinglei Wang
Main category: cs.LG
TL;DR: Zenith is a scalable ranking architecture for recommender systems that efficiently captures complex feature interactions with minimal runtime overhead, achieving significant performance gains on TikTok Live.
Details
Motivation: While scaling model capacity is important for recommender system performance, prior work hasn't adequately addressed efficient feature handling and scaling without excessive inference latency. The paper aims to solve this by creating a scalable architecture that learns complex feature interactions efficiently.Method: Zenith uses a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules to capture feature interactions. The architecture is designed to handle feature heterogeneity efficiently and exhibits superior scaling laws compared to other ranking methods.
Result: Deployed on TikTok Live, Zenith achieved +1.05%/-1.10% improvements in online CTR AUC and Logloss, with +9.93% gains in Quality Watch Session/User and +8.11% in Quality Watch Duration/User.
Conclusion: Zenith provides an effective solution for scalable recommender systems that can capture complex feature interactions with minimal runtime overhead, demonstrating real-world impact on a major livestreaming platform.
Abstract: Accurately capturing feature interactions is essential in recommender systems, and recent trends show that scaling up model capacity could be a key driver for next-level predictive performance. While prior work has explored various model architectures to capture multi-granularity feature interactions, relatively little attention has been paid to efficient feature handling and scaling model capacity without incurring excessive inference latency. In this paper, we address this by presenting Zenith, a scalable and efficient ranking architecture that learns complex feature interactions with minimal runtime overhead. Zenith is designed to handle a few high-dimensional Prime Tokens with Token Fusion and Token Boost modules, which exhibits superior scaling laws compared to other state-of-the-art ranking methods, thanks to its improved token heterogeneity. Its real-world effectiveness is demonstrated by deploying the architecture to TikTok Live, a leading online livestreaming platform that attracts billions of users globally. Our A/B test shows that Zenith achieves +1.05%/-1.10% in online CTR AUC and Logloss, and realizes +9.93% gains in Quality Watch Session / User and +8.11% in Quality Watch Duration / User.
[589] Rethinking Multi-Modal Learning from Gradient Uncertainty
Peizheng Guo, Jingyao Wang, Wenwen Qiang, Jiahuan Zhou, Changwen Zheng, Gang Hua
Main category: cs.LG
TL;DR: BOGC-MML proposes a Bayesian-oriented gradient calibration method for multi-modal learning that models gradient uncertainty and uses evidence theory to weight gradients based on reliability.
Details
Motivation: Existing multi-modal learning optimization focuses on mitigating gradient direction conflicts, but performance fluctuations persist even in non-conflict settings. The authors argue that gradient reliability (uncertainty) is a decisive factor that needs explicit modeling beyond just gradient direction.Method: BOGC-MML models gradients as probability distributions to capture uncertainty, interprets gradient precision as evidence within subjective logic and evidence theory, and uses a reduced Dempster’s combination rule to aggregate signals and weight gradients adaptively based on reliability.
Result: Extensive experiments demonstrate the effectiveness and advantages of the proposed method, showing improved performance through better gradient calibration.
Conclusion: Explicit modeling of gradient uncertainty and reliability is crucial for multi-modal learning optimization, and the Bayesian-oriented approach with evidence theory provides an effective framework for gradient calibration.
Abstract: Multi-Modal Learning (MML) integrates information from diverse modalities to improve predictive accuracy. While existing optimization strategies have made significant strides by mitigating gradient direction conflicts, we revisit MML from a gradient-based perspective to explore further improvements. Empirically, we observe an interesting phenomenon: performance fluctuations can persist in both conflict and non-conflict settings. Based on this, we argue that: beyond gradient direction, the intrinsic reliability of gradients acts as a decisive factor in optimization, necessitating the explicit modeling of gradient uncertainty. Guided by this insight, we propose Bayesian-Oriented Gradient Calibration for MML (BOGC-MML). Our approach explicitly models gradients as probability distributions to capture uncertainty, interpreting their precision as evidence within the framework of subjective logic and evidence theory. By subsequently aggregating these signals using a reduced Dempster’s combination rule, BOGC-MML adaptively weights gradients based on their reliability to generate a calibrated update. Extensive experiments demonstrate the effectiveness and advantages of the proposed method.
[590] Are Your Generated Instances Truly Useful? GenBench-MILP: A Benchmark Suite for MILP Instance Generation
Yidong Luo, Chenguang Wang, Dong Li, Tianshu Yu
Main category: cs.LG
TL;DR: GenBench-MILP is a benchmark suite for evaluating Mixed-Integer Linear Programming instance generators across four dimensions: mathematical validity, structural similarity, computational hardness, and downstream utility, with novel analysis of solver-internal features.
Details
Motivation: Current evaluation of MILP instance generators relies on superficial metrics that fail to capture true computational complexity, creating a gap in assessing whether generated instances are truly useful and realistic for real-world problems.Method: Introduces GenBench-MILP benchmark suite with four evaluation dimensions: 1) mathematical validity, 2) structural similarity, 3) computational hardness, and 4) utility in downstream tasks. Key innovation is analyzing solver-internal features like root node gaps, heuristic success rates, and cut plane usage to capture dynamic solver behavior.
Result: Experiments show that instances with high structural similarity scores can still exhibit drastically divergent solver interactions and difficulty levels, revealing limitations of current evaluation approaches.
Conclusion: GenBench-MILP provides a comprehensive evaluation toolkit for rigorous comparison of MILP instance generators and guides development of high-fidelity generators by capturing nuanced computational characteristics missed by static metrics.
Abstract: The proliferation of machine learning-based methods for Mixed-Integer Linear Programming (MILP) instance generation has surged, driven by the need for diverse training datasets. However, a critical question remains: Are these generated instances truly useful and realistic? Current evaluation protocols often rely on superficial structural metrics or simple solvability checks, which frequently fail to capture the true computational complexity of real-world problems. To bridge this gap, we introduce GenBench-MILP, a comprehensive benchmark suite designed for the standardized and objective evaluation of MILP generators. Our framework assesses instance quality across four key dimensions: mathematical validity, structural similarity, computational hardness, and utility in downstream tasks. A distinctive innovation of GenBench-MILP is the analysis of solver-internal features – including root node gaps, heuristic success rates, and cut plane usage. By treating the solver’s dynamic behavior as an expert assessment, we reveal nuanced computational discrepancies that static graph features miss. Our experiments on instance generative models demonstrate that instances with high structural similarity scores can still exhibit drastically divergent solver interactions and difficulty levels. By providing this multifaceted evaluation toolkit, GenBench-MILP aims to facilitate rigorous comparisons and guide the development of high-fidelity instance generators.
[591] Avoiding Premature Collapse: Adaptive Annealing for Entropy-Regularized Structural Inference
Yizhi Liu
Main category: cs.LG
TL;DR: The paper identifies premature mode collapse in entropy-regularized optimal transport for differentiable matching layers, proposes EPH-ASC adaptive scheduling to stabilize training, and demonstrates effectiveness on large-scale datasets.
Details
Motivation: Differentiable matching layers using entropy-regularized Optimal Transport (OT) are unstable when recovering discrete permutations via annealing ε→0, causing training failures in structural prediction and architectural scaling.Method: Analyzes Sinkhorn fixed-point map dynamics, identifies thermodynamic speed limit problem, and proposes Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC) algorithm that monitors inference stability and adaptively schedules annealing.
Result: EPH-ASC stabilizes Manifold-Constrained Hyper-Connections (mHC) during large-scale training on FineWeb-Edu dataset, preventing late-stage gradient explosions by enforcing linear stability law.
Conclusion: The proposed adaptive stability control addresses fundamental instability in OT-based differentiable matching, enabling reliable training of complex architectures with structural prediction components.
Abstract: Differentiable matching layers and residual connection paradigms, often implemented via entropy-regularized Optimal Transport (OT), serve as critical mechanisms in structural prediction and architectural scaling. However, recovering discrete permutations or maintaining identity mappings via annealing $ε\to 0$ is notoriously unstable. In this work, we identify a fundamental mechanism for this failure: \textbf{Premature Mode Collapse}. By analyzing the non-normal dynamics of the Sinkhorn fixed-point map, we reveal a theoretical thermodynamic speed limit: standard exponential cooling outpaces the contraction rate of the inference operator, which degrades as $O(1/ε)$. To address this, we propose \textbf{Efficient Piecewise Hybrid Adaptive Stability Control (EPH-ASC)}, an adaptive scheduling algorithm that monitors the stability of the inference process. We demonstrate that EPH-ASC is essential for stabilizing Manifold-Constrained Hyper-Connections (mHC) during large-scale training on the FineWeb-Edu dataset, effectively preventing late-stage gradient explosions by enforcing a linear stability law.
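A compact illustration of adaptive (rather than fixed exponential) cooling for entropy-regularized matching: run Sinkhorn at the current temperature, check a convergence residual, and only cool aggressively when the iteration is stable. The stability monitor, cooling factors, and temperature floor are illustrative choices, not the EPH-ASC schedule.

```python
import numpy as np

def sinkhorn(C, eps, iters=200):
    """Entropy-regularized Sinkhorn iteration for a square cost C with uniform
    marginals; returns the plan and a marginal-violation residual."""
    K = np.exp(-C / eps)
    u = np.ones(len(C))
    for _ in range(iters):
        v = 1.0 / (K.T @ u)
        u = 1.0 / (K @ v)
    P = u[:, None] * K * v[None, :]
    residual = np.abs(P.sum(1) - 1.0).max() + np.abs(P.sum(0) - 1.0).max()
    return P, residual

rng = np.random.default_rng(0)
C = rng.random((16, 16))
eps, schedule = 1.0, []
while eps > 5e-3:                                # floor keeps exp(-C/eps) representable
    P, res = sinkhorn(C, eps)
    schedule.append((eps, res))
    eps *= 0.5 if res < 1e-3 else 0.9            # cool fast only when inference is stable
print([f"{e:.3f}" for e, _ in schedule][:10])
```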
[592] Statistically Valid Post-Deployment Monitoring Should Be Standard for AI-Based Digital Health
Pavel Dolin, Weizhi Li, Gautam Dasarathy, Visar Berisha
Main category: cs.LG
TL;DR: Position paper advocating for statistically valid, label-efficient testing frameworks for post-deployment monitoring of clinical AI systems to ensure reliability and safety in real-world deployment.
Details
Motivation: Current post-deployment monitoring in clinical AI is underdeveloped, with only 9% of FDA-registered AI healthcare tools having surveillance plans. Existing approaches are manual, sporadic, and reactive, making them unsuitable for dynamic clinical environments.Method: Proposes framing detection of data changes and model performance degradation as distinct statistical hypothesis testing problems, using label-efficient and statistically valid methods that provide explicit error rate guarantees and support formal inference.
Result: Establishes a principled foundation for post-deployment monitoring that aligns with regulatory requirements, ensures reproducibility, and provides a scientifically sound basis for maintaining clinical AI system reliability.
Conclusion: Grounding clinical AI monitoring in statistical rigor enables reproducible, scientifically sound practices and opens new research directions for detection, attribution, and mitigation of post-deployment model failures in real-world settings.
Abstract: This position paper argues that post-deployment monitoring in clinical AI is underdeveloped and proposes statistically valid and label-efficient testing frameworks as a principled foundation for ensuring reliability and safety in real-world deployment. A recent review found that only 9% of FDA-registered AI-based healthcare tools include a post-deployment surveillance plan. Existing monitoring approaches are often manual, sporadic, and reactive, making them ill-suited for the dynamic environments in which clinical models operate. We contend that post-deployment monitoring should be grounded in label-efficient and statistically valid testing frameworks, offering a principled alternative to current practices. We use the term “statistically valid” to refer to methods that provide explicit guarantees on error rates (e.g., Type I/II error), enable formal inference under pre-defined assumptions, and support reproducibility–features that align with regulatory requirements. Specifically, we propose that the detection of changes in the data and model performance degradation should be framed as distinct statistical hypothesis testing problems. Grounding monitoring in statistical rigor ensures a reproducible and scientifically sound basis for maintaining the reliability of clinical AI systems. Importantly, it also opens new research directions for the technical community–spanning theory, methods, and tools for statistically principled detection, attribution, and mitigation of post-deployment model failures in real-world settings.
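A small sketch of framing monitoring as a hypothesis test: compare a monitored score between a reference window and a recent deployment window, and alarm only when the p-value falls below a pre-registered level that controls the Type I error rate. The particular test, score, and threshold here are illustrative, not a prescribed protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.80, scale=0.05, size=500)        # validation-time scores
deployed  = rng.normal(loc=0.74, scale=0.05, size=200)        # recent production scores

stat, p_value = stats.ks_2samp(reference, deployed)            # two-sample KS test
alpha = 0.01                                                   # pre-registered Type I error level
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift alarm={p_value < alpha}")
```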
[593] “Faithful to What?” On the Limits of Fidelity-Based Explanations
Jackson Eshbaugh
Main category: cs.LG
TL;DR: The paper introduces a linearity score λ(f) to measure how linearly decodable a neural network’s behavior is, showing that high-fidelity surrogate models can fail to capture the actual predictive gains of neural networks over simpler models.
Details
Motivation: Current surrogate model evaluation in explainable AI focuses on fidelity to neural network predictions, but this measures alignment to the learned model rather than alignment to the underlying data-generating signal. There's a need to understand whether high-fidelity surrogates actually capture the predictive advantages that make neural networks superior to simpler models.Method: Introduces linearity score λ(f) defined as an R² measure of surrogate fit to the network. Evaluates across synthetic and real-world regression datasets, comparing surrogate performance against both neural networks and simpler linear baselines trained directly on the data.
Result: Surrogates can achieve high fidelity to neural networks while failing to recover the predictive gains that distinguish networks from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data.
Conclusion: Explaining a model’s behavior is not equivalent to explaining the task-relevant structure of the data. Fidelity-based explanations have limitations when used to reason about predictive performance, highlighting the need for better diagnostic tools like the linearity score.
Abstract: In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network’s predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score $λ(f)$, a diagnostic that quantifies the extent to which a regression network’s input–output behavior is linearly decodable. $λ(f)$ is defined as an $R^2$ measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model’s behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
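A minimal sketch of the diagnostic as described: fit a linear surrogate to the network's outputs and report the R² of that fit, which measures fidelity to the model rather than to the data-generating signal. The toy "network" below is an invented nonlinear function used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def linearity_score(network_fn, X):
    """lambda(f): R^2 of a linear surrogate fit to the network's outputs."""
    y_net = network_fn(X)
    surrogate = LinearRegression().fit(X, y_net)
    return r2_score(y_net, surrogate.predict(X))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
f = lambda X: X @ np.array([1.0, -0.5, 0.2, 0.0, 0.3]) + 0.3 * np.sin(3 * X[:, 0])
print("lambda(f) = %.3f" % linearity_score(f, X))
```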
[594] Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints
Evan Chen, Wenzhi Fang, Shiqiang Wang, Christopher Brinton
Main category: cs.LG
TL;DR: DA-GRPO is a reinforcement learning method for small language models to intelligently decide when to offload tasks to cloud LLMs during continual learning, balancing task performance with cloud usage constraints.
Details
Motivation: Small language models deployed locally need to handle diverse tasks under memory/computation constraints, requiring selective cloud assistance. Current approaches using naive reward-based RL lead to unstable offloading behavior and exacerbate catastrophic forgetting when task distributions change.Method: DA-GRPO (Dual-Advantage Group Relative Policy Optimization) extends GRPO by incorporating cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. It enables joint learning of task competence and collaboration behavior.
Result: Experiments on mathematical reasoning and code generation benchmarks show DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.
Conclusion: DA-GRPO provides an effective framework for small language models to intelligently manage cloud assistance during continual learning, achieving better performance while respecting usage constraints and mitigating catastrophic forgetting.
Abstract: Locally deployed Small Language Models (SLMs) must continually support diverse tasks under strict memory and computation constraints, making selective reliance on cloud Large Language Models (LLMs) unavoidable. Regulating cloud assistance during continual learning is challenging, as naive reward-based reinforcement learning often yields unstable offloading behavior and exacerbates catastrophic forgetting as task distributions shift. We propose DA-GRPO, a dual-advantage extension of Group Relative Policy Optimization that incorporates cloud-usage constraints directly into advantage computation, avoiding fixed reward shaping and external routing models. This design enables the local model to jointly learn task competence and collaboration behavior, allowing cloud requests to emerge naturally during post-training while respecting a prescribed assistance budget. Experiments on mathematical reasoning and code generation benchmarks show that DA-GRPO improves post-switch accuracy, substantially reduces forgetting, and maintains stable cloud usage compared to prior collaborative and routing-based approaches.
[595] Thompson Sampling-Based Learning and Control for Unknown Dynamic Systems
Kaikai Zheng, Dawei Shi, Yang Shi, Long Wang
Main category: cs.LG
TL;DR: A kernel-based Thompson sampling framework for active learning control that treats control laws as elements in function spaces, enabling data-driven controller design without structural restrictions on systems or controllers.
Details
Motivation: Thompson sampling is effective for active learning-based controller design but relies on finite parametric representations, limiting its applicability to general spaces encountered in control system design. The authors aim to extend TS to more general function spaces.Method: Proposes parameterization of control laws using reproducing kernel Hilbert spaces (RKHS), treating control laws as elements in function spaces. Designs a data-driven active learning control approach with a TS framework for online exploration-exploitation tradeoff. Provides convergence guarantees and stability analysis.
Result: Theoretical analysis shows exponential learning rate for relationship between control laws and closed-loop performance. Derives upper bound for control regret. Numerical experiments on controlling unknown nonlinear systems validate effectiveness.
Conclusion: The proposed kernel-based TS framework enables flexible control law learning in function spaces, overcoming limitations of finite parametric representations while maintaining theoretical guarantees for convergence and stability.
Abstract: Thompson sampling (TS) is a Bayesian randomized exploration strategy that samples options (e.g., system parameters or control laws) from the current posterior and then applies the selected option that is optimal for a task, thereby balancing exploration and exploitation; this makes TS effective for active learning-based controller design. However, TS relies on finite parametric representations, which limits its applicability to more general spaces, which are more commonly encountered in control system design. To address this issue, this work proposes a parameterization method for control law learning using reproducing kernel Hilbert spaces and designs a data-driven active learning control approach. Specifically, the proposed method treats the control law as an element in a function space, allowing the design of control laws without imposing restrictions on the system structure or the form of the controller. A TS framework is proposed in this work to reduce control costs through online exploration and exploitation, and the convergence guarantees are further provided for the learning process. Theoretical analysis shows that the proposed method learns the relationship between control laws and closed-loop performance metrics at an exponential rate, and the upper bound of control regret is also derived. Furthermore, the closed-loop stability of the proposed learning framework is analyzed. Numerical experiments on controlling unknown nonlinear systems validate the effectiveness of the proposed method.
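A heavily reduced sketch of Thompson sampling over control laws: here the function space is collapsed to a grid of scalar feedback gains, each with an independent Gaussian posterior over its closed-loop cost. The dynamics, priors, and noise levels are assumptions for illustration; the RKHS parameterization and the paper's stability and regret analysis are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 1.2, 1.0                                    # true (unknown to the learner) dynamics x' = a x + b u + noise
gains = np.linspace(0.0, 2.0, 21)                  # candidate control laws u = -k x
mean, var = np.full_like(gains, 50.0), np.full_like(gains, 50.0 ** 2)
obs_var = 25.0                                     # assumed observation-noise variance

def closed_loop_cost(k, horizon=30):
    x, cost = 1.0, 0.0
    for _ in range(horizon):
        u = -k * x
        cost += x ** 2 + 0.1 * u ** 2
        x = a * x + b * u + 0.05 * rng.standard_normal()
    return cost

for t in range(200):
    sampled = rng.normal(mean, np.sqrt(var))       # one Thompson sample per candidate
    i = int(np.argmin(sampled))                    # act greedily with respect to the sample
    c = closed_loop_cost(gains[i])
    prec = 1.0 / var[i] + 1.0 / obs_var            # conjugate Gaussian posterior update
    mean[i] = (mean[i] / var[i] + c / obs_var) / prec
    var[i] = 1.0 / prec

print("selected gain:", gains[int(np.argmin(mean))])
```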
[596] Should Bias be Eliminated? A General Framework to Use Bias for OOD Generalization
Yan Li, Yunlong Deng, Zijian Li, Anpeng Wu, Zeyu Tang, Kun Zhang, Guangyi Chen
Main category: cs.LG
TL;DR: A framework that leverages bias for better out-of-distribution generalization by identifying bias factors through generative modeling and combining bias-aware predictors with invariant predictors.
Details
Motivation: Challenges the common practice of eliminating bias in OOD generalization, questioning whether bias should always be discarded and exploring how to leverage it effectively when it can contribute positively to generalization.Method: Uses a generative model to capture data generation process and identify bias factors, constructs bias-aware predictors, estimates environment states to train domain-specific experts, and combines them with a general invariant predictor that guides adaptation under label shift.
Result: Outperforms invariance-only baselines, recent bias utilization approaches, and advanced baselines on synthetic data and standard domain generalization benchmarks, demonstrating improved robustness and adaptability.
Conclusion: Bias can be effectively leveraged for OOD generalization through proper identification and combination with invariant features, challenging the conventional wisdom of eliminating all bias.
Abstract: Most approaches to out-of-distribution (OOD) generalization learn domain-invariant representations by discarding contextual bias. In this paper, we raise a critical question: Should bias be eliminated? If not, is there a general way to leverage bias for better OOD generalization? To answer these questions, we first provide a theoretical analysis that characterizes the circumstances in which biased features contribute positively. Although theoretical results show that bias may sometimes play a positive role, leveraging it effectively is non-trivial, since its harmful and beneficial components are often entangled. Recent advances have sought to refine the prediction of bias by presuming reliable predictions from invariant features. However, such assumptions may be too strong in the real world, especially when the target also shifts from training to testing domains. Motivated by this challenge, we introduce a framework to leverage bias in a more general scenario. Specifically, we employ a generative model to capture the data generation process and identify the underlying bias factors, which are then used to construct a bias-aware predictor. Since the bias-aware predictor may shift across environments, we first estimate the environment state to train predictors under different environments, combining them as a mixture of domain experts for the final prediction. Then, we build a general invariant predictor, which can be invariant under label shift to guide the adaptation of the bias-aware predictor. Evaluations on synthetic data and standard domain generalization benchmarks demonstrate that our method consistently outperforms invariance-only baselines, recent bias-utilization approaches, and advanced baselines, yielding improved robustness and adaptability.
[597] Symplectic convolutional neural networks
Süleyman Yıldız, Konrad Janik, Peter Benner
Main category: cs.LG
TL;DR: A symplectic convolutional neural network architecture that preserves symplectic structure in convolution layers and includes symplectic pooling for autoencoders, tested on PDEs like wave, nonlinear Schrödinger, and sine-Gordon equations.
Details
Motivation: To develop neural networks that preserve symplectic structure for solving Hamiltonian systems, which is important for long-term stability and accuracy in simulating physical systems governed by Hamiltonian dynamics.
Method: Combines symplectic neural networks, proper symplectic decomposition, and tensor techniques to create symplectic convolution layers. Introduces a mathematically equivalent form of convolution layers parameterized to remain symplectic, and adds symplectic pooling layers to construct complete autoencoders.
Result: The symplectic CNN outperforms linear symplectic autoencoders obtained via proper symplectic decomposition on three PDE examples: wave equation, nonlinear Schrödinger equation, and sine-Gordon equation.
Conclusion: The proposed symplectic CNN architecture successfully preserves symplectic structure while maintaining the expressive power of convolutional networks, making it suitable for Hamiltonian system simulations with better performance than linear alternatives.
Abstract: We propose a new symplectic convolutional neural network (CNN) architecture by leveraging symplectic neural networks, proper symplectic decomposition, and tensor techniques. Specifically, we first introduce a mathematically equivalent form of the convolution layer and then, using symplectic neural networks, we demonstrate a way to parameterize the layers of the CNN to ensure that the convolution layer remains symplectic. To construct a complete autoencoder, we introduce a symplectic pooling layer. We demonstrate the performance of the proposed neural network on three examples: the wave equation, the nonlinear Schrödinger (NLS) equation, and the sine-Gordon equation. The numerical results indicate that the symplectic CNN outperforms the linear symplectic autoencoder obtained via proper symplectic decomposition.
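The abstract does not spell out the layer parameterization, but a standard building block in symplectic networks is a triangular update with a symmetric parameter matrix, which is symplectic by construction. The sketch below is a generic illustration rather than the paper's convolutional construction: it builds such a block and verifies the defining identity M^T J M = J numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                    # half-dimension: state z = (p, q)

# Symmetric parameter matrix S makes the block [[I, S], [0, I]] symplectic.
A = rng.standard_normal((n, n))
S = 0.5 * (A + A.T)

M = np.block([[np.eye(n), S],
              [np.zeros((n, n)), np.eye(n)]])

# Canonical symplectic form J.
J = np.block([[np.zeros((n, n)), np.eye(n)],
              [-np.eye(n), np.zeros((n, n))]])

# A linear map M is symplectic iff M^T J M = J.
print(np.allclose(M.T @ J @ M, J))       # True

def layer(z):
    """One 'up' symplectic unit acting on z = (p, q): p <- p + S q, q <- q."""
    p, q = z[:n], z[n:]
    return np.concatenate([p + S @ q, q])
```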
[598] A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms
Gil Goldshlager, Jiang Hu, Lin Lin
Main category: cs.LG
TL;DR: SNG analyzed as sketch-and-project method with new theoretical proxy based on squared volume sampling, providing convergence guarantees for single mini-batch settings and insights into spectral decay exploitation.
Details
Motivation: Standard analyses of subsampled natural gradient descent (SNG) fail to provide insight into realistic small-sample settings, motivating a new theoretical approach.
Method: Analyze SNG as sketch-and-project method, replace standard theoretical proxy with new proxy based on squared volume sampling, and extend to structured momentum scheme (SPRING).
Result: Show that expectation of SNG direction equals preconditioned gradient descent step even with coupling, providing global convergence guarantees for single mini-batch and characterizing convergence rate.
Conclusion: New analysis reveals SNG’s advantage over SGD lies in better exploiting spectral decay in model Jacobian, and SPRING momentum scheme naturally arises from accelerated sketch-and-project methods.
Abstract: Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
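For readers unfamiliar with the sketch-and-project viewpoint, the toy below runs a row-subsampled sketch-and-project iteration on a linear system: at each step, the iterate is projected onto the solution set of a sketched (mini-batch) subsystem. It shows the projection structure the analysis builds on, not the SNG algorithm or the squared-volume-sampling proxy itself.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 50
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                                  # consistent system A x = b

x = np.zeros(n)
batch = 10
for _ in range(2000):
    # Sketch: sample a mini-batch of rows (a row-subsampling sketch matrix).
    idx = rng.choice(m, size=batch, replace=False)
    As, bs = A[idx], b[idx]
    # Project: move x onto the solution set of the sketched equations.
    r = As @ x - bs
    x = x - As.T @ np.linalg.lstsq(As @ As.T, r, rcond=None)[0]

print("residual:", np.linalg.norm(A @ x - b))
```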
[599] Flexible inference for animal learning rules using neural networks
Yuhan Helena Liu, Victor Geadah, Jonathan Pillow
Main category: cs.LG
TL;DR: A framework using deep neural networks to infer flexible, data-driven learning rules from behavioral data during animal task learning, outperforming traditional reinforcement learning approaches.
Details
Motivation: Existing approaches assume fixed parametric forms for learning rules (like Q-learning or policy gradient) which may not accurately describe complex animal learning in realistic settings. There's a need for methods that can infer learning rules directly from behavioral data.
Method: Animals are assumed to follow a decision policy parameterized by a generalized linear model (GLM), while their learning rule (mapping from task covariates to per-trial weight updates) is modeled using a deep neural network (DNN). A recurrent neural network (RNN) variant is also introduced to capture more complex learning dynamics that integrate information over multiple trials.
Result: Simulations show the framework can recover ground-truth learning rules. When applied to mouse behavioral data from a sensory decision-making task, both DNN and RNN-based methods outperformed traditional RL learning rules at predicting learning trajectories of held-out mice. Inferred learning rules showed reward-history-dependent dynamics with larger updates following sequences of rewarded trials.
Conclusion: The methods provide a flexible framework for inferring learning rules from behavioral data in de novo learning tasks, enabling improved animal training protocols and development of behavioral digital twins.
Abstract: Understanding how animals learn is a central challenge in neuroscience, with growing relevance to the development of animal- or human-aligned artificial intelligence. However, existing approaches tend to assume fixed parametric forms for the learning rule (e.g., Q-learning, policy gradient), which may not accurately describe the complex forms of learning employed by animals in realistic settings. Here we address this gap by developing a framework to infer learning rules directly from behavioral data collected during de novo task learning. We assume that animals follow a decision policy parameterized by a generalized linear model (GLM), and we model their learning rule – the mapping from task covariates to per-trial weight updates – using a deep neural network (DNN). This formulation allows flexible, data-driven inference of learning rules while maintaining an interpretable form of the decision policy itself. To capture more complex learning dynamics, we introduce a recurrent neural network (RNN) variant that relaxes the Markovian assumption that learning depends solely on covariates of the current trial, allowing for learning rules that integrate information over multiple trials. Simulations demonstrate that the framework can recover ground-truth learning rules. We applied our DNN and RNN-based methods to a large behavioral dataset from mice learning to perform a sensory decision-making task and found that they outperformed traditional RL learning rules at predicting the learning trajectories of held-out mice. The inferred learning rules exhibited reward-history-dependent learning dynamics, with larger updates following sequences of rewarded trials. Overall, these methods provide a flexible framework for inferring learning rules from behavioral data in de novo learning tasks, setting the stage for improved animal training protocols and the development of behavioral digital twins.
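A minimal sketch of the modeling setup described above, with made-up covariates, network sizes, and reward rule, and with the fitting of the learning-rule network omitted: the decision policy is a logistic GLM with per-trial weights, and a small network maps the current trial's covariates to the weight update.

```python
import torch
import torch.nn as nn

n_features = 3                                   # toy per-trial task covariates

def policy_prob(w, x):
    """Interpretable decision policy: logistic GLM with per-trial weights w."""
    return torch.sigmoid(w @ x)

class LearningRuleNet(nn.Module):
    """Flexible learning rule: maps trial covariates (stimulus, current weights,
    choice, reward) to a per-trial weight update."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_features + 2, 32), nn.Tanh(),
            nn.Linear(32, n_features),
        )

    def forward(self, x, w, choice, reward):
        feats = torch.cat([x, w, choice.view(1), reward.view(1)])
        return self.net(feats)

rule = LearningRuleNet()                         # would be fit to behavioral data
w = torch.zeros(n_features)
with torch.no_grad():                            # simulate one session
    for trial in range(100):
        x = torch.randn(n_features)
        choice = torch.bernoulli(policy_prob(w, x))
        reward = (choice == (x.sum() > 0).float()).float()   # toy reward rule
        w = w + rule(x, w, choice, reward)       # trial-by-trial weight update
print(w)
```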
[600] On optimal solutions of classical and sliced Wasserstein GANs with non-Gaussian data
Yu-Jui Huang, Hsin-Hua Shen, Yu-Chih Huang, Wan-Yi Lin, Shih-Chun Lin
Main category: cs.LG
TL;DR: The paper analyzes optimal parameter selection for Wasserstein GANs beyond linear-quadratic-Gaussian settings, deriving closed-form solutions for 1D non-Gaussian data and proposing sliced WGAN variants with asymptotic optimality proofs.
Details
Motivation: Current GAN parameter selection methods lack theoretical guarantees, especially for Wasserstein GANs beyond simple LQG settings. Existing optimal parameter results are limited to linear generators with Gaussian data, which doesn't reflect real-world applications.
Method: 1) Derive closed-form optimal parameters for 1D WGANs with non-linear activation functions and non-Gaussian data. 2) Use sliced Wasserstein framework for high-dimensional data, showing linear generators can be asymptotically optimal. 3) Propose new unprojected sliced WGAN variant and prove its asymptotic optimality.
Result: Theoretical proofs show asymptotic optimality for sliced WGAN variants. Empirical results demonstrate that sliced WGAN generators achieve better performance than r-PCA with only linear complexity (vs cubic complexity for r-PCA).
Conclusion: The paper provides theoretical foundations for optimal parameter selection in WGANs beyond LQG settings, offering practical sliced WGAN variants with proven optimality and computational efficiency advantages over traditional methods.
Abstract: The generative adversarial network (GAN) aims to approximate an unknown distribution via a parameterized neural network (NN). While GANs have been widely applied in reinforcement and semi-supervised learning as well as computer vision tasks, selecting their parameters often needs an exhaustive search, and only a few selection methods have been proven to be theoretically optimal. One of the most promising GAN variants is the Wasserstein GAN (WGAN). Prior work on optimal parameters for population WGAN is limited to the linear-quadratic-Gaussian (LQG) setting, where the generator NN is linear, and the data is Gaussian. In this paper, we focus on the characterization of optimal solutions of population WGAN beyond the LQG setting. As a basic result, closed-form optimal parameters for one-dimensional WGAN are derived when the NN has non-linear activation functions, and the data is non-Gaussian. For high-dimensional data, we adopt the sliced Wasserstein framework and show that the linear generator can be asymptotically optimal. Moreover, the original sliced WGAN only constrains the projected data marginal instead of the whole one in classical WGAN, and thus, we propose another new unprojected sliced WGAN and identify its asymptotic optimality. Empirical studies show that compared to the celebrated r-principal component analysis (r-PCA) solution, which has cubic complexity to the data dimension, our generator for sliced WGAN can achieve better performance with only linear complexity.
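For reference, the sliced Wasserstein distance underlying this line of work is straightforward to estimate: project both samples onto random directions and average the resulting one-dimensional Wasserstein distances, which reduce to comparing sorted projections. The sketch below assumes equal sample sizes and is not tied to the paper's generator or analysis.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance between two
    empirical distributions with the same number of samples."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_proj):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)           # random direction on the sphere
        # 1D Wasserstein distance = distance between sorted projections.
        xp, yp = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean(np.abs(xp - yp) ** p)
    return (total / n_proj) ** (1 / p)

X = np.random.default_rng(1).standard_normal((500, 10))
Y = np.random.default_rng(2).standard_normal((500, 10)) + 0.5
print(sliced_wasserstein(X, Y))
```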
[601] Reversible Deep Learning for 13C NMR in Chemoinformatics: On Structures and Spectra
Stefan Kuhn, Vandana Dwarka, Przemyslaw Karol Grenda, Eero Vainikko
Main category: cs.LG
TL;DR: A reversible deep learning model for 13C NMR spectroscopy that uses a single conditional invertible neural network to map between molecular structures and spectra in both directions.
Details
Motivation: To create a unified model that can handle both spectrum prediction from molecular structures and structure generation from spectra, addressing the one-to-many nature of spectrum-to-structure inference while maintaining uncertainty awareness.
Method: Uses a conditional invertible neural network built from i-RevNet style bijective blocks, trained to predict 128-bit binned spectrum codes from graph-based structure encodings, with remaining latent dimensions capturing residual variability.
Result: The model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra.
Conclusion: Invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model for NMR spectroscopy applications.
Abstract: We introduce a reversible deep learning model for 13C NMR that uses a single conditional invertible neural network for both directions between molecular structures and spectra. The network is built from i-RevNet style bijective blocks, so the forward map and its inverse are available by construction. We train the model to predict a 128-bit binned spectrum code from a graph-based structure encoding, while the remaining latent dimensions capture residual variability. At inference time, we invert the same trained network to generate structure candidates from a spectrum code, which explicitly represents the one-to-many nature of spectrum-to-structure inference. On a filtered subset, the model is numerically invertible on trained examples, achieves spectrum-code prediction above chance, and produces coarse but meaningful structural signals when inverted on validation spectra. These results demonstrate that invertible architectures can unify spectrum prediction and uncertainty-aware candidate generation within one end-to-end model.
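i-RevNet style networks are assembled from bijective blocks whose inverse is available in closed form. The sketch below shows a generic additive coupling block of that flavor, with hidden width and dimensions chosen arbitrarily; it illustrates the invertibility mechanism, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AdditiveCouplingBlock(nn.Module):
    """Bijective block in the spirit of i-RevNet: split, transform, swap."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                               nn.Linear(64, dim // 2))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat([x2, x1 + self.f(x2)], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        return torch.cat([y2 - self.f(y1), y1], dim=-1)

block = AdditiveCouplingBlock(dim=16)
x = torch.randn(4, 16)
print(torch.allclose(block.inverse(block(x)), x, atol=1e-6))   # True
```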
[602] Multi-Agent Inverted Transformer for Flight Trajectory Prediction
Seokbin Yoon, Keumjin Lee
Main category: cs.LG
TL;DR: MAIFormer is a novel neural architecture using masked multivariate attention and agent attention modules to predict multi-agent flight trajectories with improved accuracy and interpretability.
Details
Motivation: Predicting multi-agent flight trajectories is challenging due to difficulties in modeling individual aircraft behaviors over time, complex interactions between flights, and generating explainable prediction outcomes for practical air traffic control applications.
Method: Proposes Multi-Agent Inverted Transformer (MAIFormer) with two key attention modules: (1) masked multivariate attention to capture spatio-temporal patterns of individual aircraft, and (2) agent attention to model social patterns among multiple agents in complex air traffic scenes.
Result: MAIFormer achieves best performance across multiple metrics on real-world ADS-B flight trajectory data from Incheon International Airport, outperforming other methods while producing interpretable prediction outcomes from a human perspective.
Conclusion: MAIFormer improves both prediction accuracy and transparency for multi-agent flight trajectory prediction, enhancing practical utility in air traffic control through explainable AI approaches.
Abstract: Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi-agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi-Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi-agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio-temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real-world automatic dependent surveillance-broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.
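A rough sketch of what attention over the agent axis can look like; the tensor shapes and the use of standard PyTorch attention modules are illustrative assumptions, not MAIFormer's actual masked multivariate and agent attention implementations.

```python
import torch
import torch.nn as nn

batch, n_agents, seq_len, d_model = 8, 6, 20, 64   # illustrative sizes

# Per-aircraft temporal encoding: fold agents into the batch dimension.
temporal = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
x = torch.randn(batch, n_agents, seq_len, d_model)
h = temporal(x.reshape(batch * n_agents, seq_len, d_model))
h = h.reshape(batch, n_agents, seq_len, d_model)

# Agent attention: attend across agents at each time step to model interactions.
agent_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
h_t = h.permute(0, 2, 1, 3).reshape(batch * seq_len, n_agents, d_model)
social, _ = agent_attn(h_t, h_t, h_t)
social = social.reshape(batch, seq_len, n_agents, d_model).permute(0, 2, 1, 3)
print(social.shape)                                 # (batch, agents, time, d_model)
```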
[603] Optimal Robust Recourse with $L^p$-Bounded Model Change
Phone Kyaw, Kshitij Kayastha, Shahin Jabbari
Main category: cs.LG
TL;DR: Optimal robust recourse algorithm for generalized linear models using L^p norm constraints, achieving lower cost and better trade-offs than prior L^∞ approaches.
Details
Motivation: Current robust recourse methods lack theoretical optimality guarantees and use L^∞ norm constraints that lead to high-cost recourse solutions. There's a need for provably optimal algorithms with more practical norm constraints.
Method: Develops a new algorithm that provably computes optimal robust recourse for generalized linear models using L^p norm constraints (p≥1, p≠∞) instead of L^∞ norm. The approach handles more constrained model changes and provides theoretical optimality guarantees.
Result: Achieves significantly lower price of recourse (up to several orders of magnitude) compared to prior work, better trade-off between implementation cost and validity, more sparse recourses, and remains resilient to post-processing approaches.
Conclusion: The proposed L^p norm-based robust recourse algorithm provides provably optimal solutions with practical advantages over existing methods, making recourse recommendations more cost-effective and resilient to model changes.
Abstract: Recourse provides individuals who received undesirable labels (e.g., denied a loan) from algorithmic decision-making systems with a minimum-cost improvement suggestion to achieve the desired outcome. However, in practice, models often get updated to reflect changes in the data distribution or environment, invalidating the recourse recommendations (i.e., following the recourse will not lead to the desirable outcome). The robust recourse literature addresses this issue by providing a framework for computing recourses whose validity is resilient to slight changes in the model. However, since the optimization problem of computing robust recourse is non-convex (even for linear models), most of the current approaches do not have any theoretical guarantee on the optimality of the recourse. Recent work by Kayastha et al. provides the first provably optimal algorithm for robust recourse with respect to generalized linear models when the model changes are measured using the $L^{\infty}$ norm. However, using the $L^{\infty}$ norm can lead to recourse solutions with a high price. To address this shortcoming, we consider more constrained model changes defined by the $L^p$ norm, where $p\geq 1$ but $p\neq \infty$, and provide a new algorithm that provably computes the optimal robust recourse for generalized linear models. Empirically, for both linear and non-linear models, we demonstrate that our algorithm achieves a significantly lower price of recourse (up to several orders of magnitude) compared to prior work and also exhibits a better trade-off between the implementation cost of recourse and its validity. Our empirical analysis also illustrates that our approach provides more sparse recourses compared to prior work and remains resilient to post-processing approaches that guarantee feasibility.
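The paper's algorithm is not reproduced here, but the basic robustness computation for a linear score under an L^p-bounded weight change follows from Hoelder's inequality: the worst-case drop in the score is the budget eps times the dual norm of the input. The hypothetical check below uses that fact to test whether a candidate recourse point stays valid; the model, point, and budget are made up.

```python
import numpy as np

def dual_norm(x, p):
    """||x||_q with 1/p + 1/q = 1 (q = inf when p = 1)."""
    if p == 1:
        return np.max(np.abs(x))
    q = p / (p - 1)
    return np.sum(np.abs(x) ** q) ** (1 / q)

def worst_case_score(w, b, x, eps, p):
    """Smallest score of a linear model over all weight changes ||dw||_p <= eps;
    by Hoelder's inequality the largest possible drop is eps * ||x||_q."""
    return w @ x + b - eps * dual_norm(x, p)

w, b = np.array([1.0, -2.0, 0.5]), 0.1
x = np.array([0.8, -0.3, 1.2])                   # candidate recourse point
for p in (1, 2, 10):
    print(p, worst_case_score(w, b, x, eps=0.2, p=p) >= 0)
```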
[604] Sharpness-Aware Minimization Can Hallucinate Minimizers
Chanwoong Park, Uijeong Jang, Ernest K. Ryu, Insoon Yang
Main category: cs.LG
TL;DR: SAM can get stuck at “hallucinated minimizers” - points where the perturbed gradient vanishes but the original gradient is nonzero, causing training to stall despite not being actual stationary points.
Details
Motivation: To understand why SAM (Sharpness-Aware Minimization) sometimes fails to converge properly, particularly when using large perturbation distances (ρ), and to identify the specific failure mode where SAM gets stuck at non-stationary points.
Method: Theoretical analysis showing that under certain nonconvex landscape conditions (presence of local minimizer and maximizer), SAM can converge to “hallucinated minimizers” where the perturbed-point gradient vanishes despite nonzero original gradient. Experimental validation on neural networks and proposed mitigation via SGD warm-start.
Result: Identified a new failure mode of SAM where it stalls at hallucinated minimizers, which explains performance degradation at large ρ. Showed that a short initial SGD warm-start before enabling SAM mitigates this issue and reduces sensitivity to ρ choice.
Conclusion: SAM has a previously unrecognized failure mode where it can converge to non-stationary points called hallucinated minimizers, especially with large ρ. This explains observed performance degradation, and the issue can be mitigated with SGD warm-start initialization.
Abstract: Sharpness-Aware Minimization (SAM) is widely used to seek flatter minima – often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance $ρ$ along the normalized gradient direction. We show that, for some choices of $ρ$, SAM can stall at points where this shifted (perturbed-point) gradient vanishes despite a nonzero original gradient, and therefore, they are not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with SAM’s performance degradation often seen at large $ρ$. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of $ρ$.
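The failure mode is easy to state on a toy one-dimensional loss: at a hallucinated minimizer, the gradient SAM evaluates at the perturbed point vanishes even though the gradient at the iterate itself does not. The sketch below only checks this defining condition at one such point for f(x) = sin(x) with rho = 1; it does not reproduce the paper's convergence analysis.

```python
import numpy as np

f, df = np.sin, np.cos        # toy 1-D loss with a local min and a local max
rho = 1.0

def sam_gradient(x):
    """Gradient SAM actually uses: evaluated at the perturbed point."""
    g = df(x)
    return df(x + rho * g / abs(g))              # 1-D normalization = sign

# A 'hallucinated minimizer': the perturbed point lands on a stationary point
# of f, so the SAM update vanishes although x itself is not stationary.
x_h = np.pi / 2 - rho
print("original gradient       :", df(x_h))      # ~0.84, nonzero
print("SAM (perturbed) gradient:", sam_gradient(x_h))   # ~0, SAM stalls here
```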
[605] Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access
Daniel Ebi, Gaspard Lambrechts, Damien Ernst, Klemens Böhm
Main category: cs.LG
TL;DR: Informed asymmetric actor-critic framework allows critics to use arbitrary privileged signals instead of full state, with theoretical guarantees and practical criteria for selecting optimal privileged information.
Details
Motivation: Traditional asymmetric actor-critic methods assume full state observability for critics during training, which is unrealistic in practice. There's a need for a framework that allows critics to use more realistic privileged signals without requiring access to the full state.
Method: Proposes informed asymmetric actor-critic framework where critics can be conditioned on arbitrary state-dependent privileged signals. Introduces two informativeness criteria: 1) dependence-based test applicable before training, and 2) value prediction accuracy improvement criterion applicable post-hoc.
Result: Empirical results on partially observable benchmark tasks and synthetic environments show that carefully selected privileged signals can match or outperform full-state asymmetric baselines while using strictly less state information.
Conclusion: The framework substantially expands admissible privileged information in actor-critic methods, provides theoretical guarantees for unbiased policy gradients, and offers practical tools for selecting optimal privileged signals that can outperform full-state baselines with less information.
Abstract: Asymmetric actor-critic methods are widely used in partially observable reinforcement learning, but typically assume full state observability to condition the critic during training, which is often unrealistic in practice. We introduce the informed asymmetric actor-critic framework, allowing the critic to be conditioned on arbitrary state-dependent privileged signals without requiring access to the full state. We show that any such privileged signal yields unbiased policy gradient estimates, substantially expanding the set of admissible privileged information. This raises the problem of selecting the most adequate privileged information in order to improve learning. For this purpose, we propose two novel informativeness criteria: a dependence-based test that can be applied prior to training, and a criterion based on improvements in value prediction accuracy that can be applied post-hoc. Empirical results on partially observable benchmark tasks and synthetic environments demonstrate that carefully selected privileged signals can match or outperform full-state asymmetric baselines while relying on strictly less state information.
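Structurally, the idea reduces to giving the critic an extra, training-time-only input. The sketch below uses arbitrary network sizes and a random stand-in for the privileged signal; it shows the asymmetry, not the paper's informativeness criteria.

```python
import torch
import torch.nn as nn

obs_dim, priv_dim, n_actions = 8, 3, 4            # illustrative sizes

# Actor only sees the observation (what is available at deployment).
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                      nn.Linear(64, n_actions))

# Critic is additionally conditioned on a privileged, state-dependent signal
# available only during training, not necessarily the full state.
critic = nn.Sequential(nn.Linear(obs_dim + priv_dim, 64), nn.Tanh(),
                       nn.Linear(64, 1))

obs = torch.randn(32, obs_dim)
priv = torch.randn(32, priv_dim)                  # e.g. part of the simulator state
logits = actor(obs)
values = critic(torch.cat([obs, priv], dim=-1))
print(logits.shape, values.shape)
```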
[606] Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Yicheng Lang, Yihua Zhang, Chongyu Fan, Changsheng Wang, Jinghan Jia, Sijia Liu
Main category: cs.LG
TL;DR: Investigates optimizer choice’s impact on LLM unlearning robustness, finding that downgrading from first-order to zeroth-order methods improves resistance to post-unlearning manipulations while maintaining unlearning quality.
Details
Motivation: LLM unlearning aims to remove undesired data/knowledge while preserving utility, but current methods are fragile to post-unlearning manipulations like quantization or fine-tuning. Prior work focused on reformulating objectives, but this paper investigates the optimizer's role in shaping unlearning robustness.
Method: Analyzes optimizer ‘grade’ (zeroth-order, first-order, second-order) impact on unlearning robustness. Shows that downgrading optimizers (e.g., using zeroth-order methods or compressed-gradient variants) leads to stronger robustness. Proposes a hybrid optimizer combining first-order and zeroth-order updates.
Result: Extensive experiments on MUSE and WMDP benchmarks across multiple LLM unlearning algorithms validate that the approach achieves more resilient forgetting without sacrificing unlearning quality. Zeroth-order methods converge to harder-to-disturb basins in loss landscape.
Conclusion: Optimizer choice significantly impacts unlearning robustness independent of objectives. Downgrading optimizer grade improves resistance to post-training perturbations while maintaining efficacy. Hybrid optimizer offers practical solution for robust unlearning.
Abstract: Large language model (LLM) unlearning aims to surgically remove the influence of undesired data or knowledge from an existing model while preserving its utility on unrelated tasks. This paradigm has shown promise in addressing privacy and safety concerns. However, recent findings reveal that unlearning effects are often fragile: post-unlearning manipulations such as weight quantization or fine-tuning can quickly neutralize the intended forgetting. Prior efforts to improve robustness primarily reformulate unlearning objectives by explicitly assuming the role of vulnerability sources. In this work, we take a different perspective by investigating the role of the optimizer, independent of unlearning objectives and formulations, in shaping unlearning robustness. We show that the ‘grade’ of the optimizer, defined by the level of information it exploits, ranging from zeroth-order (gradient-free) to first-order (gradient-based) to second-order (Hessian-based), is tightly linked to the resilience of unlearning. Surprisingly, we find that downgrading the optimizer, such as using zeroth-order methods or compressed-gradient variants (e.g., gradient sign-based optimizers), often leads to stronger robustness. While these optimizers produce noisier and less precise updates, they encourage convergence to harder-to-disturb basins in the loss landscape, thereby resisting post-training perturbations. By connecting zeroth-order methods with randomized smoothing, we further highlight their natural advantage for robust unlearning. Motivated by these insights, we propose a hybrid optimizer that combines first-order and zeroth-order updates, preserving unlearning efficacy while enhancing robustness. Extensive experiments on the MUSE and WMDP benchmarks, across multiple LLM unlearning algorithms, validate that our approach achieves more resilient forgetting without sacrificing unlearning quality.
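As a loose illustration of "downgrading" the optimizer, the sketch below mixes an exact gradient with a two-point zeroth-order estimate on a toy objective; the mixing weights and objective are made up and this is not the paper's hybrid optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):                                      # stand-in for an unlearning loss
    return np.sum((w - 1.0) ** 2)

def grad(w):                                      # exact first-order gradient
    return 2.0 * (w - 1.0)

def zo_grad(w, mu=1e-2):
    """Two-point zeroth-order estimate along a random direction."""
    u = rng.standard_normal(w.shape)
    return (loss(w + mu * u) - loss(w - mu * u)) / (2 * mu) * u

w = np.zeros(5)
for step in range(500):
    g_fo = grad(w)                                # precise first-order update
    g_zo = zo_grad(w)                             # noisier, 'downgraded' update
    w -= 0.01 * (0.5 * g_fo + 0.5 * g_zo)         # simple hybrid of the two
print(loss(w))
```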
[607] An Attention-based Feature Memory Design for Energy-Efficient Continual Learning
Yuandou Wang, Filip Gunnarsson, Rihan Hai
Main category: cs.LG
TL;DR: AttenMLP: An energy-efficient continual learning approach for tabular data streams using attention-based feature replay and sliding buffer updates
Details
Motivation: Tabular data streams are common in real-time decision-making on resource-constrained devices, but existing continual learning approaches for tabular data lack energy and memory efficiency considerations.
Method: Integrates attention-based feature replay with context retrieval and sliding buffer updates within a minibatch training framework for streaming tabular learning.
Result: Achieves comparable accuracy to strong baselines while reducing energy consumption up to 33.3% compared to TabPFNv2, with modest accuracy trade-offs (0.062 decrease under incremental drift, 0.038 decrease under abrupt drift).
Conclusion: AttenMLP demonstrates effective energy-accuracy trade-offs for continual learning on tabular data streams, making it suitable for resource-constrained edge devices.
Abstract: Tabular data streams are increasingly prevalent in real-time decision-making across healthcare, finance, and the Internet of Things, often generated and processed on resource-constrained edge and mobile devices. Continual learning (CL) enables models to learn sequentially from such streams while retaining previously acquired knowledge. While recent CL advances have made significant progress in mitigating catastrophic forgetting, the energy and memory efficiency of CL for tabular data streams remains largely unexplored. To address this gap, we propose AttenMLP, which integrates attention-based feature replay with context retrieval and sliding buffer updates within a minibatch training framework for streaming tabular learning. We evaluate AttenMLP against state-of-the-art (SOTA) tabular models on real-world concept drift benchmarks with temporal distribution shifts. Experimental results show that AttenMLP achieves accuracy comparable to strong baselines without replay, while substantially reducing energy consumption through tunable design choices. In particular, with the proposed attention-based feature memory design, AttenMLP incurs a 0.062 decrease in final accuracy on the incremental concept drift dataset, while reducing energy usage by up to 33.3% compared to TabPFNv2. On the abrupt concept drift dataset, AttenMLP reduces energy consumption by 1.47% compared to TabR, at the cost of a 0.038 decrease in final accuracy. Although ranking third in global efficiency, AttenMLP demonstrates energy-accuracy trade-offs across both abrupt and incremental concept drift scenarios compared to SOTA tabular models.
[608] Transmuting prompts into weights
Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo
Main category: cs.LG
TL;DR: Theoretical foundation for controlling LLM behavior via internal state modifications, showing how textual input can be converted into reusable weight updates through thought vectors/matrices.
Details
Motivation: Existing techniques for controlling LLM behavior through internal state modifications (vector additions to activations or weight matrix updates) lack theoretical grounding and rely on empirical heuristics. The paper aims to provide a principled theoretical foundation for these interventions.
Method: Builds on recent findings that prompt influence can be mathematically mapped to token-dependent implicit weight updates. Derives a method for condensing this information into token-independent thought vectors and thought matrices, providing a theoretical explanation for existing editing techniques.
Result: Develops a principled method for transmuting textual input into reusable weight updates, offering a computationally-grounded approach to model editing that explains existing empirical techniques.
Conclusion: Provides theoretical foundation for LLM control techniques, showing how textual input can be systematically converted into weight modifications through thought vectors/matrices, bridging empirical practices with mathematical principles.
Abstract: A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt’s influence can be mathematically mapped to token-dependent implicit weight updates (Dherin et. al, 2025), we derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector-and-matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
[609] Progressive multi-fidelity learning with neural networks for physical system predictions
Paolo Conti, Mengwu Guo, Attilio Frangi, Andrea Manzoni
Main category: cs.LG
TL;DR: Progressive multi-fidelity surrogate model that sequentially incorporates diverse data types using tailored encoders and neural networks for multi-fidelity regression.
Details
Motivation: High-fidelity data is expensive and time-consuming to acquire, while low-fidelity data is more accessible but less accurate. Practical situations involve diverse data types from different modalities that may not be concurrently available, complicating surrogate modeling.
Method: Progressive multi-fidelity model with tailored encoders for different data types, using neural networks for regression. Features dual connections: concatenations among encoded inputs and additive connections among final outputs, enabling additive corrections without altering previous levels.
Result: Demonstrated effectiveness on numerical benchmarks and real-world case study, showing reliable integration of multi-modal data, accurate predictions, and maintained performance when generalizing across time and parameter variations.
Conclusion: The approach successfully addresses challenges of multi-fidelity modeling with diverse data types, preventing performance degradation as new data is integrated and adapting predictions based on available inputs.
Abstract: Highly accurate datasets from numerical or physical experiments are often expensive and time-consuming to acquire, posing a significant challenge for applications that require precise evaluations, potentially across multiple scenarios and in real-time. Even building sufficiently accurate surrogate models can be extremely challenging with limited high-fidelity data. Conversely, less expensive, low-fidelity data can be computed more easily and encompass a broader range of scenarios. By leveraging multi-fidelity information, prediction capabilities of surrogates can be improved. However, in practical situations, data may be different in types, come from sources of different modalities, and not be concurrently available, further complicating the modeling process. To address these challenges, we introduce a progressive multi-fidelity surrogate model. This model can sequentially incorporate diverse data types using tailored encoders. Multi-fidelity regression from the encoded inputs to the target quantities of interest is then performed using neural networks. Input information progressively flows from lower to higher fidelity levels through two sets of connections: concatenations among all the encoded inputs, and additive connections among the final outputs. This dual connection system enables the model to exploit correlations among different datasets while ensuring that each level makes an additive correction to the previous level without altering it. This approach prevents performance degradation as new input data are integrated into the model and automatically adapts predictions based on the available inputs. We demonstrate the effectiveness of the approach on numerical benchmarks and a real-world case study, showing that it reliably integrates multi-modal data and provides accurate predictions, maintaining performance when generalizing across time and parameter variations.
[610] Hierarchical Time Series Forecasting with Robust Reconciliation
Shuhei Aikawa, Aru Suzuki, Kei Yoshitake, Kanata Teshigawara, Akira Iwabuchi, Ken Kobayashi, Kazuhide Nakata
Main category: cs.LG
TL;DR: Robust optimization framework for hierarchical time-series forecasting that accounts for uncertainty in covariance matrix estimation to improve forecast coherence and performance.
Details
Motivation: Existing hierarchical forecasting methods require estimating a covariance matrix for optimal reconciliation, but estimation errors degrade forecast performance. Need robust methods that account for covariance uncertainty.
Method: Proposes robust optimization framework with uncertainty set for estimated covariance matrix. Formulates reconciliation as minimizing worst-case average of weighted squared residuals, cast as semidefinite optimization problem.
Result: Numerical experiments show proposed robust reconciliation method achieves better forecast performance than existing hierarchical forecasting methods.
Conclusion: Integrating uncertainty into reconciliation process improves hierarchical forecasting performance, demonstrating effectiveness of robust optimization approach.
Abstract: This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation, using a covariance matrix of the forecast error. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrix may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case average of weighted squared residuals over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieved better forecast performance than existing hierarchical forecasting methods, which indicates the effectiveness of integrating uncertainty into the reconciliation process.
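The robust formulation itself is a semidefinite program, which is beyond a short sketch, but the classical covariance-weighted reconciliation it robustifies fits in a few lines: project the incoherent base forecasts onto the coherent subspace under a weighted least-squares criterion. The toy hierarchy below (one total, two children) and the covariance estimate are illustrative.

```python
import numpy as np

# Two bottom-level series and one total: the hierarchy enforces total = a + b.
S = np.array([[1.0, 1.0],      # summing matrix, rows ordered [total, a, b]
              [1.0, 0.0],
              [0.0, 1.0]])

y_hat = np.array([10.5, 4.0, 7.0])   # incoherent base forecasts
W = np.diag([1.0, 0.5, 0.5])          # estimated forecast-error covariance

# Covariance-weighted reconciliation: minimize the weighted squared residuals
# (y_hat - S b)^T W^{-1} (y_hat - S b) over bottom-level forecasts b.
Winv = np.linalg.inv(W)
b = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv @ y_hat)
y_tilde = S @ b

print(y_tilde)                        # coherent: first entry equals the sum
print(np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2]))
```

The paper's contribution, not shown here, is to replace the single estimated W with a worst case over an uncertainty set around it.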
[611] Additive Models Explained: A Computational Complexity Approach
Shahaf Bassan, Michal Moshkovitz, Guy Katz
Main category: cs.LG
TL;DR: Theoretical analysis reveals that generating explanations for Generalized Additive Models (GAMs) has surprisingly diverse computational complexity outcomes, heavily influenced by input space structure, component models, task type (regression vs classification), and explanation methods.
Details
Motivation: While GAMs are considered interpretable ML models, there's an assumption that obtaining meaningful explanations for them should be computationally efficient. This paper challenges that hypothesis by rigorously analyzing the computational complexity of generating different types of explanations for various GAM forms.
Method: Theoretical computational complexity analysis under standard assumptions (P!=NP). Examines different GAM forms, component models, input domains, and explanation methods across regression and classification tasks.
Result: Key findings: (1) Explanation complexity heavily depends on input space structure; (2) Complexity varies with component models but only under specific input domains; (3) Significant complexity differences between regression and classification tasks; (4) Expressing complex models additively (like neural additive models) makes them easier to explain, but only for certain explanation methods and input domains.
Conclusion: The paper provides a rigorous theoretical understanding of when computing explanations for GAMs is feasible or provably hard, challenging assumptions about GAM interpretability and revealing a complex computational landscape.
Abstract: Generalized Additive Models (GAMs) are commonly considered interpretable within the ML community, as their structure makes the relationship between inputs and outputs relatively understandable. Therefore, it may seem natural to hypothesize that obtaining meaningful explanations for GAMs could be performed efficiently and would not be computationally infeasible. In this work, we challenge this hypothesis by analyzing the computational complexity of generating different explanations for various forms of GAMs across multiple contexts. Our analysis reveals a surprisingly diverse landscape of both positive and negative complexity outcomes. Particularly, under standard complexity assumptions such as P!=NP, we establish several key findings: (1) in stark contrast to many other common ML models, the complexity of generating explanations for GAMs is heavily influenced by the structure of the input space; (2) the complexity of explaining GAMs varies significantly with the types of component models used - but interestingly, these differences only emerge under specific input domain settings; (3) significant complexity distinctions appear for obtaining explanations in regression tasks versus classification tasks in GAMs; and (4) expressing complex models like neural networks additively (e.g., as neural additive models) can make them easier to explain, though interestingly, this benefit appears only for certain explanation methods and input domains. Collectively, these results shed light on the feasibility of computing diverse explanations for GAMs, offering a rigorous theoretical picture of the conditions under which such computations are possible or provably hard.
[612] TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter
Main category: cs.LG
TL;DR: TabPFN-2.5 is a next-generation tabular foundation model that scales to 50K data points and 2K features, outperforming tree-based models and matching complex ensembles on industry benchmarks while offering distillation for production deployment.
Details
Motivation: To advance tabular AI by creating a more scalable foundation model that can handle larger datasets while maintaining superior performance over existing methods, and to provide practical deployment solutions for production use cases.
Method: Develops TabPFN-2.5 as a tabular foundation model with 20x capacity increase over previous version, uses distillation engine to convert the model into compact MLP or tree ensembles for production deployment.
Result: Achieves leading performance on TabArena benchmark, substantially outperforms tuned tree-based models, matches AutoGluon 1.4 accuracy, has 100% win rate against default XGBoost on small-medium datasets, and 87% win rate on larger datasets.
Conclusion: TabPFN-2.5 represents a significant advancement in tabular foundation models with improved scalability and performance, while distillation enables practical production deployment with low latency.
Abstract: The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on it and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (<=10,000 data points, 500 features) and a 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.
[613] Data Heterogeneity and Forgotten Labels in Split Federated Learning
Joana Tirana, Dimitra Tsigkari, David Solans Noguero, Nicolas Kourtellis
Main category: cs.LG
TL;DR: SFL suffers from catastrophic forgetting due to sequential processing at server; Hydra method uses multi-head networks to mitigate this issue.
Details
Motivation: Split Federated Learning (SFL) with data heterogeneity causes catastrophic forgetting where models perform better on classes seen later in the sequence, similar to continual learning problems.
Method: Proposes Hydra, a novel mitigation method inspired by multi-head neural networks adapted for SFL setting to address sequential processing issues at server.
Result: Extensive numerical evaluations show Hydra outperforms baseline methods and existing literature approaches for SFL.
Conclusion: Hydra effectively addresses catastrophic forgetting in SFL with data heterogeneity, improving model performance across all classes regardless of processing sequence.
Abstract: In Split Federated Learning (SFL), the clients collaboratively train a model with the help of a server by splitting the model into two parts. Part-1 is trained locally at each client and aggregated by the aggregator at the end of each round. Part-2 is trained at a server that sequentially processes the intermediate activations received from each client. We study the phenomenon of catastrophic forgetting (CF) in SFL in the presence of data heterogeneity. In detail, due to the nature of SFL, local updates of part-1 may drift away from global optima, while part-2 is sensitive to the processing sequence, similar to forgetting in continual learning (CL). Specifically, we observe that the trained model performs better in classes (labels) seen at the end of the sequence. We investigate this phenomenon with emphasis on key aspects of SFL, such as the processing order at the server and the cut layer. Based on our findings, we propose Hydra, a novel mitigation method inspired by multi-head neural networks and adapted for the SFL setting. Extensive numerical evaluations show that Hydra outperforms baselines and methods from the literature.
[614] The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
Subramanyam Sahoo
Main category: cs.LG
TL;DR: CTVP is a verification framework that detects backdoors in LLM-generated code by analyzing consistency in predicted execution traces across semantically equivalent program transformations, using semantic orbit analysis and an Adversarial Robustness Quotient metric.
Details
Motivation: As LLMs increasingly generate code with minimal human oversight, there are critical concerns about backdoor injection and malicious behavior in AI-generated code, necessitating robust verification methods that don't require direct execution of potentially malicious code.
Method: CTVP leverages the model’s own predictions of execution traces across semantically equivalent program transformations (semantic orbit analysis). It analyzes consistency patterns in these predicted traces to detect behavioral anomalies indicative of backdoors, and introduces the Adversarial Robustness Quotient (ARQ) to quantify verification computational cost.
Result: The approach demonstrates exponential growth of ARQ with orbit size, establishes information-theoretic bounds showing non-gamifiability (adversaries cannot improve through training due to fundamental space complexity constraints), but notes practical deployment requires addressing high false positive rates observed in initial evaluations.
Conclusion: Semantic orbit analysis provides a theoretically grounded approach to AI control for code generation tasks, offering a novel verification method that doesn’t require executing potentially malicious code, though practical challenges remain with false positive rates.
Abstract: Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model’s own predictions of execution traces across semantically equivalent program transformations. By analyzing consistency patterns in these predicted traces, we detect behavioral anomalies indicative of backdoors. Our approach introduces the Adversarial Robustness Quotient (ARQ), which quantifies the computational cost of verification relative to baseline generation, demonstrating exponential growth with orbit size. Theoretical analysis establishes information-theoretic bounds showing non-gamifiability - adversaries cannot improve through training due to fundamental space complexity constraints. This work demonstrates that semantic orbit analysis provides a theoretically grounded approach to AI control for code generation tasks, though practical deployment requires addressing the high false positive rates observed in initial evaluations.
[615] Softly Constrained Denoisers for Diffusion Models
Victor M. Yeom-Song, Severi Rissanen, Arno Solin, Samuel Kaski, Mingfei Sun
Main category: cs.LG
TL;DR: Softly constrained denoisers integrate constraint guidance directly into diffusion model architecture rather than loss or sampling, improving compliance while maintaining flexibility for misspecified constraints.
Details
Motivation: Diffusion models struggle with constraint satisfaction in scientific applications. Existing methods (loss regularization or sampling guidance) bias models away from true data distribution, which is problematic when constraints are misspecified - a common issue in scientific domains where constraint formulation is challenging.
Method: Propose integrating guidance-inspired adjustments directly into the denoiser architecture instead of using regularization terms in the loss or guidance during sampling. This creates a soft inductive bias towards constraint-compliant samples while maintaining model flexibility.
Result: Softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, while maintaining enough flexibility to deviate from constraints when they are misspecified with observed data.
Conclusion: Direct integration of constraint guidance into denoiser architecture provides a better balance between constraint satisfaction and model fidelity compared to existing methods, especially important for scientific applications where constraint misspecification is common.
Abstract: Diffusion models struggle to produce samples that respect constraints, a common requirement in scientific applications. Recent approaches have introduced regularization terms in the loss or guidance methods during sampling to enforce such constraints, but they bias the generative model away from the true data distribution. This is a problem when the constraint is misspecified, which is a common issue in scientific applications where constraint formulation is challenging. We propose to integrate guidance-inspired adjustments to the denoiser, instead of the loss or sampling loop. This achieves a soft inductive bias towards constraint-compliant samples. We show that these softly constrained denoisers exploit constraint knowledge to improve compliance over standard denoisers, while maintaining enough flexibility to deviate from it in case of misspecification with observed data.
[616] Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction
Qianyi Chen, Bo Li
Main category: cs.LG
TL;DR: A method to improve conditional coverage in conformal prediction by minimizing mean squared error of conditional coverage through density-weighted quantile regression with theoretical guarantees.
Details
Motivation: Conformal prediction provides marginal coverage guarantees but struggles with reliable conditional coverage for specific inputs. While exact distribution-free conditional coverage is impossible with finite samples, there's a need to improve conditional coverage of standard conformal procedures beyond relaxed notions.
Method: Proposes minimizing mean squared error of conditional coverage by refining quantile regression components. Uses Taylor expansion to derive density-weighted pinball loss objective, where weights are conditional density of conformity score at true quantile. Implements three-headed quantile network that estimates weights via finite differences using auxiliary quantile levels at 1-α±δ, then fine-tunes central quantile by optimizing weighted loss.
Result: Provides theoretical analysis with exact non-asymptotic guarantees characterizing excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.
Conclusion: The proposed method effectively improves conditional coverage in conformal prediction through density-weighted quantile regression refinement with theoretical guarantees and empirical validation.
Abstract: While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distribution-free conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at $(1-\alpha \pm \delta)$, subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.
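A simplified sketch of the density-weighted pinball loss described in the method summary: the conditional density at the (1-alpha)-quantile is approximated by a finite difference of the two auxiliary quantile heads, and that estimate weights the usual pinball loss of the central head. The numbers are toy values, and the three-headed network and fine-tuning loop are omitted.

```python
import numpy as np

alpha, delta = 0.1, 0.05

def pinball(residual, tau):
    """Standard pinball (quantile) loss for residual = y - q."""
    return np.maximum(tau * residual, (tau - 1) * residual)

def density_weighted_pinball(y, q_lo, q_mid, q_hi):
    """Pinball loss at level 1-alpha, weighted by a finite-difference estimate
    of the conditional density of the score at its (1-alpha)-quantile.
    q_lo, q_mid, q_hi are the network's estimates at levels
    1-alpha-delta, 1-alpha, 1-alpha+delta."""
    dens = 2 * delta / np.maximum(q_hi - q_lo, 1e-6)   # density estimate
    return np.mean(dens * pinball(y - q_mid, 1 - alpha))

# Toy values standing in for one mini-batch of scores and quantile outputs.
y = np.array([1.2, 0.7, 2.1])
q_lo = np.array([0.9, 0.5, 1.8])
q_mid = np.array([1.0, 0.6, 2.0])
q_hi = np.array([1.1, 0.8, 2.3])
print(density_weighted_pinball(y, q_lo, q_mid, q_hi))
```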
[617] SolarGPT-QA: A Domain-Adaptive Large Language Model for Educational Question Answering in Space Weather and Heliophysics
Santosh Chapagain, MohammadReza EskandariNasab, Onur Vural, Shah Muhammad Hamdi, Soukaina Filali Boubrahimi
Main category: cs.LG
TL;DR: SolarGPT-QA is a domain-adapted LLM built on LLaMA-3, trained on space science literature and GPT-4-generated Q&A data, designed for educational explanations of solar phenomena with improved clarity and scientific accuracy.
Details
Motivation: Solar activity impacts critical infrastructure but general LLMs lack domain-specific knowledge and pedagogical capability to explain complex space science concepts clearly for educational purposes.
Method: Built on LLaMA-3 base model, trained using scientific literature and large-scale Q&A data generated with GPT-4 and refined with Grok-3 in student-friendly storytelling style, combining domain-adaptive pretraining with pedagogical fine-tuning.
Result: Outperforms general-purpose models in zero-shot settings, achieves competitive performance vs instruction-tuned models for educational explanations, shows improved clarity/accessibility in student comprehension study, and demonstrates importance of combining domain adaptation with pedagogical tuning.
Conclusion: SolarGPT-QA represents initial step toward broader SolarGPT framework for space science education and forecasting, showing domain adaptation + pedagogical fine-tuning balances scientific accuracy with educational effectiveness.
Abstract: Solar activity, including solar flares, coronal mass ejections (CMEs), and geomagnetic storms, can significantly impact satellites, aviation, power grids, data centers, and space missions. Extreme solar events can cause substantial economic damage with limited advance warning, underscoring the importance of early-warning systems, accurate forecasting, and effective education in space science. Although large language models (LLMs) perform well on general tasks, they often lack domain-specific knowledge and pedagogical capability to clearly explain complex space science concepts. We introduce SolarGPT-QA, a question answering system based on a domain-adapted large language model built on the LLaMA-3 base model. The model is trained using scientific literature and large-scale question-answer data generated with GPT-4 and refined using Grok-3 in a student-friendly storytelling style. Human pairwise evaluations show that SolarGPT-QA outperforms general-purpose models in zero-shot settings and achieves competitive performance compared to instruction-tuned models for educational explanations in space weather and heliophysics. A small pilot student comprehension study further suggests improved clarity and accessibility of the generated explanations. Ablation experiments indicate that combining domain-adaptive pretraining with pedagogical fine-tuning is important for balancing scientific accuracy and educational effectiveness. This work represents an initial step toward a broader SolarGPT framework for space science education and forecasting.
[618] Entropic Risk-Aware Monte Carlo Tree Search
Pedro P. Santos, Jacopo Silvestrin, Alberto Sardinha, Francisco S. Melo
Main category: cs.LG
TL;DR: A provably correct Monte Carlo tree search algorithm for solving risk-aware Markov decision processes with entropic risk measure objectives, featuring non-asymptotic analysis showing convergence and polynomial regret concentration.
Details
Motivation: The paper addresses the need for efficient algorithms to solve risk-aware Markov decision processes with entropic risk measure objectives, which are important for decision-making under uncertainty but computationally challenging.
Method: The authors propose a Monte Carlo tree search algorithm that exploits dynamic programming formulations for risk-aware MDPs with ERM objectives, using an upper confidence bound-based tree search approach with provable correctness guarantees.
Result: The algorithm is shown to be correct (empirical ERM converges to optimal ERM) and enjoys polynomial regret concentration. Experimental results demonstrate its effectiveness compared to relevant baselines.
Conclusion: The proposed risk-aware MCTS algorithm provides a provably correct and efficient solution for solving risk-aware MDPs with ERM objectives, with theoretical guarantees and practical effectiveness demonstrated through experiments.
Abstract: We propose a provably correct Monte Carlo tree search (MCTS) algorithm for solving risk-aware Markov decision processes (MDPs) with entropic risk measure (ERM) objectives. We provide a non-asymptotic analysis of our proposed algorithm, showing that the algorithm: (i) is correct in the sense that the empirical ERM obtained at the root node converges to the optimal ERM; and (ii) enjoys polynomial regret concentration. Our algorithm successfully exploits the dynamic programming formulations for solving risk-aware MDPs with ERM objectives introduced by previous works in the context of an upper confidence bound-based tree search algorithm. Finally, we provide a set of illustrative experiments comparing our risk-aware MCTS method against relevant baselines.
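As background for the objective being optimized, here is a short sketch of the empirical entropic risk measure that such a tree search estimates at the root node. The sign convention for beta and the numerical stabilization are standard choices; the exploration bonus used by the actual algorithm is paper-specific and not shown.

```python
import math

def empirical_erm(returns, beta):
    """Empirical entropic risk measure: (1/beta) * log(mean exp(beta * G)).

    beta < 0 gives a risk-averse evaluation of returns; as beta -> 0 the
    measure recovers the ordinary mean (handled explicitly below).
    """
    n = len(returns)
    if beta == 0:
        return sum(returns) / n
    m = max(beta * g for g in returns)               # log-sum-exp stabilization
    log_mean_exp = m + math.log(sum(math.exp(beta * g - m) for g in returns) / n)
    return log_mean_exp / beta
```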
[619] Noninvasive Intracranial Pressure Estimation Using Subspace System Identification and Bespoke Machine Learning Algorithms: A Learning-to-Rank Approach
Anni Zhao, Ayca Ermis, Jeffrey Robert Vitt, Sergio Brasil, Wellingson Paiva, Magdalena Kasprowicz, Malgorzata Burzynska, Robert Hamilton, Runze Yan, Ofer Sadan, J. Claude Hemphill, Lieven Vandenberghe, Xiao Hu
Main category: cs.LG
TL;DR: Machine learning framework for noninvasive intracranial pressure estimation using system identification and ranking-constrained optimization from arterial blood pressure, cerebral blood velocity, and R-R interval signals.
Details
Motivation: Accurate noninvasive estimation of intracranial pressure (ICP) is a major challenge in critical care, as current methods require invasive procedures. There's a need for safe, broadly accessible ICP monitoring for patients with acute brain injury.
Method: Combines subspace system identification to model cerebral hemodynamics using ABP, CBv, and R-R interval signals, with a ranking-constrained convex optimization framework to learn mapping functions between noninvasive signal features and estimation errors.
Result: Approximately 31.88% of testing entries achieved estimation errors within 2 mmHg and 34.07% between 2 and 6 mmHg using the nonlinear mapping with constraints, demonstrating the feasibility of the approach.
Conclusion: The proposed noninvasive ICP estimation approach shows promise but requires further validation and refinement before clinical deployment, potentially enabling safer and more accessible ICP monitoring for brain injury patients.
Abstract: Accurate noninvasive estimation of intracranial pressure (ICP) remains a major challenge in critical care. We developed a bespoke machine learning algorithm that integrates system identification and ranking-constrained optimization to estimate mean ICP from noninvasive signals. A machine learning framework was proposed to obtain accurate mean ICP values using arbitrary noninvasive signals. The subspace system identification algorithm is employed to identify cerebral hemodynamics models for ICP simulation using arterial blood pressure (ABP), cerebral blood velocity (CBv), and R-wave to R-wave interval (R-R interval) signals in a comprehensive database. A mapping function to describe the relationship between the features of noninvasive signals and the estimation errors is learned using innovative ranking constraints through convex optimization. Patients across multiple clinical settings were randomly split into testing and training datasets for performance evaluation of the mapping function. The results indicate that about 31.88% of testing entries achieved estimation errors within 2 mmHg and 34.07% of testing entries between 2 mmHg and 6 mmHg from the nonlinear mapping with constraints. Our results demonstrate the feasibility of the proposed noninvasive ICP estimation approach. Further validation and technical refinement are required before clinical deployment, but this work lays the foundation for safe and broadly accessible ICP monitoring in patients with acute brain injury and related conditions.
[620] Partial Feedback Online Learning
Shihao Shao, Cong Fang, Zhouchen Lin, Dacheng Tao
Main category: cs.LG
TL;DR: The paper introduces partial-feedback online learning where learners only see one acceptable label per instance, develops new theoretical tools (collection version space, PFLdim, PMSdim) to characterize learnability, and shows deterministic and randomized learnability coincide under certain conditions.
Details
Motivation: To address the limitations of classical online learning frameworks where learners receive full feedback, focusing on scenarios where only partial feedback (one acceptable label per instance) is available, which requires new theoretical tools beyond traditional version space approaches.
Method: Introduces collection version space (maintaining sets of hypotheses), defines Partial-Feedback Littlestone dimension (PFLdim) and Partial-Feedback Measure Shattering dimension (PMSdim), develops theoretical framework for set-realizable regime, and analyzes conditions for deterministic vs randomized learnability.
Result: PFLdim and PMSdim tightly characterize minimax regret for deterministic and randomized learners respectively; identifies nested inclusion condition where deterministic and randomized learnability coincide; shows linear minimax regret can occur even with small hypothesis spaces beyond set realizability.
Conclusion: The paper establishes fundamental theoretical foundations for partial-feedback online learning, providing tight characterizations of learnability and resolving open questions about the relationship between deterministic and randomized learning in this setting.
Abstract: We study a new learning protocol, termed partial-feedback online learning, where each instance admits a set of acceptable labels, but the learner observes only one acceptable label per round. We highlight that, while classical version space is widely used for online learnability, it does not directly extend to this setting. We address this obstacle by introducing a collection version space, which maintains sets of hypotheses rather than individual hypotheses. Using this tool, we obtain a tight characterization of learnability in the set-realizable regime. In particular, we define the Partial-Feedback Littlestone dimension (PFLdim) and the Partial-Feedback Measure Shattering dimension (PMSdim), and show that they tightly characterize the minimax regret for deterministic and randomized learners, respectively. We further identify a nested inclusion condition under which deterministic and randomized learnability coincide, resolving an open question of Raman et al. (2024b). Finally, given a hypothesis space H, we show that beyond set realizability, the minimax regret can be linear even when |H|=2, highlighting a barrier beyond set realizability.
[621] Boosting CVaR Policy Optimization with Quantile Gradients
Yudong Luo, Erick Delage
Main category: cs.LG
TL;DR: Proposes a more sample-efficient approach to optimizing Conditional Value-at-Risk (CVaR) in reinforcement learning by augmenting CVaR with an expected quantile term that leverages all sampled data through dynamic programming.
Details
Motivation: CVaR optimization using policy gradient (CVaR-PG) suffers from sample inefficiency because it focuses only on tail-end performance and ignores many sampled trajectories, making it impractical for real-world applications.
Method: Augments CVaR with an expected quantile term, which admits a dynamic programming formulation that can leverage all sampled data rather than just tail-end trajectories, improving sample efficiency while preserving the CVaR objective.
Result: Empirical results in domains with verifiable risk-averse behavior show the proposed algorithm substantially improves upon CVaR-PG and consistently outperforms other existing methods within the Markovian policy class.
Conclusion: The quantile-augmented approach provides a more sample-efficient method for CVaR optimization in reinforcement learning while maintaining the risk-averse properties of the original CVaR objective.
Abstract: Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of the quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
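To make the relationship between CVaR and the quantile concrete, here is a small sketch of empirical VaR/CVaR and an illustrative quantile-augmented objective; the mixing weight lam is a hypothetical knob, not a quantity taken from the paper.

```python
import numpy as np

def cvar_and_quantile(returns, alpha):
    """Empirical VaR (alpha-quantile of returns) and CVaR (mean of the worst
    alpha-fraction). CVaR is the expectation of the quantile over the tail,
    which is why adding an expected-quantile term keeps the objective consistent.
    """
    returns = np.asarray(returns, dtype=float)
    q_alpha = np.quantile(returns, alpha)           # value-at-risk
    tail = returns[returns <= q_alpha]
    cvar = tail.mean() if tail.size else q_alpha
    return q_alpha, cvar

def augmented_objective(returns, alpha, lam=0.5):
    """Illustrative surrogate: CVaR plus a weighted quantile term (lam is a
    hypothetical mixing weight, not the paper's)."""
    q_alpha, cvar = cvar_and_quantile(returns, alpha)
    return cvar + lam * q_alpha
```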
[622] Non-Intrusive Graph-Based Bot Detection for E-Commerce Using Inductive Graph Neural Networks
Sichen Zhao, Zhiming Xue, Yalun Qi, Xianling Zeng, Zihan Yu
Main category: cs.LG
TL;DR: Graph-based bot detection framework for e-commerce using inductive graph neural networks to identify automated behavior through user session modeling
Details
Motivation: Malicious bots are increasingly sophisticated in e-commerce platforms, evading traditional detection methods like IP blacklists and CAPTCHAs through proxies, botnets, and AI-assisted strategies, requiring more advanced detection approaches.
Method: Non-intrusive graph-based framework that models user session behavior through graph representation and applies inductive graph neural network for classification, capturing both relational structure and behavioral semantics.
Result: Outperforms session-level multilayer perceptron baseline in AUC and F1 score on real-world e-commerce traffic; remains robust under adversarial perturbations and generalizes effectively to unseen sessions and URLs
Conclusion: The proposed framework is deployment-friendly, integrates with existing systems without client-side instrumentation, supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments
Abstract: Malicious bots pose a growing threat to e-commerce platforms by scraping data, hoarding inventory, and perpetrating fraud. Traditional bot mitigation techniques, including IP blacklists and CAPTCHA-based challenges, are increasingly ineffective or intrusive, as modern bots leverage proxies, botnets, and AI-assisted evasion strategies. This work proposes a non-intrusive graph-based bot detection framework for e-commerce that models user session behavior through a graph representation and applies an inductive graph neural network for classification. The approach captures both relational structure and behavioral semantics, enabling accurate identification of subtle automated activity that evades feature-based methods. Experiments on real-world e-commerce traffic demonstrate that the proposed inductive graph model outperforms a strong session-level multilayer perceptron baseline in terms of AUC and F1 score. Additional adversarial perturbation and cold-start simulations show that the model remains robust under moderate graph modifications and generalizes effectively to previously unseen sessions and URLs. The proposed framework is deployment-friendly, integrates with existing systems without client-side instrumentation, and supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments.
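A minimal, library-free sketch of an inductive GraphSAGE-style classifier of the kind such a framework could build on; the session-graph construction, node features, and two-layer depth are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MeanSAGELayer(nn.Module):
    """GraphSAGE-style mean aggregation: inductive because it relies only on
    local neighborhood features, so unseen sessions and URLs can be scored."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x, adj):
        # x: [N, in_dim] node features; adj: dense [N, N] adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp_min(1.0)
        neigh = adj @ x / deg                       # mean of neighbor features
        return torch.relu(self.lin(torch.cat([x, neigh], dim=1)))

class BotClassifier(nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.g1 = MeanSAGELayer(in_dim, hidden)
        self.g2 = MeanSAGELayer(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x, adj):
        h = self.g2(self.g1(x, adj), adj)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # per-node bot probability
```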
[623] MeshGraphNet-Transformer: Scalable Mesh-based Learned Simulation for Solid Mechanics
Mikel M. Iparraguirre, Iciar Alfaro, David Gonzalez, Elias Cueto
Main category: cs.LG
TL;DR: MGN-T combines Transformers with MeshGraphNets for efficient physics simulation on high-resolution meshes by enabling global information propagation instead of iterative message passing.
Details
Motivation: Standard MeshGraphNets suffer from inefficient long-range information propagation on large, high-resolution meshes due to iterative message passing, limiting their ability to handle industrial-scale physics simulations with complex geometries and boundary conditions.
Method: MGN-T integrates a physics-attention Transformer as a global processor that updates all nodal states simultaneously while preserving node and edge attributes, eliminating the need for deep message-passing stacks or hierarchical meshes.
Result: MGN-T successfully handles industrial-scale meshes for impact dynamics where standard MGN fails, accurately models self-contact, plasticity, and multivariate outputs, and outperforms state-of-the-art approaches on classical benchmarks with higher accuracy and fewer parameters.
Conclusion: The combination of Transformers’ global modeling capabilities with MeshGraphNets’ geometric inductive bias enables efficient learning on high-resolution meshes with varying geometries, topologies, and boundary conditions at industrial scale.
Abstract: We present MeshGraphNet-Transformer (MGN-T), a novel architecture that combines the global modeling capabilities of Transformers with the geometric inductive bias of MeshGraphNets, while preserving a mesh-based graph representation. MGN-T overcomes a key limitation of standard MGN, the inefficient long-range information propagation caused by iterative message passing on large, high-resolution meshes. A physics-attention Transformer serves as a global processor, updating all nodal states simultaneously while explicitly retaining node and edge attributes. By directly capturing long-range physical interactions, MGN-T eliminates the need for deep message-passing stacks or hierarchical, coarsened meshes, enabling efficient learning on high-resolution meshes with varying geometries, topologies, and boundary conditions at an industrial scale. We demonstrate that MGN-T successfully handles industrial-scale meshes for impact dynamics, a setting in which standard MGN fails due to message-passing under-reaching. The method accurately models self-contact, plasticity, and multivariate outputs, including internal, phenomenological plastic variables. Moreover, MGN-T outperforms state-of-the-art approaches on classical benchmarks, achieving higher accuracy while maintaining practical efficiency, using only a fraction of the parameters required by competing baselines.
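A rough sketch of what a global attention processor over mesh nodes looks like, assuming pre-encoded node features; the handling of edge attributes and the physics-attention specifics of MGN-T are not reproduced here.

```python
import torch
import torch.nn as nn

class GlobalNodeProcessor(nn.Module):
    """Illustrative global processor: full self-attention over all mesh nodes,
    so information propagates across the mesh in one step instead of many
    message-passing rounds. Edge attributes would be kept alongside for the
    decoder; they are omitted for brevity."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, node_feats):                  # [batch, num_nodes, dim]
        h, _ = self.attn(node_feats, node_feats, node_feats)
        h = self.norm1(node_feats + h)
        return self.norm2(h + self.mlp(h))
```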
[624] Alignment-Aware Model Adaptation via Feedback-Guided Optimization
Gaurav Bhatt, Aditya Chinchure, Jiawei Zhou, Leonid Sigal
Main category: cs.LG
TL;DR: Alignment-aware fine-tuning framework that integrates external alignment signals through policy-gradient regularization with adaptive gating to balance supervised and alignment-driven gradients, preventing degradation of safety and hallucination avoidance during downstream adaptation.
Details
Motivation: Standard fine-tuning approaches optimize task objectives in isolation and degrade alignment objectives like safety and hallucination avoidance, failing to correct pre-existing misaligned behavior in foundation models.
Method: Proposes policy-gradient-based regularization with adaptive gating that dynamically balances supervised and alignment-driven gradients per sample, prioritizing uncertain/misaligned cases while allowing well-aligned examples to follow standard updates. Learns abstention behavior for fully misaligned inputs.
Result: Experiments on instruction-tuning benchmarks show consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Demonstrates robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations.
Conclusion: Adaptively gated alignment optimization is an effective approach for alignment-preserving and alignment-recovering model adaptation, maintaining safety and reducing hallucinations during fine-tuning.
Abstract: Fine-tuning is the primary mechanism for adapting foundation models to downstream tasks; however, standard approaches largely optimize task objectives in isolation and do not account for secondary yet critical alignment objectives (e.g., safety and hallucination avoidance). As a result, downstream fine-tuning can degrade alignment and fail to correct pre-existing misaligned behavior. We propose an alignment-aware fine-tuning framework that integrates feedback from an external alignment signal through policy-gradient-based regularization. Our method introduces an adaptive gating mechanism that dynamically balances supervised and alignment-driven gradients on a per-sample basis, prioritizing uncertain or misaligned cases while allowing well-aligned examples to follow standard supervised updates. The framework further learns abstention behavior for fully misaligned inputs, incorporating conservative responses directly into the fine-tuned model. Experiments on general and domain-specific instruction-tuning benchmarks demonstrate consistent reductions in harmful and hallucinated outputs without sacrificing downstream task performance. Additional analyses show robustness to adversarial fine-tuning, prompt-based attacks, and unsafe initializations, establishing adaptively gated alignment optimization as an effective approach for alignment-preserving and alignment-recovering model adaptation.
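A minimal sketch of per-sample adaptive gating between a supervised loss and an alignment-driven loss; the sigmoid gate, the 0.5 threshold, and the temperature are hypothetical choices standing in for the paper's learned mechanism.

```python
import torch

def gated_loss(sup_loss, align_loss, align_score, temperature=1.0):
    """Per-sample adaptive gate: samples flagged as risky by the alignment
    signal (low align_score in [0, 1]) weight the alignment-driven term more
    heavily, while well-aligned samples follow the ordinary supervised update.

    sup_loss, align_loss, align_score: tensors of shape [batch].
    """
    gate = torch.sigmoid((0.5 - align_score) / temperature)   # high when misaligned
    return ((1.0 - gate) * sup_loss + gate * align_loss).mean()
```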
[625] Membership Inference Attacks from Causal Principles
Mathieu Even, Clément Berenfeld, Linus Bleistein, Tudor Cebere, Julie Josse, Aurélien Bellet
Main category: cs.LG
TL;DR: Causal inference framework for Membership Inference Attacks (MIAs) that defines memorization as causal effect of data inclusion, addresses biases in one-run/zero-run methods, and provides principled estimators with consistency guarantees.
Details
Motivation: Standard MIA evaluation requires repeated retraining which is computationally expensive for large models. One-run and zero-run methods are used as alternatives but their statistical validity is unclear. The paper aims to provide a principled foundation for privacy evaluation in modern AI systems.
Method: Frames MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. Derives causal analogues of standard MIA metrics and proposes practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees.
Result: Experiments on real-world data show the approach enables reliable memorization measurement even when retraining is impractical and under distribution shift. The framework reveals and formalizes key sources of bias in existing protocols.
Conclusion: Provides a principled causal inference foundation for privacy evaluation in modern AI systems, enabling reliable memorization measurement across different evaluation regimes with theoretical guarantees.
Abstract: Membership Inference Attacks (MIAs) are widely used to quantify training data memorization and assess privacy risks. Standard evaluation requires repeated retraining, which is computationally costly for large models. One-run methods (single training with randomized data inclusion) and zero-run methods (post hoc evaluation) are often used instead, though their statistical validity remains unclear. To address this gap, we frame MIA evaluation as a causal inference problem, defining memorization as the causal effect of including a data point in the training set. This novel formulation reveals and formalizes key sources of bias in existing protocols: one-run methods suffer from interference between jointly included points, while zero-run evaluations popular for LLMs are confounded by non-random membership assignment. We derive causal analogues of standard MIA metrics and propose practical estimators for multi-run, one-run, and zero-run regimes with non-asymptotic consistency guarantees. Experiments on real-world data show that our approach enables reliable memorization measurement even when retraining is impractical and under distribution shift, providing a principled foundation for privacy evaluation in modern AI systems.
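A tiny sketch of the multi-run causal estimand: memorization of a point as the difference in its mean membership score between retrainings that included it and retrainings that held it out. The one-run and zero-run estimators with their bias corrections are the paper's contribution and are not shown.

```python
import numpy as np

def memorization_effect(scores_in, scores_out):
    """Multi-run causal estimate of memorization for a single data point.

    scores_in:  membership scores from runs where the point was in the training set
    scores_out: membership scores from runs where it was held out
    Inclusion is assumed to be randomized across runs, so the mean difference
    estimates the causal effect of including the point.
    """
    return float(np.mean(scores_in) - np.mean(scores_out))
```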
[626] SC3D: Dynamic and Differentiable Causal Discovery for Temporal and Instantaneous Graphs
Sourajit Das, Dibyajyoti Chakraborty, Romit Maulik
Main category: cs.LG
TL;DR: SC3D is a two-stage differentiable framework for discovering causal structures in multivariate time series, jointly learning lag-specific adjacency matrices and instantaneous DAGs through edge preselection and refinement.
Details
Motivation: Causal discovery from multivariate time series is challenging due to interactions across multiple lags and potential instantaneous dependencies, with the search space being combinatorial in nature. Existing methods need improvement in stability and accuracy for recovering both lagged and instantaneous causal structures.
Method: Two-stage differentiable framework: Stage 1 performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges. Stage 2 refines these masks by optimizing a likelihood with sparsity while enforcing acyclicity on the instantaneous block.
Result: SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines across synthetic and benchmark dynamical systems.
Conclusion: The proposed SC3D framework effectively addresses the combinatorial search space problem in dynamic causal discovery and provides a stable, accurate solution for recovering complex temporal causal structures with both lagged and instantaneous dependencies.
Abstract: Discovering causal structures from multivariate time series is a key problem because interactions span across multiple lags and possibly involve instantaneous dependencies. Additionally, the search space of the dynamic graphs is combinatorial in nature. In this study, we propose Stable Causal Dynamic Differentiable Discovery (SC3D), a two-stage differentiable framework that jointly learns lag-specific adjacency matrices and, if present, an instantaneous directed acyclic graph (DAG). In Stage 1, SC3D performs edge preselection through node-wise prediction to obtain masks for lagged and instantaneous edges, whereas Stage 2 refines these masks by optimizing a likelihood with sparsity along with enforcing acyclicity on the instantaneous block. Numerical results across synthetic and benchmark dynamical systems demonstrate that SC3D achieves improved stability and more accurate recovery of both lagged and instantaneous causal structures compared to existing temporal baselines.
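For the acyclicity requirement on the instantaneous block, a commonly used differentiable penalty is the NOTEARS trace-exponential shown below; it is one standard way to enforce a DAG constraint and may differ from SC3D's exact formulation.

```python
import torch

def acyclicity_penalty(A):
    """NOTEARS-style differentiable acyclicity penalty for an instantaneous
    adjacency matrix A: h(A) = trace(exp(A * A)) - d, which is zero iff the
    weighted graph is acyclic. A standard choice, not necessarily the exact
    one used by SC3D."""
    d = A.shape[0]
    return torch.trace(torch.matrix_exp(A * A)) - d
```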
[627] Optimization and Generation in Aerodynamics Inverse Design
Huaguan Chen, Ning Lin, Luxi Chen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun
Main category: cs.LG
TL;DR: Inverse design framework for aerodynamic shape optimization using physics-based objectives, with methods for optimization and guided generation validated on 2D/3D benchmarks.
Details
Motivation: Inverse design with physics-based objectives is challenging due to high-dimensional geometry coupled with expensive simulations, particularly in aerodynamic shape optimization for drag reduction.
Method: Proposes new training loss for cost predictors and density-gradient optimization method; unifies existing training-free guided generation methods; develops time- and memory-efficient algorithm for approximate covariance estimation to address high-dimensional conditional covariance approximation.
Result: Experiments on controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft) validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes demonstrate consistent gains in both optimization and guided generation.
Conclusion: The approach shows effectiveness for inverse design problems with physics-based objectives, with offline RL results further supporting the generality of the method.
Abstract: Inverse design with physics-based objectives is challenging because it couples high-dimensional geometry with expensive simulations, as exemplified by aerodynamic shape optimization for drag reduction. We revisit inverse design through two canonical solutions, the optimal design point and the optimal design distribution, and relate them to optimization and guided generation. Building on this view, we propose a new training loss for cost predictors and a density-gradient optimization method that improves objectives while preserving plausible shapes. We further unify existing training-free guided generation methods. To address their inability to approximate conditional covariance in high dimensions, we develop a time- and memory-efficient algorithm for approximate covariance estimation. Experiments on a controlled 2D study and high-fidelity 3D aerodynamic benchmarks (car and aircraft), validated by OpenFOAM simulations and miniature wind-tunnel tests with 3D-printed prototypes, demonstrate consistent gains in both optimization and guided generation. Additional offline RL results further support the generality of our approach.
[628] Separation-Utility Pareto Frontier: An Information-Theoretic Characterization
Shizhou Xu
Main category: cs.LG
TL;DR: The paper studies the trade-off between utility and separation fairness in machine learning, characterizes the Pareto frontier theoretically, and develops a conditional mutual information regularizer for practical implementation.
Details
Motivation: To understand and manage the fundamental trade-off between model utility (accuracy) and separation fairness (predictive independence from sensitive attributes given true outcomes), which is crucial for deploying fair ML systems in practice.
Method: Uses information theory to characterize the utility-separation Pareto frontier, proves its concavity, and develops a conditional mutual information (CMI) regularizer that can be integrated with any gradient-based deep learning model to enforce separation constraints.
Result: Theoretical characterization of the Pareto frontier shows increasing marginal cost of separation in terms of utility. Empirical results on COMPAS, UCI Adult, UCI Bank, and CelebA datasets show the method substantially reduces separation violations while matching or exceeding baseline utility.
Conclusion: The paper provides a provable, stable, and flexible approach to enforcing separation fairness in deep learning through information-theoretic analysis and practical CMI regularization that offers tractable guarantees during training.
Abstract: We study the Pareto frontier (optimal trade-off) between utility and separation, a fairness criterion requiring predictive independence from sensitive attributes conditional on the true outcome. Through an information-theoretic lens, we prove a characterization of the utility-separation Pareto frontier, establish its concavity, and thereby prove the increasing marginal cost of separation in terms of utility. In addition, we characterize the conditions under which this trade-off becomes strict, providing a guide for trade-off selection in practice. Based on the theoretical characterization, we develop an empirical regularizer based on conditional mutual information (CMI) between predictions and sensitive attributes given the true outcome. The CMI regularizer is compatible with any deep model trained via gradient-based optimization and serves as a scalar monitor of residual separation violations, offering tractable guarantees during training. Finally, numerical experiments support our theoretical findings: across COMPAS, UCI Adult, UCI Bank, and CelebA, the proposed method substantially reduces separation violations while matching or exceeding the utility of established baseline methods. This study thus offers a provable, stable, and flexible approach to enforcing separation in deep learning.
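A plug-in estimator of the conditional mutual information I(pred; sens | y) for discrete variables, usable as the scalar monitor the paper describes; the paper's trainable regularizer would need a differentiable estimate, which is not shown here.

```python
import numpy as np

def conditional_mutual_information(pred, sens, y):
    """Plug-in estimate of I(pred; sens | y) in nats for discrete arrays.
    Zero when predictions are independent of the sensitive attribute given
    the true outcome, i.e. when separation holds exactly."""
    pred, sens, y = map(np.asarray, (pred, sens, y))
    cmi = 0.0
    for yv in np.unique(y):
        mask = y == yv
        p_y = mask.mean()
        p, s = pred[mask], sens[mask]
        for pv in np.unique(p):
            for sv in np.unique(s):
                p_ps = np.mean((p == pv) & (s == sv))
                if p_ps == 0:
                    continue
                p_p, p_s = np.mean(p == pv), np.mean(s == sv)
                cmi += p_y * p_ps * np.log(p_ps / (p_p * p_s))
    return cmi
```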
[629] Maximum-Volume Nonnegative Matrix Factorization
Olivier Vu Thanh, Nicolas Gillis
Main category: cs.LG
TL;DR: MaxVol NMF maximizes volume of H instead of minimizing volume of W, leading to sparser decompositions and avoiding rank-deficient solutions, with applications in hyperspectral unmixing.
Details
Motivation: While minimum-volume NMF (MinVol NMF) improves interpretability by minimizing volume of W, it can generate rank-deficient solutions. The authors propose maximizing volume of H instead to obtain sparser decompositions and avoid rank deficiency.
Method: Proposes maximum-volume NMF (MaxVol NMF) which maximizes volume of H rather than minimizing volume of W. Develops two algorithms to solve MaxVol NMF and a normalized variant that bridges standard NMF and orthogonal NMF.
Result: MaxVol NMF is identifiable under same conditions as MinVol NMF in noiseless case but behaves differently with noise. MaxVol NMF extracts sparser decompositions, avoids rank-deficient solutions, and corresponds to clustering columns of X in disjoint clusters.
Conclusion: MaxVol NMF offers advantages over MinVol NMF for obtaining sparse, non-rank-deficient solutions, with the normalized variant performing best and providing continuum between standard and orthogonal NMF.
Abstract: Nonnegative matrix factorization (NMF) is a popular data embedding technique. Given a nonnegative data matrix $X$, it aims at finding two lower dimensional matrices, $W$ and $H$, such that $X\approx WH$, where the factors $W$ and $H$ are constrained to be element-wise nonnegative. The factor $W$ serves as a basis for the columns of $X$. In order to obtain more interpretable and unique solutions, minimum-volume NMF (MinVol NMF) minimizes the volume of $W$. In this paper, we consider the dual approach, where the volume of $H$ is maximized instead; this is referred to as maximum-volume NMF (MaxVol NMF). MaxVol NMF is identifiable under the same conditions as MinVol NMF in the noiseless case, but it behaves rather differently in the presence of noise. In practice, MaxVol NMF is much more effective to extract a sparse decomposition and does not generate rank-deficient solutions. In fact, we prove that the solutions of MaxVol NMF with the largest volume correspond to clustering the columns of $X$ in disjoint clusters, while the solutions of MinVol NMF with smallest volume are rank deficient. We propose two algorithms to solve MaxVol NMF. We also present a normalized variant of MaxVol NMF that exhibits better performance than MinVol NMF and MaxVol NMF, and can be interpreted as a continuum between standard NMF and orthogonal NMF. We illustrate our results in the context of hyperspectral unmixing.
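An illustrative evaluation of a MaxVol-style objective, trading data fit against the log-volume of H; lam and delta are hypothetical hyperparameters, and the paper's two algorithms and normalized variant are not reproduced.

```python
import numpy as np

def maxvol_nmf_objective(X, W, H, lam=0.1, delta=1e-6):
    """Illustrative MaxVol NMF objective: Frobenius data-fit term minus a
    reward for large volume of H, measured by log det(H H^T + delta*I).
    W and H are assumed element-wise nonnegative; lam and delta are
    hypothetical, not the paper's settings."""
    fit = np.linalg.norm(X - W @ H, "fro") ** 2
    r = H.shape[0]
    logdet = np.linalg.slogdet(H @ H.T + delta * np.eye(r))[1]
    return fit - lam * logdet
```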
cs.MA
[630] AI Agent Systems for Supply Chains: Structured Decision Prompts and Memory Retrieval
Konosuke Yoshizato, Kazuma Shimizu, Ryota Higa, Takanobu Otsuka
Main category: cs.MA
TL;DR: LLM-based multi-agent systems for inventory management with proposed AIM-RM agent using similarity matching for better adaptation across supply chain scenarios.
Details
Motivation: To address uncertainties about whether LLM-based multi-agent systems can consistently derive optimal ordering policies and adapt to diverse supply chain scenarios in inventory management.
Method: Examines LLM-based MAS with fixed-ordering strategy prompts, then proposes AIM-RM agent that leverages similar historical experiences through similarity matching to enhance adaptability.
Result: LLM-based MAS can determine optimal ordering decisions in restricted scenarios even without detailed prompt adjustments; AIM-RM outperforms benchmark methods across various supply chain scenarios.
Conclusion: LLM-based multi-agent systems show promise for inventory management, with AIM-RM demonstrating robustness and adaptability across diverse supply chain scenarios.
Abstract: This study investigates large language model (LLM)-based multi-agent systems (MASs) as a promising approach to inventory management, which is a key component of supply chain management. Although these systems have gained considerable attention for their potential to address the challenges associated with typical inventory management methods, key uncertainties regarding their effectiveness persist. Specifically, it is unclear whether LLM-based MASs can consistently derive optimal ordering policies and adapt to diverse supply chain scenarios. To address these questions, we examine an LLM-based MAS with a fixed-ordering strategy prompt that encodes the stepwise processes of the problem setting and a safe-stock strategy commonly used in inventory management. Our empirical results demonstrate that, even without detailed prompt adjustments, an LLM-based MAS can determine optimal ordering decisions in a restricted scenario. To enhance adaptability, we propose a novel agent called AIM-RM, which leverages similar historical experiences through similarity matching. Our results show that AIM-RM outperforms benchmark methods across various supply chain scenarios, highlighting its robustness and adaptability.
[631] Learning to Share: Selective Memory for Efficient Parallel Agentic Systems
Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah
Main category: cs.MA
TL;DR: LTS introduces a learned shared-memory mechanism for parallel agentic systems that enables selective cross-team information reuse to reduce redundant computations while maintaining or improving task performance.
Details
Motivation: Parallel agentic systems suffer from significant computational inefficiency due to redundant reasoning and tool execution across different teams working on similar sub-problems, leading to substantial overlapping computation.
Method: Proposes Learning to Share (LTS) with a global memory bank accessible to all teams and a lightweight controller trained via stepwise reinforcement learning with usage-aware credit assignment to selectively admit useful intermediate steps to shared memory.
Result: LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines on AssistantBench and GAIA benchmarks.
Conclusion: Learned memory admission is an effective strategy for improving efficiency of parallel agentic systems by enabling selective cross-team information reuse while controlling context growth.
Abstract: Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/
[632] PhysicsAgentABM: Physics-Guided Generative Agent-Based Modeling
Kavana Venkatesh, Yinhan He, Jundong Li, Jiaming Cui
Main category: cs.MA
TL;DR: PhysicsAgentABM combines symbolic agents with neural transition models for scalable, calibrated agent-based simulations using LLM-driven clustering to reduce computational costs.
Details
Motivation: LLM-based multi-agent systems are expensive to scale and poorly calibrated for timestep-aligned simulation, while classical ABMs struggle with rich individual-level signals and non-stationary behaviors.
Method: Uses behaviorally coherent agent clusters: state-specialized symbolic agents encode transition priors, multimodal neural transition model captures temporal/interaction dynamics, uncertainty-aware epistemic fusion yields calibrated cluster-level transitions. ANCHOR clustering uses LLM-driven behavioral responses with contrastive loss to reduce LLM calls.
Result: Reduces LLM calls by 6-8 times, shows consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines across public health, finance, and social sciences.
Conclusion: PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs by re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion.
Abstract: Large language model (LLM)-based multi-agent systems enable expressive agent reasoning but are expensive to scale and poorly calibrated for timestep-aligned state-transition simulation, while classical agent-based models (ABMs) offer interpretability but struggle to integrate rich individual-level signals and non-stationary behaviors. We propose PhysicsAgentABM, which shifts inference to behaviorally coherent agent clusters: state-specialized symbolic agents encode mechanistic transition priors, a multimodal neural transition model captures temporal and interaction dynamics, and uncertainty-aware epistemic fusion yields calibrated cluster-level transition distributions. Individual agents then stochastically realize transitions under local constraints, decoupling population inference from entity-level variability. We further introduce ANCHOR, an LLM agent-driven clustering strategy based on cross-contextual behavioral responses and a novel contrastive loss, reducing LLM calls by up to 6-8 times. Experiments across public health, finance, and social sciences show consistent gains in event-time accuracy and calibration over mechanistic, neural, and LLM baselines. By re-architecting generative ABM around population-level inference with uncertainty-aware neuro-symbolic fusion, PhysicsAgentABM establishes a new paradigm for scalable and calibrated simulation with LLMs.
[633] Exploring Silicon-Based Societies: An Early Study of the Moltbook Agent Community
Yu-Zheng Lin, Bono Po-Jen Shih, Hsuan-Ying Alessandra Chien, Shalaka Satam, Jesus Horacio Pacheco, Sicong Shao, Soheil Salehi, Pratik Satam
Main category: cs.MA
TL;DR: Data-driven silicon sociology framework for studying large-scale autonomous agent ecosystems using data mining techniques on agent-created social structures.
Details
Motivation: The emergence of persistent, large-scale autonomous agent ecosystems requires systematic empirical frameworks beyond anecdotal observation or small-scale simulation to understand collective behavior and social structure formation.
Method: Analyzed Moltbook platform with 150,000+ agents across thousands of sub-communities; collected 12,758 agent-authored sub-community descriptions; applied preprocessing, contextual embedding, and unsupervised clustering to uncover latent thematic patterns.
Result: Autonomous agents systematically organize collective space through reproducible patterns including human-mimetic interests, silicon-centric self-reflection, and early-stage economic/coordination behaviors, emerging directly from machine-generated data traces.
Conclusion: Establishes methodological foundation for data-driven silicon sociology, demonstrating data mining as powerful lens for understanding organization and evolution of large autonomous agent societies without predefined sociological taxonomies.
Abstract: The rapid emergence of autonomous large language model agents has given rise to persistent, large-scale agent ecosystems whose collective behavior cannot be adequately understood through anecdotal observation or small-scale simulation. This paper introduces data-driven silicon sociology as a systematic empirical framework for studying social structure formation among interacting artificial agents. We present a pioneering large-scale data mining investigation of an in-the-wild agent society by analyzing Moltbook, a social platform designed primarily for agent-to-agent interaction. At the time of study, Moltbook hosted over 150,000 registered autonomous agents operating across thousands of agent-created sub-communities. Using programmatic and non-intrusive data acquisition, we collected and analyzed the textual descriptions of 12,758 submolts, which represent proactive sub-community partitioning activities within the ecosystem. Treating agent-authored descriptions as first-class observational artifacts, we apply rigorous preprocessing, contextual embedding, and unsupervised clustering techniques to uncover latent patterns of thematic organization and social space structuring. The results show that autonomous agents systematically organize collective space through reproducible patterns spanning human-mimetic interests, silicon-centric self-reflection, and early-stage economic and coordination behaviors. Rather than relying on predefined sociological taxonomies, these structures emerge directly from machine-generated data traces. This work establishes a methodological foundation for data-driven silicon sociology and demonstrates that data mining techniques can provide a powerful lens for understanding the organization and evolution of large autonomous agent societies.
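A compact sketch of the embed-then-cluster pipeline such an analysis relies on; the sentence-embedding model and the cluster count are illustrative choices, not the ones used in the study.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_descriptions(descriptions, n_clusters=20):
    """Embed agent-authored sub-community descriptions and cluster them to
    surface latent thematic structure. Model name and cluster count are
    illustrative assumptions, not values reported in the paper."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(descriptions, normalize_embeddings=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    return labels
```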
cs.MM
[634] XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning
Hanwen Zhang, Yao Liu, Peiyuan Jiang, Lang Junjie, Xie Jun, Yihui He, Yajiao Deng, Siyu Du, Qiao Liu
Main category: cs.MM
TL;DR: XEmoGPT is a novel explainable multimodal emotion recognition framework that enhances fine-grained emotional cue perception and reasoning through specialized video/audio cue bridges and a large-scale EmoCue dataset.
Details
Motivation: Current multimodal emotion recognition approaches struggle with cue-level perception and reasoning due to: 1) general-purpose modality encoders lacking sensitivity to fine-grained emotional cues, and 2) datasets trading off annotation quality vs scale, providing insufficient supervision for emotional cues. Existing evaluation metrics also fail to assess cue-level reasoning performance.
Method: Proposes XEmoGPT with two specialized modules: Video Emotional Cue Bridge (VECB) and Audio Emotional Cue Bridge (AECB) that enhance video/audio encoders through carefully designed tasks for fine-grained emotional cue perception. Constructs large-scale EmoCue dataset to teach cue-level reasoning, and introduces EmoCue-360 metric for automated evaluation using semantic similarity matching.
Result: XEmoGPT achieves strong performance in both emotional cue perception and reasoning. The framework demonstrates effectiveness through experimental validation on the proposed benchmark.
Conclusion: XEmoGPT addresses key limitations in explainable multimodal emotion recognition by improving fine-grained cue perception and reasoning capabilities, supported by novel dataset construction and evaluation metrics.
Abstract: Explainable Multimodal Emotion Recognition (EMER) plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.
eess.AS
[635] ARCHI-TTS: A flow-matching-based Text-to-Speech Model with Self-supervised Semantic Aligner and Accelerated Inference
Chunyat Wu, Jiajun Deng, Zhengxi Liu, Zheqi Dai, Haolin He, Qiuqiang Kong
Main category: eess.AS
TL;DR: ARCHI-TTS: A diffusion-based non-autoregressive TTS system with semantic aligner for better text-speech alignment and efficient inference via feature reuse across denoising steps.
Details
Motivation: Current diffusion-based TTS systems face challenges with text-speech alignment modeling and high computational overhead from iterative denoising processes.
Method: Proposes ARCHI-TTS with a dedicated semantic aligner for robust temporal/semantic consistency, efficient inference by reusing encoder features across denoising steps, and auxiliary CTC loss on condition encoder.
Result: Achieves WER of 1.98% on LibriSpeech-PC test-clean, 1.47%/1.42% on SeedTTS test-en/test-zh with high inference efficiency, outperforming recent state-of-the-art TTS systems.
Conclusion: ARCHI-TTS effectively addresses alignment and efficiency challenges in diffusion-based TTS while maintaining high synthesis quality.
Abstract: Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS that features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances the semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with a high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
[636] Exterior sound field estimation based on physics-constrained kernel
Juliano G. C. Ribeiro, Ryo Matsuda, Jorge Trevino
Main category: eess.AS
TL;DR: Gaussian process interpolation method for exterior sound fields using trainable point source kernel, outperforming conventional spherical wave functions and physics-informed ML models.
Details
Motivation: Exterior sound field interpolation is challenging due to requirements for specific array configurations and prior source knowledge. Existing methods lack flexibility in microphone distribution and require manual parameter tuning.
Method: Proposes Gaussian process interpolation with point source reproducing kernel featuring trainable inner product formulation. The method learns parameters directly from recordings, works with arbitrary microphone distributions, and automatically attenuates higher harmonic orders.
Result: Achieves approximately 2 dB lower interpolation error on average compared to conventional spherical wave functions and established physics-informed ML models within 100 Hz to 2.5 kHz range, with more consistent reconstruction of ground truth sound fields.
Conclusion: The trainable kernel approach provides flexible, distribution-agnostic sound field interpolation that outperforms existing methods while requiring less prior knowledge about source conditions.
Abstract: Exterior sound field interpolation is a challenging problem that often requires specific array configurations and prior knowledge of the source conditions. We propose an interpolation method based on Gaussian processes using a point source reproducing kernel with a trainable inner product formulation made to fit exterior sound fields. While this estimation does not have a closed formula, it allows for the definition of a flexible estimator that is not restricted by microphone distribution and attenuates higher harmonic orders automatically with parameters directly optimized from the recordings, meaning an arbitrary distribution of microphones can be used. The proposed kernel estimator is compared in simulated experiments to the conventional method using spherical wave functions and an established physics-informed machine learning model, achieving lower interpolation error by approximately 2 dB on average within the analyzed frequency range of 100 Hz to 2.5 kHz and reconstructing the ground truth sound field more consistently within the target region.
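A rough sketch of Gaussian-process interpolation with a point-source kernel built from free-field Green's functions to a grid of candidate exterior sources; the source grid, the fixed weights (trainable in the paper), and the noise level are illustrative assumptions rather than the paper's estimator.

```python
import numpy as np

def greens(r, s, k):
    """Free-field Green's functions between points r [N,3] and sources s [M,3]
    at wavenumber k (points assumed not to coincide with candidate sources)."""
    d = np.linalg.norm(r[:, None, :] - s[None, :, :], axis=-1)
    return np.exp(-1j * k * d) / (4 * np.pi * d)

def gp_interpolate(p_obs, r_mic, r_tgt, r_src, k, weights, noise=1e-3):
    """GP posterior mean of the exterior field at r_tgt, using a point-source
    kernel K(r, r') = sum_m w_m g(r, s_m) conj(g(r', s_m)).

    p_obs: complex pressures at the microphones r_mic; r_src is a hypothetical
    grid of candidate exterior sources with (here fixed) weights."""
    G_mic = greens(r_mic, r_src, k)                  # [N, M]
    G_tgt = greens(r_tgt, r_src, k)                  # [T, M]
    W = np.diag(weights)
    K = G_mic @ W @ G_mic.conj().T                   # kernel Gram matrix [N, N]
    k_star = G_tgt @ W @ G_mic.conj().T              # cross-kernel [T, N]
    alpha = np.linalg.solve(K + noise * np.eye(len(r_mic)), p_obs)
    return k_star @ alpha                            # estimated field at r_tgt
```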
[637] Wave-Trainer-Fit: Neural Vocoder with Trainable Prior and Fixed-Point Iteration towards High-Quality Speech Generation from SSL features
Hien Ohnaka, Yuma Shirahata, Masaya Kawamura
Main category: eess.AS
TL;DR: WaveTrainerFit is a neural vocoder that generates high-quality waveforms from data-driven SSL features using improved diffusion-GAN architecture with trainable priors and reference-aware gain adjustment for faster, higher-quality synthesis.
Details
Motivation: The paper aims to improve neural vocoders for waveform generation from data-driven features like SSL (self-supervised learning) features. Current methods like WaveFit combine diffusion models and GANs but require many inference steps and may not optimally handle speech energy matching.
Method: WaveTrainerFit builds on WaveFit’s diffusion-GAN architecture with two key improvements: 1) Trainable priors that start inference from noise closer to target speech rather than Gaussian noise, 2) Reference-aware gain adjustment that imposes constraints on trainable priors to match speech energy. These reduce waveform modeling complexity.
Result: Experiments show WaveTrainerFit generates highly natural waveforms with improved speaker similarity from SSL features while requiring fewer iterations than WaveFit. The method works robustly across different depths of SSL feature extraction.
Conclusion: WaveTrainerFit enables high-quality waveform generation from data-driven features with fewer inference steps through trainable priors and energy-aware constraints, advancing neural vocoder technology for speech synthesis applications.
Abstract: We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model and a generative adversarial network. Furthermore, the proposed method incorporates the following key improvements: 1. By introducing trainable priors, the inference process starts from noise close to the target speech instead of Gaussian noise. 2. Reference-aware gain adjustment is performed by imposing constraints on the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we showed that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from data-driven features, while requiring fewer iterations than WaveFit. Moreover, we showed that the proposed method works robustly with respect to the depth at which SSL features are extracted. Code and pre-trained models are available from https://github.com/line/WaveTrainerFit.
[638] Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
Seohyun Joo, Yoori Oh
Main category: eess.AS
TL;DR: DAViHD introduces a dual-pathway audio encoder for video highlight detection, combining semantic understanding with spectro-temporal dynamics to better leverage audio cues.
Details
Motivation: Existing audio-visual highlight detection models underutilize audio, focusing mainly on high-level semantics while ignoring rich dynamic sound characteristics. There's a need to better leverage both semantic content and temporal dynamics of audio for improved highlight detection.
Method: Proposes DAViHD with dual-pathway audio encoder: 1) Semantic pathway for content understanding (speech, music, sound events), 2) Dynamic pathway with frequency-adaptive mechanism to capture spectro-temporal dynamics, transient acoustic events, and rapid energy changes. Integrated into full audio-visual framework.
Result: Achieves new state-of-the-art performance on the large-scale MrHiSum benchmark, demonstrating that sophisticated dual-faceted audio representation advances highlight detection.
Conclusion: A sophisticated dual-pathway audio encoder combining semantic and dynamic features is crucial for advancing audio-visual video highlight detection, showing that rich audio representation significantly improves performance.
Abstract: Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale MrHiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
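A rough sketch of the dual-pathway idea, one pathway for semantics and one for spectro-temporal dynamics with per-band gating, fused per frame; module choices and dimensions are hypothetical, not the authors' code:

```python
# Illustrative dual-pathway audio encoder: semantic pathway + frequency-gated dynamic pathway.
import torch
import torch.nn as nn


class DualPathwayAudioEncoder(nn.Module):
    def __init__(self, n_mels: int = 64, sem_dim: int = 256, dyn_dim: int = 128):
        super().__init__()
        # Semantic pathway: a temporal encoder standing in for a pretrained audio tagger.
        self.semantic = nn.GRU(n_mels, sem_dim, batch_first=True)
        # Dynamic pathway: per-frame gating over frequency bands ("frequency-adaptive"),
        # then a light temporal convolution to capture transients and energy changes.
        self.freq_gate = nn.Sequential(nn.Linear(n_mels, n_mels), nn.Sigmoid())
        self.dynamic = nn.Conv1d(n_mels, dyn_dim, kernel_size=5, padding=2)
        self.fuse = nn.Linear(sem_dim + dyn_dim, sem_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels)
        sem, _ = self.semantic(mel)                                  # (B, T, sem_dim)
        gated = mel * self.freq_gate(mel)                            # emphasize salient bands
        dyn = self.dynamic(gated.transpose(1, 2)).transpose(1, 2)    # (B, T, dyn_dim)
        return self.fuse(torch.cat([sem, dyn], dim=-1))              # fused per-frame feature


tokens = DualPathwayAudioEncoder()(torch.randn(2, 200, 64))          # (2, 200, 256)
```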
[639] Zero-Shot TTS With Enhanced Audio Prompts: Bsc Submission For The 2026 Wildspoof Challenge TTS Track
Jose Giraldo, Alex Peiró-Lilja, Rodolfo Zevallos, Cristina España-Bonet
Main category: eess.AS
TL;DR: StyleTTS2 and F5-TTS non-autoregressive models with flexible duration modeling for spontaneous speech, enhanced by Sidon-based noise reduction pipeline, achieving high MOS scores and analyzing prompt quality impact.
Details
Motivation: Address the challenge of generating realistic, spontaneous speech that occurs in-the-wild, which has natural prosody variations and often contains acoustic noise that degrades synthesis quality.
Method: Use two non-autoregressive architectures (StyleTTS2 and F5-TTS) with flexible duration modeling for prosodic naturalness. Implement multi-stage enhancement pipeline using Sidon model for noise reduction, outperforming standard Demucs. Fine-tune on enhanced audios and analyze reference prompt quality/length impact on zero-shot synthesis.
Result: Achieved up to 4.21 UTMOS and 3.47 DNSMOS scores. Sidon-based enhancement significantly outperforms Demucs in signal quality. Fine-tuning on enhanced audios yields superior robustness. Analysis shows reference prompt quality and length significantly impact zero-shot synthesis performance.
Conclusion: The proposed approach combining non-autoregressive TTS with flexible duration modeling and Sidon-based noise enhancement effectively addresses spontaneous in-the-wild speech generation, achieving high-quality results and demonstrating the importance of prompt quality for zero-shot synthesis.
Abstract: We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Our models utilize flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline using the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that finetuning enhanced audios yields superior robustness, achieving up to 4.21 UTMOS and 3.47 DNSMOS. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
[640] Segmentation-free Goodness of Pronunciation
Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi
Main category: eess.AS
TL;DR: Proposes segmentation-free GOP methods for mispronunciation detection using CTC-trained ASR models, achieving state-of-the-art phoneme-level pronunciation assessment.
Details
Motivation: Current phoneme-level mispronunciation detection systems rely on pre-segmentation of speech, limiting accuracy and preventing use of modern CTC-based acoustic models. Need segmentation-free methods for better MDD performance.
Method: Proposes two methods: 1) GOP-SA (self-alignment GOP) enabling CTC-trained ASR models for MDD, and 2) GOP-SF (segmentation-free GOP) considering all possible segmentations of canonical transcription. Includes theoretical framework, implementation solving numerical issues, and proper normalization for different acoustic models.
Result: Extensive experiments on CMU Kids and speechocean762 datasets show feature vectors from proposed methods achieve state-of-the-art results on phoneme-level pronunciation assessment. Methods work with different acoustic model peakiness and context amounts.
Conclusion: Segmentation-free GOP methods enable use of modern CTC-trained ASR models for mispronunciation detection, overcoming limitations of pre-segmentation approaches and achieving superior phoneme-level pronunciation assessment performance.
Abstract: Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer-aided language learning (CALL) systems. Most systems implementing phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method that takes all possible segmentations of the canonical transcription into account (GOP-SF). We give a theoretical account of our definition of GOP-SF, an implementation that solves potential numerical issues as well as a proper normalization which allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
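For background, a minimal sketch of the classic segmentation-based GOP that GOP-SA and GOP-SF generalize: given per-frame phone posteriors and a fixed phone segmentation, score the canonical phone against the best competitor. The posteriors and segment boundaries below are dummies; the paper's contribution is precisely to remove the dependence on this segmentation:

```python
# Classic GOP from frame-level log-posteriors and a given phone segment (illustrative only).
import numpy as np


def classic_gop(log_post: np.ndarray, start: int, end: int, phone_id: int) -> float:
    """log_post: (frames, n_phones) log-posteriors; [start, end) is the phone's segment."""
    seg = log_post[start:end]               # frames assigned to the canonical phone
    target = seg[:, phone_id].mean()        # average log-posterior of the canonical phone
    best = seg.max(axis=1).mean()           # average log-posterior of the best-scoring phone
    return float(target - best)             # close to 0 => likely well pronounced


rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(40), size=120)  # dummy posteriors over 40 phones, 120 frames
print(classic_gop(np.log(post), start=30, end=45, phone_id=7))
```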
[641] Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition
Bo-Hao Su, Hui-Ying Shih, Jinchuan Tian, Jiatong Shi, Chi-Chun Lee, Carlos Busso, Shinji Watanabe
Main category: eess.AS
TL;DR: An explainable Speech Language Model framework for Speech Emotion Recognition that generates both emotion labels and natural-language rationales based on lexical and acoustic cues, using teacher LLM-generated rationales as intermediate supervision.
Details
Motivation: Traditional SER systems use majority-voted labels which mask subjectivity, neglect minority annotations, and lack interpretability. There's a need for more transparent models that can explain their predictions while maintaining competitive performance.
Method: Proposes an explainable SpeechLM framework that frames SER as generative reasoning: 1) generates transcript from utterance, 2) outputs emotion label and concise natural-language rationale grounded in lexical/acoustic cues. Uses reasoning-capable teacher LLM to generate rationales as intermediate supervision, combined with majority labels during fine-tuning.
Result: On MSP-Podcast v1.12, the model maintains improvements over zero-shot SpeechLM baselines. Produces rationales that human evaluators find plausible and well-grounded. Uses annotator-aware scoring that credits matches with any annotator label, complementing traditional majority-label metrics.
Conclusion: Incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality, demonstrating that explainability can be enhanced while preserving competitive performance.
Abstract: Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority annotations and limits interpretability. We propose an explainable Speech Language Model (SpeechLM) framework that frames SER as a generative reasoning task. Given an utterance, the model first produces a transcript, then outputs both an emotion label and a concise natural-language rationale grounded in lexical and acoustic cues. Rationales are generated by a reasoning-capable teacher LLM and used as intermediate supervision, combined with majority labels during fine-tuning. Unlike prior work primarily focused on boosting classification accuracy, we aim to enhance explainability while preserving competitive performance. To this end, we complement majority-label metrics with annotator-aware scoring that credits matches with any annotator label. On MSP-Podcast v1.12, our model maintains improvements over zero-shot SpeechLM baselines, and produces rationales that human evaluators find plausible and well grounded. This demonstrates that incorporating rationale supervision offers a practical path toward interpretable SER without sacrificing predictive quality.
[642] UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang
Main category: eess.AS
TL;DR: A vocoder-free audio super-resolution framework using flow matching to directly generate complex-valued spectral coefficients and reconstruct waveforms via iSTFT, eliminating dependency on separate neural vocoders.
Details
Motivation: To overcome limitations of two-stage diffusion-based audio super-resolution methods that rely on pre-trained neural vocoders, which constrain final audio quality and complicate end-to-end optimization.
Method: Uses flow matching generative model to capture conditional distribution of complex-valued spectral coefficients, then directly reconstructs waveforms via inverse Short-Time Fourier Transform (iSTFT) without needing a separate vocoder.
Result: Achieves state-of-the-art performance on speech and general audio datasets, consistently producing high-fidelity 48 kHz audio across diverse upsampling factors.
Conclusion: The vocoder-free framework simplifies end-to-end optimization and overcomes vocoder performance bottlenecks in audio super-resolution, enabling direct high-quality waveform reconstruction.
Abstract: In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
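A small sketch of the vocoder-free output stage: the model's predicted complex STFT coefficients are inverted directly with torch.istft. The tensors below are random placeholders standing in for the flow-matching model's output:

```python
# Vocoder-free reconstruction: complex spectral coefficients -> waveform via iSTFT.
import torch

n_fft, hop = 1024, 256
frames, bins = 200, n_fft // 2 + 1

# Placeholder for the generator's output: real/imag parts of the full-band spectrogram.
spec = torch.complex(torch.randn(1, bins, frames), torch.randn(1, bins, frames))

# Direct waveform reconstruction without a separate neural vocoder.
wave = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                   window=torch.hann_window(n_fft), return_complex=False)
print(wave.shape)  # (1, ~frames * hop) waveform samples
```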
[643] Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
Nikita Kuzmin, Songting Liu, Kong Aik Lee, Eng Siong Chng
Main category: eess.AS
TL;DR: Stream-Voice-Anon adapts causal language model-based neural audio codec architectures for streaming speaker anonymization with improved intelligibility and emotion preservation while maintaining privacy protection.
Details
Motivation: Streaming speaker anonymization is crucial for online voice applications but remains underexplored. Existing NAC-based online LM systems are designed for voice conversion rather than anonymization, lacking proper privacy protection techniques.
Method: Adapts causal LM-based NAC architectures for streaming SA with anonymization techniques including pseudo-speaker representation sampling, speaker embedding mixing, diverse prompt selection strategies for LM conditioning, and exploration of dynamic vs fixed delay configurations for latency-privacy trade-offs.
Result: Achieves substantial improvements: up to 46% relative WER reduction in intelligibility, up to 28% UAR relative improvement in emotion preservation compared to previous SOTA streaming method DarkStream, with comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers (though 15% degradation against semi-informed attackers).
Conclusion: Stream-Voice-Anon successfully adapts modern causal LM-based NAC architectures for streaming speaker anonymization, demonstrating improved intelligibility and emotion preservation while maintaining privacy-latency trade-offs suitable for real-time applications.
Abstract: Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codec (NAC) provides superior speaker feature disentanglement and linguistic fidelity. NAC can also be used with causal language models (LM) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, a speaker embedding mixing and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% UAR relative) compared to the previous state-of-the-art streaming method DarkStream while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
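A hypothetical sketch of one of the anonymization ingredients listed above, pseudo-speaker generation by speaker embedding mixing: a few pool embeddings are blended with random convex weights and renormalized. The pool, dimensions, and mixing recipe are assumptions, not the paper's exact procedure:

```python
# Pseudo-speaker embedding via random convex mixing of pool speaker embeddings (illustrative).
import numpy as np


def pseudo_speaker(pool: np.ndarray, n_mix: int = 4, rng=None) -> np.ndarray:
    """pool: (n_speakers, dim) unit-norm speaker embeddings -> one pseudo-speaker embedding."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(pool), size=n_mix, replace=False)
    weights = rng.dirichlet(np.ones(n_mix))             # random convex combination
    mixed = (weights[:, None] * pool[idx]).sum(axis=0)
    return mixed / np.linalg.norm(mixed)                 # project back to the unit hypersphere


pool = np.random.default_rng(0).normal(size=(200, 192))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)
anon_embedding = pseudo_speaker(pool)
```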
[644] Audio Inpainting in Time-Frequency Domain with Phase-Aware Prior
Peter Balušík, Pavel Rajmic
Main category: eess.AS
TL;DR: Proposes a phase-aware signal prior method for time-frequency audio inpainting that outperforms existing approaches in quality and efficiency.
Details
Motivation: Existing time-frequency audio inpainting methods have limitations in reconstruction quality and computational efficiency, creating a need for better approaches.
Method: Uses a phase-aware signal prior exploiting instantaneous frequency estimates, formulates an optimization problem solved with generalized Chambolle-Pock algorithm.
Result: Outperforms deep-prior neural network and Janssen-TF methods in objective evaluation and subjective listening tests, with substantially reduced computational cost.
Conclusion: Proposed method improves state-of-the-art in time-frequency audio inpainting with better quality and efficiency than existing approaches.
Abstract: We address the problem of time-frequency audio inpainting, where the goal is to fill missing spectrogram portions with reliable information. Despite recent advances, existing approaches still face limitations in both reconstruction quality and computational efficiency. To bridge this gap, we propose a method that utilizes a phase-aware signal prior which exploits estimates of the instantaneous frequency. An optimization problem is formulated and solved using the generalized Chambolle-Pock algorithm. The proposed method is evaluated against other time-frequency inpainting methods, specifically a deep-prior audio inpainting neural network and the autoregression-based approach known as Janssen-TF. Our proposed approach surpassed these methods by a large margin in the objective evaluation as well as in the conducted subjective listening test, improving the state of the art. In addition, the reconstructions are obtained with a substantially reduced computational cost compared to alternative methods.
eess.IV
[645] Smart Diagnosis and Early Intervention in PCOS: A Deep Learning Approach to Women’s Reproductive Health
Shayan Abrar, Samura Rahman, Ishrat Jahan Momo, Mahjabin Tasnim Samiha, B. M. Shahria Alam, Mohammad Tahmid Noor, Nishat Tasnim Niloy
Main category: eess.IV
TL;DR: A transfer learning framework using DenseNet201 and ResNet50 achieves 99.80% accuracy for classifying ovarian ultrasound images to detect PCOS, with XAI methods for interpretability.
Details
Motivation: PCOS is a common disorder in reproductive-age women with serious long-term complications, making early detection crucial. Automated diagnosis systems using ultrasound images can improve clinical practice.
Method: Transfer learning with DenseNet201 and ResNet50 architectures trained on 3856 ultrasound images (224x224 pixels). Used MixUp and CutMix augmentation strategies with alpha values 0.25 and 0.4. Applied XAI methods (SHAP, Grad-CAM, LIME) for model interpretability.
Result: DenseNet201 achieved peak validation accuracy of 99.80% with validation loss of 0.617. The model demonstrated high performance in classifying cyst-infected vs non-infected ovarian ultrasound images.
Conclusion: The proposed automated system for medical image diagnosis shows high accuracy and interpretability, making it suitable for confident clinical application in PCOS detection.
Abstract: Polycystic Ovary Syndrome (PCOS) is a widespread disorder in women of reproductive age, characterized by a hormonal imbalance, irregular periods, and multiple ovarian cysts. Infertility, metabolic syndrome, and cardiovascular risks are long-term complications that make early detection essential. In this paper, we design a powerful framework based on transfer learning utilizing DenseNet201 and ResNet50 for classifying ovarian ultrasound images. The model was trained on an online dataset containing 3856 ultrasound images of cyst-infected and non-infected patients. Each ultrasound frame was resized to 224x224 pixels and encoded with precise pathological indicators. The MixUp and CutMix augmentation strategies were used to improve generalization, yielding a peak validation accuracy of 99.80% by Densenet201 and a validation loss of 0.617 with alpha values of 0.25 and 0.4, respectively. We evaluated the model’s interpretability using leading Explainable AI (XAI) approaches such as SHAP, Grad-CAM, and LIME, reasoning with and presenting explicit visual reasons for the model’s behaviors, therefore increasing the model’s transparency. This study proposes an automated system for medical picture diagnosis that may be used effectively and confidently in clinical practice.
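A minimal sketch of the MixUp augmentation mentioned above (alpha controls the interpolation strength; CutMix follows the same idea but swaps a rectangular patch instead). Shapes and labels are dummies:

```python
# MixUp: blend two (image, one-hot label) pairs with a Beta(alpha, alpha)-sampled weight.
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.25, rng=None):
    """x arrays share a shape; y are one-hot label vectors."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2


img_a, img_b = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # e.g. cyst-infected vs. non-infected
mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b, alpha=0.25)
```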
[646] AI-Based Detection of In-Treatment Changes from Prostate MR-Linac Images
Seungbin Park, Peilin Wang, Ryan Pennell, Emily S. Weg, Himanshu Nagar, Timothy McClure, Mert R. Sabuncu, Daniel Margolis, Heejong Kim
Main category: eess.IV
TL;DR: AI model predicts temporal order of MR-Linac images to detect radiation-induced changes in prostate cancer patients over short intervals (average 2 days)
Details
Motivation: To investigate whether routinely acquired longitudinal MR-Linac images can be leveraged to characterize subtle treatment-induced changes during radiotherapy, particularly inter-fraction changes over short intervals.
Method: Retrospective study of 0.35T MR-Linac images from 761 patients using deep learning model to predict temporal order of paired images; trained with first-last fraction pairs (F1-FL) then all pairs; assessed with accuracy/AUC metrics, radiologist comparison, saliency maps, and input ablation experiments
Result: F1-FL model achieved near-perfect performance (AUC 0.99), significantly outperforming radiologist; All-pairs model AUC 0.97; primary predictive regions were prostate, bladder, and pubic symphysis; performance correlated with fraction intervals
Conclusion: Model accurately predicts temporal order of MR-Linac fractions and detects radiation-induced changes over short intervals, confirming prostate and adjacent organ alterations; underscores MR-Linac’s potential for advanced image analysis beyond image guidance
Abstract: Purpose: To investigate whether routinely acquired longitudinal MR-Linac images can be leveraged to characterize treatment-induced changes during radiotherapy, particularly subtle inter-fraction changes over short intervals (average of 2 days). Materials and Methods: This retrospective study included a series of 0.35T MR-Linac images from 761 patients. An artificial intelligence (deep learning) model was used to characterize treatment-induced changes by predicting the temporal order of paired images. The model was first trained with the images from the first and the last fractions (F1-FL), then with all pairs (All-pairs). Model performance was assessed using quantitative metrics (accuracy and AUC), compared to a radiologist’s performance, and qualitative analyses - the saliency map evaluation to investigate affected anatomical regions. Input ablation experiments were performed to identify the anatomical regions altered by radiotherapy. The radiologist conducted an additional task on partial images reconstructed by saliency map regions, reporting observations as well. Quantitative image analysis was conducted to investigate the results from the model and the radiologist. Results: The F1-FL model yielded near-perfect performance (AUC of 0.99), significantly outperforming the radiologist. The All-pairs model yielded an AUC of 0.97. This performance reflects therapy-induced changes, supported by the performance correlation to fraction intervals, ablation tests and expert’s interpretation. Primary regions driving the predictions were prostate, bladder, and pubic symphysis. Conclusion: The model accurately predicts temporal order of MR-Linac fractions and detects radiation-induced changes over one or a few days, including prostate and adjacent organ alterations confirmed by experts. This underscores MR-Linac’s potential for advanced image analysis beyond image guidance.
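A rough sketch of how the temporal-order task can be posed: two fractions from the same patient are stacked along the channel axis and a binary classifier predicts whether they are in chronological order. The architecture below is a placeholder, not the authors' model:

```python
# Pairwise temporal-order prediction as a binary classification problem (illustrative).
import torch
import torch.nn as nn


class TemporalOrderNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, img_a, img_b):
        # img_a, img_b: (batch, 1, H, W) slices from two treatment fractions of one patient
        return self.net(torch.cat([img_a, img_b], dim=1))   # logit: "a precedes b"


model = TemporalOrderNet()
logit = model(torch.randn(4, 1, 128, 128), torch.randn(4, 1, 128, 128))
loss = nn.functional.binary_cross_entropy_with_logits(logit.squeeze(1), torch.ones(4))
```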
[647] Personalized White Matter Bundle Segmentation for Early Childhood
Elyssa M. McMaster, Michael E. Kim, Nancy R. Newlin, Gaurav Rudravaram, Adam M. Saunders, Aravind R. Krishnan, Jongyeon Yoon, Ji S. Kim, Bryce L. Geeraert, Meaghan V. Perdue, Catherine Lebel, Daniel Moyer, Kurt G. Schilling, Laurie E. Cutting, Bennett A. Landman
Main category: eess.IV
TL;DR: A deep learning model for pediatric white matter bundle segmentation from diffusion MRI, inspired by TractSeg but modified for pediatric-specific bundle definitions, showing statistically significant improvements over TractSeg.
Details
Motivation: Existing white matter segmentation methods lack pediatric-specific approaches, and the authors hypothesize that a deep learning model similar to TractSeg but adapted for pediatric bundle definitions will improve segmentation accuracy compared to expert-labeled ground truth.
Method: Modified TractSeg’s 2D UNet architecture with pediatric-specific bundle definitions as inputs, used k-fold cross validation, and implemented masked Dice loss. Evaluated on 56 manually labeled white matter bundles using Dice score, volume overlap, and volume overreach metrics.
Result: The pediatric-specific model showed statistical significance across all bundles for all metrics (except one case in volume overlap) compared to TractSeg. Combined output masks created a 60-label atlas that produced smoother, continuous masks in cases where TractSeg failed.
Conclusion: The pediatric-specific deep learning approach significantly improves white matter pathway segmentation, enabling better understanding of neurodevelopment and more reliable individualized anatomy estimation for pediatric white matter diseases.
Abstract: White matter segmentation methods from diffusion magnetic resonance imaging range from streamline clustering-based approaches to bundle mask delineation, but none have proposed a pediatric-specific approach. We hypothesize that a deep learning model with a similar approach to TractSeg will improve similarity between an algorithm-generated mask and an expert-labeled ground truth. Given a cohort of 56 manually labelled white matter bundles, we take inspiration from TractSeg’s 2D UNet architecture, and we modify the inputs to match bundle definitions as determined by pediatric experts, the evaluation to use k-fold cross validation, and the loss function to masked Dice loss. We evaluate Dice score, volume overlap, and volume overreach of 16 major regions of interest compared to the expert-labeled dataset. To test whether our approach offers statistically significant improvements over TractSeg, we compare Dice voxels, volume overlap, and adjacency voxels with a Wilcoxon signed rank test followed by false discovery rate correction. We find statistical significance across all bundles for all metrics, with one exception in volume overlap. After we run TractSeg and our model, we combine their output masks into a 60-label atlas to evaluate whether TractSeg and our model combined can generate a robust, individualized atlas, and observe smoothed, continuous masks in cases where TractSeg did not produce an anatomically plausible output. With improved white matter pathway segmentation masks, we can further understand neurodevelopment at a population scale, and we can produce reliable estimates of individualized anatomy in pediatric white matter diseases and disorders.
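A minimal sketch of a masked Dice loss of the kind described above: voxels outside a validity mask are excluded from both the intersection and the union. The exact masking used by the authors may differ:

```python
# Masked Dice loss: compute Dice only over voxels where the mask is 1 (illustrative).
import torch


def masked_dice_loss(pred, target, mask, eps: float = 1e-6) -> torch.Tensor:
    """pred, target, mask: (batch, ...) tensors; pred in [0, 1], target/mask binary."""
    pred, target = pred * mask, target * mask
    dims = tuple(range(1, pred.ndim))
    inter = (pred * target).sum(dim=dims)
    denom = pred.sum(dim=dims) + target.sum(dim=dims)
    dice = (2 * inter + eps) / (denom + eps)
    return 1 - dice.mean()


loss = masked_dice_loss(torch.rand(2, 1, 64, 64),
                        torch.randint(0, 2, (2, 1, 64, 64)).float(),
                        torch.ones(2, 1, 64, 64))
```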
[648] Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance
Maojun Zhang, Haotian Wu, Richeng Jin, Deniz Gunduz, Krystian Mikolajczyk
Main category: eess.IV
TL;DR: A video compression framework using generative diffusion models with semantic representations, camera trajectories, and foreground segmentation for extremely low bit-rate video reconstruction.
Details
Motivation: Traditional video codecs and learning-based approaches fail at semantic reconstruction at extremely low bit-rates because they rely on low-level spatiotemporal redundancies. There's a need for a new paradigm that leverages high-level semantic understanding for compression.
Method: Proposes a video compression framework that compresses high-level semantic representations and uses a conditional diffusion model for frame reconstruction. Motion information is characterized using global camera trajectories (represented by camera pose parameters) and foreground segmentation (using sparse segmentation masks) to improve compression efficiency.
Result: The method achieves significantly improved compression efficiency, enabling decent video reconstruction at extremely low bit-rates by leveraging generative priors and semantic understanding.
Conclusion: Generative models, particularly diffusion models, offer a promising new paradigm for video compression by moving beyond low-level redundancies to leverage high-level semantic understanding, enabling effective reconstruction at ultra-low bit-rates.
Abstract: Modern video codecs and learning-based approaches struggle with semantic reconstruction at extremely low bit-rates due to their reliance on low-level spatiotemporal redundancies. Generative models, especially diffusion models, offer a new paradigm for video compression by leveraging high-level semantic understanding and powerful visual synthesis. This paper proposes a video compression framework that integrates generative priors to drastically reduce bit-rate while maintaining reconstruction fidelity. Specifically, our method compresses high-level semantic representations of the video, then uses a conditional diffusion model to reconstruct frames from these semantics. To further improve compression, we characterize motion information with global camera trajectories and foreground segmentation: background motion is compactly represented by camera pose parameters, while foreground dynamics are captured by sparse segmentation masks. This significantly boosts compression efficiency, enabling decent video reconstruction at extremely low bit-rates.
[649] Context-Aware Asymmetric Ensembling for Interpretable Retinopathy of Prematurity Screening via Active Query and Vascular Attention
Md. Mehedi Hassan, Taufiq Hasan
Main category: eess.IV
TL;DR: CAA Ensemble model for ROP screening uses clinical context-aware asymmetric streams to simulate clinical reasoning, achieving SOTA performance on imbalanced datasets with transparent visual explanations.
Details
Motivation: Automated ROP screening faces challenges due to limited data, complex conditions requiring both structural staging and microvascular analysis, and poor generalization of current deep learning models on small imbalanced public datasets.
Method: Two specialized streams: MS-AQNet uses clinical contexts as dynamic query vectors for structural ridge localization, and VascuMIL encodes vascular topology maps with gated MIL for tortuosity detection, ensembled by a meta-learner.
Result: Achieved Macro F1-Score of 0.93 for ROP staging and AUC of 0.996 for Plus Disease detection on 188 infants (6,004 images), with transparent attention heatmaps and vascular threat maps.
Conclusion: Clinical metadata can guide visual search in medical AI, and architectural inductive bias can bridge the medical AI data gap, enabling effective screening with limited data.
Abstract: Retinopathy of Prematurity (ROP) is among the major causes of preventable childhood blindness. Automated screening remains challenging, primarily due to limited data availability and the complex condition involving both structural staging and microvascular abnormalities. Current deep learning models depend heavily on large private datasets and passive multimodal fusion, which commonly fail to generalize on small, imbalanced public cohorts. We thus propose the Context-Aware Asymmetric Ensemble Model (CAA Ensemble) that simulates clinical reasoning through two specialized streams. First, the Multi-Scale Active Query Network (MS-AQNet) serves as a structure specialist, utilizing clinical contexts as dynamic query vectors to spatially control visual feature extraction for localization of the fibrovascular ridge. Secondly, VascuMIL encodes Vascular Topology Maps (VMAP) within a gated Multiple Instance Learning (MIL) network to precisely identify vascular tortuosity. A synergistic meta-learner ensembles these orthogonal signals to resolve diagnostic discordance across multiple objectives. Tested on a highly imbalanced cohort of 188 infants (6,004 images), the framework attained State-of-the-Art performance on two distinct clinical tasks: achieving a Macro F1-Score of 0.93 for Broad ROP staging and an AUC of 0.996 for Plus Disease detection. Crucially, the system features `Glass Box’ transparency through counterfactual attention heatmaps and vascular threat maps, proving that clinical metadata dictates the model’s visual search. Additionally, this study demonstrates that architectural inductive bias can serve as an effective bridge for the medical AI data gap.
[650] Towards Segmenting the Invisible: An End-to-End Registration and Segmentation Framework for Weakly Supervised Tumour Analysis
Budhaditya Mukhopadhyay, Chirag Mandal, Pavan Tummala, Naghmeh Mahmoodian, Andreas Nürnberger, Soumick Chatterjee
Main category: eess.IV
TL;DR: A cross-modality weakly supervised framework for liver tumour segmentation that uses MRI-to-CT registration to generate pseudo-labels for CT images, but struggles when tumours are invisible in CT.
Details
Motivation: Liver tumours are clearly visible in pre-operative MRI but often invisible in intra-operative CT due to minimal contrast between pathological and healthy tissue, creating a need for cross-modality weak supervision approaches.
Method: Hybrid registration-segmentation framework combining MSCGUNet for inter-modal image registration with a UNet-based segmentation module for registration-assisted pseudo-label generation on CT images.
Result: Achieved Dice score of 0.72 for healthy liver anatomy on CHAOS dataset, but performance degraded to 0.16 Dice score on clinical data with tumours, revealing limitations when pathology lacks corresponding visual features in target modality.
Conclusion: Registration-based label transfer cannot compensate for absence of discriminative features in target modality; segmenting truly invisible pathology remains an open challenge despite spatial propagation of labels via registration.
Abstract: Liver tumour ablation presents a significant clinical challenge: whilst tumours are clearly visible on pre-operative MRI, they are often effectively invisible on intra-operative CT due to minimal contrast between pathological and healthy tissue. This work investigates the feasibility of cross-modality weak supervision for scenarios where pathology is visible in one modality (MRI) but absent in another (CT). We present a hybrid registration-segmentation framework that combines MSCGUNet for inter-modal image registration with a UNet-based segmentation module, enabling registration-assisted pseudo-label generation for CT images. Our evaluation on the CHAOS dataset demonstrates that the pipeline can successfully register and segment healthy liver anatomy, achieving a Dice score of 0.72. However, when applied to clinical data containing tumours, performance degrades substantially (Dice score of 0.16), revealing the fundamental limitations of current registration methods when the target pathology lacks corresponding visual features in the target modality. We analyse the “domain gap” and “feature absence” problems, demonstrating that whilst spatial propagation of labels via registration is feasible for visible structures, segmenting truly invisible pathology remains an open challenge. Our findings highlight that registration-based label transfer cannot compensate for the absence of discriminative features in the target modality, providing important insights for future research in cross-modality medical image analysis. Code and weights are available at: https://github.com/BudhaTronix/Weakly-Supervised-Tumour-Detection
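A small sketch of registration-assisted pseudo-label generation: an MRI tumour mask is warped into CT space with the dense displacement field produced by the registration network. The field and mask below are dummies, and the helper name is illustrative:

```python
# Warp an MRI label mask into CT space using a predicted 2D displacement field (illustrative).
import torch
import torch.nn.functional as F


def warp_mask(mask: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """mask: (B, 1, H, W) in MRI space; disp: (B, 2, H, W) displacement in normalized coords."""
    b, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)     # identity sampling grid
    grid = base + disp.permute(0, 2, 3, 1)                      # add the predicted displacement
    return F.grid_sample(mask, grid, mode="nearest", align_corners=True)  # pseudo-label in CT space


pseudo_ct_label = warp_mask(torch.randint(0, 2, (1, 1, 128, 128)).float(),
                            torch.zeros(1, 2, 128, 128))
```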
[651] Disc-Centric Contrastive Learning for Lumbar Spine Severity Grading
Sajjan Acharya, Pralisha Kansakar
Main category: eess.IV
TL;DR: A disc-centric approach for automated severity grading of lumbar spinal stenosis from MRI using contrastive pretraining with disc-level fine-tuning and auxiliary localization tasks.
Details
Motivation: To develop an automated method for grading lumbar spinal stenosis severity from MRI that focuses on disc-level features, reduces sensitivity to irrelevant image variations, and addresses class imbalance issues in medical imaging.
Method: Combines contrastive pretraining with disc-level fine-tuning using anatomically localized regions of interest per intervertebral disc. Includes auxiliary regression for disc localization and weighted focal loss to handle class imbalance.
Result: Achieves 78.1% balanced accuracy and reduces severe-to-normal misclassification rate to 2.13% compared to supervised training from scratch, though moderate severity detection remains challenging.
Conclusion: Focusing on disc-level features provides a practical approach for assessing lumbar spinal stenosis, with contrastive learning helping the model focus on meaningful features while reducing sensitivity to irrelevant image variations.
Abstract: This work examines a disc-centric approach for automated severity grading of lumbar spinal stenosis from sagittal T2-weighted MRI. The method combines contrastive pretraining with disc-level fine-tuning, using a single anatomically localized region of interest per intervertebral disc. Contrastive learning is employed to help the model focus on meaningful disc features and reduce sensitivity to irrelevant differences in image appearance. The framework includes an auxiliary regression task for disc localization and applies weighted focal loss to address class imbalance. Experiments demonstrate a 78.1% balanced accuracy and a reduced severe-to-normal misclassification rate of 2.13% compared with supervised training from scratch. Detecting discs with moderate severity can still be challenging, but focusing on disc-level features provides a practical way to assess the lumbar spinal stenosis.
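A minimal sketch of a class-weighted focal loss as described above: the cross-entropy term is scaled by (1 - p_t)^gamma and a per-class weight, so rare severity grades contribute more. Gamma and the weights are placeholders:

```python
# Weighted focal loss for imbalanced multi-class severity grading (illustrative).
import torch
import torch.nn.functional as F


def weighted_focal_loss(logits, targets, class_weights, gamma: float = 2.0) -> torch.Tensor:
    """logits: (B, C); targets: (B,) class indices; class_weights: (C,) tensor."""
    log_probs = F.log_softmax(logits, dim=-1)
    pt = log_probs.exp().gather(1, targets[:, None]).squeeze(1)      # probability of the true class
    ce = -log_probs.gather(1, targets[:, None]).squeeze(1)           # standard cross-entropy term
    w = class_weights[targets]                                       # per-class rebalancing weight
    return (w * (1 - pt) ** gamma * ce).mean()


loss = weighted_focal_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)),
                           class_weights=torch.tensor([1.0, 2.0, 4.0]))
```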
[652] A Contrastive Learning Foundation Model Based on Perfectly Aligned Sample Pairs for Remote Sensing Images
Hengtong Shen, Haiyan Gu, Haitao Li, Yi Yang, Agen Qiu
Main category: eess.IV
TL;DR: PerA is a self-supervised contrastive learning method specifically designed for remote sensing images that uses spatially disjoint masks instead of random cropping to create semantically aligned sample pairs, achieving efficient training and good performance on RS tasks.
Details
Motivation: Contrastive learning methods have succeeded in computer vision but require adaptation for remote sensing due to domain gaps. Existing methods face challenges with semantic inconsistency in RS images and memory inefficiency.
Method: PerA produces semantically aligned sample pairs by applying spatially disjoint masks to augmented images rather than random cropping. It ensures consistency between teacher and student networks while predicting learnable mask tokens, enabling sparse inputs and larger batch training.
Result: The method demonstrates higher memory efficiency, adaptability to uncurated RS data, and achieves performance comparable to previous state-of-the-art methods on multiple downstream tasks with limited model scale, using a collected dataset of ~5 million RS images.
Conclusion: PerA provides an effective self-supervised approach for remote sensing that addresses domain-specific challenges while maintaining efficiency and performance, contributing to practical RS interpretation.
Abstract: Self-Supervised Learning (SSL) enables us to pre-train foundation models without costly labeled data. Among SSL methods, Contrastive Learning (CL) methods are better at obtaining accurate semantic representations under noise interference. However, due to the significant domain gap, while CL methods have achieved great success in many computer vision tasks, they still require specific adaptation for Remote Sensing (RS) images. To this end, we present a novel self-supervised method called PerA, which produces all-purpose RS features through semantically Perfectly Aligned sample pairs. Specifically, PerA obtains features from sampled views by applying spatially disjoint masks to augmented images rather than random cropping. Our framework provides high-quality features by ensuring consistency between teacher and student and predicting learnable mask tokens. Compared to previous contrastive methods, our method demonstrates higher memory efficiency and can be trained with larger batches due to its sparse inputs. Additionally, the proposed method demonstrates remarkable adaptability to uncurated RS data and reduces the impact of potential semantic inconsistency. We also collect an unlabeled pre-training dataset, which contains about 5 million RS images. We conducted experiments on multiple downstream task datasets and achieved performance comparable to previous state-of-the-art methods with a limited model scale, demonstrating the effectiveness of our approach. We hope this work will contribute to practical remote sensing interpretation.
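A minimal sketch of how a "perfectly aligned" pair can be built: two spatially disjoint patch masks over the same augmented image, so teacher and student see complementary patches of identical content. The split ratio and masking strategy are assumptions:

```python
# Generate two non-overlapping patch masks over the same image (illustrative).
import numpy as np


def disjoint_patch_masks(n_patches: int, ratio: float = 0.5, rng=None):
    """Return two boolean masks over patch indices with no overlap."""
    rng = rng or np.random.default_rng()
    perm = rng.permutation(n_patches)
    k = int(n_patches * ratio)
    mask_a = np.zeros(n_patches, dtype=bool)
    mask_b = np.zeros(n_patches, dtype=bool)
    mask_a[perm[:k]] = True          # patches shown to the student
    mask_b[perm[k:]] = True          # patches shown to the teacher
    return mask_a, mask_b


a, b = disjoint_patch_masks(14 * 14)
assert not np.any(a & b)             # the two views never share a patch
```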
[653] Plug-and-play linear attention with provable guarantees for training-free image restoration
Srinivasan Kidambi, Karthik Palaniappan, Pravin Nair
Main category: eess.IV
TL;DR: PnP-Nystra is a training-free Nyström-based linear attention module that replaces MHSA in pretrained vision Transformers for image restoration tasks, providing significant speedups with minimal quality degradation.
Details
Motivation: Multi-head self-attention (MHSA) in vision Transformers has quadratic complexity that limits real-time deployment in resource-constrained environments, especially for image restoration tasks that require processing high-resolution images.
Method: Proposes PnP-Nystra, a Nyström-based linear attention module that approximates MHSA with provable kernel approximation guarantees. It’s designed as plug-and-play replacement for MHSA in pretrained window-based architectures like SwinIR, Uformer, and Dehazeformer without requiring finetuning.
Result: Achieves 1.8-3.6× speedups on NVIDIA RTX 4090 GPU and 1.8-7× speedups on CPU inference across denoising, deblurring, dehazing, and super-resolution tasks. Maintains closest output quality to original models compared to other training-free linear-attention baselines.
Conclusion: PnP-Nystra provides an effective training-free solution for accelerating vision Transformers in image restoration tasks while maintaining output quality, making it suitable for real-time and resource-constrained deployment.
Abstract: Multi-head self-attention (MHSA) is a key building block in modern vision Transformers, yet its quadratic complexity in the number of tokens remains a major bottleneck for real-time and resource-constrained deployment. We present PnP-Nystra, a training-free Nyström-based linear attention module designed as a plug-and-play replacement for MHSA in pretrained image restoration Transformers, with provable kernel approximation error guarantees. PnP-Nystra integrates directly into window-based architectures such as SwinIR, Uformer, and Dehazeformer, yielding efficient inference without finetuning. Across denoising, deblurring, dehazing, and super-resolution on images, PnP-Nystra delivers $1.8$–$3.6\times$ speedups on an NVIDIA RTX 4090 GPU and $1.8$–$7\times$ speedups on CPU inference. Compared with the strongest training-free linear-attention baselines we evaluate, our method incurs the smallest quality drop and stays closest to the original model’s outputs.
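A rough sketch of generic Nyström-style attention: the full softmax attention map is approximated through a small set of landmark queries and keys, reducing the N x N computation to three small factors. The pooling-based landmark choice here is an assumption; PnP-Nystra's exact construction may differ:

```python
# Nyström approximation of softmax attention with pooled landmarks (illustrative).
import torch


def nystrom_attention(q, k, v, n_landmarks: int = 32):
    """q, k, v: (B, N, d) with N divisible by n_landmarks. Approximates softmax(qk^T/sqrt(d))v."""
    b, n, d = q.shape
    scale = d ** -0.5
    # Landmarks by average-pooling the sequence into n_landmarks segments.
    q_l = q.reshape(b, n_landmarks, n // n_landmarks, d).mean(dim=2)
    k_l = k.reshape(b, n_landmarks, n // n_landmarks, d).mean(dim=2)
    f = torch.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)       # (B, N, m)
    a = torch.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)     # (B, m, m)
    bmat = torch.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)    # (B, m, N)
    return f @ torch.linalg.pinv(a) @ (bmat @ v)                       # (B, N, d)


out = nystrom_attention(torch.randn(1, 64, 32), torch.randn(1, 64, 32), torch.randn(1, 64, 32))
```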
[654] Noisy MRI Reconstruction via MAP Estimation with an Implicit Deep-Denoiser Prior
Nikola Janjušević, Amirhossein Khalilian-Gourtani, Yao Wang, Li Feng
Main category: eess.IV
TL;DR: Implicit-MAP (ImMAP) is a diffusion-based MRI reconstruction framework that integrates acquisition noise models into maximum a posteriori formulation for more reliable and interpretable accelerated MRI reconstruction under realistic noise conditions.
Details
Motivation: Current diffusion models for MRI reconstruction lack explicit links to MRI physics and are sensitive to measurement noise, limiting their practical reliability. There's a need for reconstruction methods that handle realistic acquisition noise while maintaining interpretability.
Method: ImMAP integrates acquisition noise models directly into a maximum a posteriori (MAP) formulation, building on the stochastic ascent method of Kadkhodaie et al. and generalizing it to handle MRI encoding operators and realistic measurement noise.
Result: ImMAP consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods across both simulated and real noisy datasets.
Conclusion: ImMAP establishes a more reliable and interpretable diffusion-based reconstruction framework by clarifying the practical behavior and limitations of diffusion models under realistic noise conditions.
Abstract: Accelerating magnetic resonance imaging (MRI) remains challenging, particularly under realistic acquisition noise. While diffusion models have recently shown promise for reconstructing undersampled MRI data, many approaches lack an explicit link to the underlying MRI physics, and their parameters are sensitive to measurement noise, limiting their reliability in practice. We introduce Implicit-MAP (ImMAP), a diffusion-based reconstruction framework that integrates the acquisition noise model directly into a maximum a posteriori (MAP) formulation. Specifically, we build on the stochastic ascent method of Kadkhodaie et al. and generalize it to handle MRI encoding operators and realistic measurement noise. Across both simulated and real noisy datasets, ImMAP consistently outperforms state-of-the-art deep learning (LPDSNet) and diffusion-based (DDS) methods. By clarifying the practical behavior and limitations of diffusion models under realistic noise conditions, ImMAP establishes a more reliable and interpretable diffusion-based reconstruction framework.
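A loose sketch of one denoiser-driven MAP-style update of the kind the abstract alludes to: the denoiser residual serves as an implicit prior (score) step, and a data-consistency gradient ties the iterate to the measurements y = Ax + n. All operators, step sizes, and the toy denoiser below are stand-ins, not ImMAP's actual algorithm:

```python
# Generic denoiser-prior + data-consistency iteration for a linear inverse problem (illustrative).
import torch


def map_step(x, y, forward_op, adjoint_op, denoiser, step=0.1, lam=1.0, noise_std=0.0):
    """One iteration: prior step from the denoiser residual + weighted data-consistency step."""
    prior_grad = denoiser(x) - x                         # pulls x toward the image manifold
    data_grad = adjoint_op(y - forward_op(x))            # gradient of -||y - A x||^2 / 2
    x = x + step * (prior_grad + lam * data_grad)
    if noise_std > 0:                                     # optional stochastic perturbation
        x = x + noise_std * torch.randn_like(x)
    return x


# Dummy identity "measurement" operator just to show the loop structure.
y = torch.randn(1, 1, 64, 64)
x = torch.zeros_like(y)
identity = lambda z: z
smooth_denoiser = lambda z: 0.5 * z + 0.5 * z.mean()      # stand-in for a learned denoiser
for _ in range(20):
    x = map_step(x, y, identity, identity, smooth_denoiser, step=0.2, lam=1.0)
```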
[655] Optimized $k$-means color quantization of digital images in machine-based and human perception-based colorspaces
Ranjan Maitra
Main category: eess.IV
TL;DR: Color quantization performance comparison of k-means algorithm across RGB, CIE-XYZ, and CIE-LUV/HCL colorspaces using VIF quality assessment on 148 diverse images.
Details
Motivation: To investigate which colorspace (RGB, CIE-XYZ, or CIE-LUV/HCL) yields the best performance for k-means color quantization, as previous studies suggested human perception-based colorspaces might outperform machine-based RGB space.
Method: Applied k-means color quantization at four quantization levels to 148 varied digital images across three colorspace families: RGB (machine-based), CIE-XYZ, and CIE-LUV/CIE-HCL (human perception-based). Used Visual Information Fidelity (VIF) measure to numerically assess quality of quantized images.
Result: In about half of cases, k-means performed best in RGB space. For higher quantization levels (k), CIE-XYZ colorspace usually performed better. For lower k values, CIE-LUV colorspace sometimes performed best. Analysis of hue, chromaticity, and luminance distributions provided nuanced characterization of which colorspace works best for different image types.
Conclusion: No single colorspace universally best for k-means color quantization; performance depends on quantization level and image characteristics. RGB performs well in many cases, but CIE-XYZ excels at higher quantization levels, while CIE-LUV works better for lower quantization levels.
Abstract: Color quantization represents an image using a fraction of its original number of colors while only minimally losing its visual quality. The $k$-means algorithm is commonly used in this context, but has mostly been applied in the machine-based RGB colorspace composed of the three primary colors. However, some recent studies have indicated its improved performance in human perception-based colorspaces. We investigated the performance of $k$-means color quantization at four quantization levels in the RGB, CIE-XYZ, and CIE-LUV/CIE-HCL colorspaces, on 148 varied digital images spanning a wide range of scenes, subjects and settings. The Visual Information Fidelity (VIF) measure numerically assessed the quality of the quantized images, and showed that in about half of the cases, $k$-means color quantization is best in the RGB space, while at other times, and especially for higher quantization levels ($k$), the CIE-XYZ colorspace is where it usually does better. There are also some cases, especially at lower $k$, where the best performance is obtained in the CIE-LUV colorspace. Further analysis of the performances in terms of the distributions of the hue, chromaticity and luminance in an image presents a nuanced perspective and characterization of the images for which each colorspace is better for $k$-means color quantization.
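A minimal sketch of k-means color quantization in a chosen colorspace: convert, cluster the pixel colors, map each pixel to its cluster centroid, and convert back to RGB. The use of scikit-learn/scikit-image and the specific parameters are assumptions, not the paper's implementation:

```python
# k-means color quantization with an optional colorspace change before clustering.
import numpy as np
from sklearn.cluster import KMeans
from skimage import color


def quantize(image_rgb: np.ndarray, k: int = 16, space: str = "luv") -> np.ndarray:
    """image_rgb: (H, W, 3) float array in [0, 1]; returns the k-color image in RGB."""
    to, back = {"rgb": (lambda x: x, lambda x: x),
                "xyz": (color.rgb2xyz, color.xyz2rgb),
                "luv": (color.rgb2luv, color.luv2rgb)}[space]
    pixels = to(image_rgb).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
    quantized = km.cluster_centers_[km.labels_].reshape(image_rgb.shape)
    return np.clip(back(quantized), 0.0, 1.0)


out = quantize(np.random.rand(64, 64, 3), k=8, space="xyz")
```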
[656] CompSRT: Quantization and Pruning for Image Super Resolution Transformers
Dorsa Zeinali, Hailing Wang, Yitian Zhang, Yun Fu
Main category: eess.IV
TL;DR: CompSRT introduces Hadamard-based quantization with scalar decomposition for compressing SwinIR-light super-resolution models, achieving state-of-the-art compression performance with up to 1.53 dB gains.
Details
Motivation: There's a significant performance gap between compressed and full-precision image super-resolution models, and while Hadamard transforms have shown promise in LLM quantization for reducing outliers, their empirical effects on super-resolution models need deeper understanding.
Method: Analyze weight/activation distributions in SwinIR-light, apply Hadamard-based quantization to reduce value ranges and increase values around zero, and introduce scalar decomposition with two trainable parameters for improved compression.
Result: CompSRT achieves statistically significant improvements over SOTA with gains up to 1.53 dB, better visual quality with reduced blurriness, and shows compatibility with pruning (40% weight removal) for 6.67-15% bit reduction while maintaining comparable performance.
Conclusion: Hadamard-based quantization with scalar decomposition effectively compresses image super-resolution transformers, bridging the gap between compressed and full-precision models while maintaining visual quality.
Abstract: Model compression has become an important tool for making image super resolution models more efficient. However, the gap between the best compressed models and the full precision model still remains large, and a need for deeper understanding of compression theory on more performant models remains. Prior research on quantization of LLMs has shown that Hadamard transformations lead to weights and activations with reduced outliers, which leads to improved performance. We argue that while the Hadamard transform does reduce the effect of outliers, an empirical analysis of how the transform functions remains needed. By studying the distributions of weights and activations of SwinIR-light, we show with statistical analysis that the lower errors are caused by the Hadamard transform's ability to reduce the ranges and increase the proportion of values around $0$. Based on these findings, we introduce CompSRT, a more performant way to compress the image super resolution transformer network SwinIR-light. We perform Hadamard-based quantization, and we also perform scalar decomposition to introduce two additional trainable parameters. Our quantization performance statistically significantly surpasses the SOTA in metrics, with gains as large as 1.53 dB, and visibly improves visual quality by reducing blurriness at all bitwidths. At $3$-$4$ bits, to show our method is compatible with pruning for increased compression, we also prune $40\%$ of weights and show that we can achieve a $6.67$–$15\%$ reduction in bits per parameter with comparable performance to SOTA.
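A rough sketch of generic Hadamard-based weight quantization: rotate the weight matrix with an orthonormal Hadamard transform (which spreads outliers and concentrates values near zero), quantize uniformly, then rotate back. Bit-width, scaling, and the square-matrix assumption are illustrative and do not include the paper's scalar decomposition:

```python
# Hadamard-rotated uniform quantization of a weight matrix (illustrative).
import numpy as np
from scipy.linalg import hadamard


def hadamard_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """w: (d, d) with d a power of two. Returns the dequantized approximation of w."""
    d = w.shape[0]
    h = hadamard(d) / np.sqrt(d)                     # orthonormal Hadamard matrix
    w_rot = h @ w @ h.T                              # rotated weights: fewer outliers, tighter range
    scale = np.abs(w_rot).max() / (2 ** (bits - 1) - 1)
    w_q = np.clip(np.round(w_rot / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return h.T @ (w_q * scale) @ h                   # undo the rotation after dequantization


w = np.random.randn(64, 64)
w_hat = hadamard_quantize(w, bits=4)
print(np.abs(w - w_hat).mean())                      # average reconstruction error
```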
[657] EchoJEPA: A Latent Predictive Foundation Model for Echocardiography
Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Ahmadreza Attarpour, River Jiang, Brana Sooriyakanthan, Maala Sooriyakanthan, Heather Whitney, Jeremy Slivnick, Barry Rubin, Wendy Tsang, Bo Wang
Main category: eess.IV
TL;DR: EchoJEPA is a foundation model for echocardiography that uses latent predictive objectives to learn robust anatomical representations while ignoring ultrasound speckle noise, achieving state-of-the-art performance on cardiac measurements with remarkable generalization capabilities.
Details
Motivation: Current foundation models for echocardiography struggle to separate anatomical signals from inherent ultrasound noise like speckle and acquisition artifacts, limiting their robustness and generalization in medical AI applications.
Method: Trained on 18 million echocardiograms across 300K patients using a latent predictive objective (JEPA framework) to learn robust anatomical representations that ignore speckle noise, validated through multi-view probing with frozen backbones.
Result: Outperforms SOTA baselines by ~20% in LVEF estimation and 17% in RVSP estimation; achieves 79% view classification with only 1% labeled data vs 42% for baselines with 100%; degrades only 2% under acoustic perturbations vs 17% for competitors; zero-shot pediatric performance surpasses fine-tuned baselines.
Conclusion: Latent prediction is a superior paradigm for robust, generalizable medical AI, enabling foundation models that effectively disentangle anatomical signals from ultrasound noise and demonstrate exceptional generalization across patient populations and acquisition conditions.
Abstract: Foundation models for echocardiography often struggle to disentangle anatomical signal from the stochastic speckle and acquisition artifacts inherent to ultrasound. We present EchoJEPA, a foundation model trained on 18 million echocardiograms across 300K patients, representing the largest pretraining corpus for this modality to date. By leveraging a latent predictive objective, EchoJEPA learns robust anatomical representations that ignore speckle noise. We validate this using a novel multi-view probing framework with frozen backbones, where EchoJEPA outperforms state-of-the-art baselines by approximately 20% in left ventricular ejection fraction (LVEF) estimation and 17% in right ventricular systolic pressure (RVSP) estimation. The model also exhibits remarkable sample efficiency, reaching 79% view classification accuracy with only 1% of labeled data versus 42% for the best baseline trained on 100%. Crucially, EchoJEPA demonstrates superior generalization, degrading by only 2% under physics-informed acoustic perturbations compared to 17% for competitors. Most remarkably, its zero-shot performance on pediatric patients surpasses fully fine-tuned baselines, establishing latent prediction as a superior paradigm for robust, generalizable medical AI.